A blog where I do things with R and read books

I practice R with data from the Japanese government statistics portal (e-Stat), the OECD, UCI, and other sources. From time to time, I also post reading notes.

Kaggle's Gym Members Exercise Dataset Analysis with R 5 - Linear Regression and Tree Model Regression to Forecast Calories

Generated by Bing Image Creator: long landscape photograph of a grass field in the breeze under a blue sky

 

www.crosshyou.info

This post is a continuation of the post linked above.

This morning, I will do a linear regression and a tree model regression to forecast calories.

First, let's remake the histogram of calories to recall its distribution.

calories appears to be roughly normally distributed.
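
Besides the histogram, a quick visual check of normality is a Q-Q plot with base R (a minimal sketch; gym_raw is the data frame prepared in the previous post):

# Q-Q plot of calories against a normal distribution
qqnorm(gym_raw$calories)
qqline(gym_raw$calories, col = "red")

If the points stay close to the reference line, the normal approximation looks reasonable.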

Before forecasting, I will divide the dataset into a training set and a test set.

Let's check whether calories in train_set and test_set differ.

The p-value is 0.771, so there is no statistically significant difference between them.

Then, let's do a linear regression.

The p-value in the bottom line is less than 2.2e-16, so the model as a whole is statistically significant.

I use the step() function, which performs stepwise selection by AIC, to remove unimportant variables.

age and Female have a negative effect on calories, while weight, height, avgbpm, hours, and bmi have a positive effect. That agrees with common sense.
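
As a quick check on those signs, confidence intervals for the remaining coefficients can be printed (a minimal sketch, assuming the lm_mod object from the code at the end of this post):

# 95% confidence intervals for the coefficients kept by step()
confint(lm_mod)

Intervals that do not cross zero support the sign interpretation above.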

I use the predict() function to predict calories for test_set.

Let's make a scatter plot of predicted calories versus actual calories.

If the predicted and actual values were exactly the same, all points would lie on the red line. I think the linear regression did a good job.

I calculate the RMSE (Root Mean Squared Error).

It is very good.

The RMSE is about 4.3% of the average value of calories.
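
For reference, the RMSE here is sqrt(mean((predicted - actual)^2)). A small helper function makes that explicit (a minimal sketch; the helper name rmse() is my own, and lm_pred and test_set come from the code below):

# root mean squared error of a prediction
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}
rmse(lm_pred, test_set$calories)
rmse(lm_pred, test_set$calories) / mean(gym_raw$calories) # relative to the mean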

Next, let's do a tree model regression.

I load the rpart package.

Then, I use the rpart() function with a small minsplit and a very small cp so that the tree grows deep.

I find the best cp (complexity parameter), the one with the smallest cross-validation error in the cptable.
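
Besides picking the row of the cptable with the smallest xerror, the cross-validated error for each cp value can be inspected directly (a sketch, run on the unpruned rpart_mod):

# table and plot of cross-validated error by cp
printcp(rpart_mod)
plotcp(rpart_mod)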

I prune rpart_mod with best_cp.

To draw the regression tree, I load the rpart.plot package and use the rpart.plot() function.

Oh, it is a very complicated tree.
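
Out of curiosity, the size of the tree can be checked by counting its terminal nodes (a sketch; rpart stores one row per node in the frame component of the fitted object):

# number of terminal nodes (leaves) in the pruned tree
sum(rpart_mod$frame$var == "<leaf>")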

Anyway, I use the predict() function.

Let's make a scatter plot.

It looks worse than the linear regression scatter plot.

How about the RMSE?

The RMSE of the tree model regression is 75.51, about 8.3% of the average calories, so the linear regression did better here.
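
To put the two models side by side, a small summary table can be built from the values computed in the code below (a minimal sketch, assuming lm_rmse and rpart_rmse):

# compare the RMSE of the two models
data.frame(
  model = c("linear regression", "regression tree"),
  rmse = c(lm_rmse, rpart_rmse),
  rmse_vs_mean = c(lm_rmse, rpart_rmse) / mean(gym_raw$calories)
)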

That's it. Thank you!

Next post is

www.crosshyou.info

 

To read from the first post,

www.crosshyou.info

 

The code for this post is below.

#
# calories distribution
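# gym_raw and ggplot2 are assumed to be already loaded (see the previous post)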
ggplot(gym_raw, aes(x = calories)) +
  geom_histogram(color = "white")
#
# divide gym_raw into training and test set
set.seed(1)
divide_index <- sample(1:nrow(gym_raw), 0.6 * nrow(gym_raw),
                       replace = FALSE)
train_set <- gym_raw[divide_index, ]
test_set <- gym_raw[-divide_index, ]
#
# compare both calories
t.test(train_set$calories, test_set$calories)
#
# First model: linear regression
lm_mod <- lm(calories ~ ., data = train_set)
summary(lm_mod)
#
# delete non important variables
lm_mod <- step(lm_mod, trace = FALSE)
summary(lm_mod)
#
# predict with test_set
lm_pred <- predict(lm_mod, newdata = test_set)
#
# actual vs predict
plot(lm_pred, test_set$calories)
abline(a = 0, b = 1, col = "red")
#
# calculate RMSE
lm_rmse <- (lm_pred - test_set$calories)^2 |> 
  mean() |> 
  sqrt()
lm_rmse
#
# RMSE vs. mean calories
lm_rmse / mean(gym_raw$calories)
#
# load rpart package
library(rpart)
#
# make regression tree
set.seed(1)
rpart_mod <- rpart(calories ~ .,
                   data = train_set,
                   minsplit = 3,
                   cp = 1e-8)
#
# which is the best CP?
min_row <- which.min(rpart_mod$cptable[ , "xerror"])
best_cp <- rpart_mod$cptable[min_row, "CP"]
best_cp
#
# prune the model
rpart_mod <- prune(rpart_mod, cp = best_cp)
#
# load rpart.plot package
library(rpart.plot)
#
# plot the model
rpart.plot(rpart_mod,
           type = 1)
#
# predict
rpart_pred <- predict(rpart_mod, newdata = test_set)
#
# plot prediction and actual
plot(rpart_pred, test_set$calories)
abline(a = 0, b = 1, col = "red")
#
# calculate RMSE
rpart_rmse <- (rpart_pred - test_set$calories)^2 |> 
  mean() |> 
  sqrt()
rpart_rmse
#
# RMSE / mean
rpart_rmse / mean(gym_raw$calories)
#