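A blog about doing things with R and reading books

I practice R using data from e-Stat (the portal site of Japanese government statistics), the OECD, UCI, and other sources. I also post reading notes from time to time.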

Kaggle's Gym Members Exercise Dataset Analysis with R 6 - Logistic Regression and Decision Tree to see Male or Female.

Generated by Bing Image Creator: Long-wide landscape view of large fall and river, flowering dandelions, photo 

www.crosshyou.info

This post is a continuation of the post linked above.

In this post, I will practice classification: predicting whether a gym member is Male or Female.

First, I make a new data frame that has a dummy variable (1 for Male, 0 otherwise).

Then, I make a training set and a test set.
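The row index used for the split (divide_index in the code for this post) is not defined here. As a rough sketch of one common way to build such an index, assuming a simple random split, the toy data frame, the 60% proportion, the seed, and the names toy, toy_index, toy_train and toy_test below are all hypothetical and not taken from this series:

```r
# Toy data standing in for gym_gender (hypothetical values)
set.seed(123)
toy <- data.frame(
  male   = rbinom(100, 1, 0.5),
  height = rnorm(100, mean = 1.7, sd = 0.1)
)

# Hypothetical split index: 60% of the row numbers, sampled without replacement
toy_index <- sample(nrow(toy), size = floor(0.6 * nrow(toy)))

toy_train <- toy[toy_index, ]   # rows listed in the index
toy_test  <- toy[-toy_index, ]  # all remaining rows

nrow(toy_train)
nrow(toy_test)
```

Negative indexing with -toy_index keeps every row that is not in the index, so the two sets never overlap.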

Before doing the classification practice, let's check whether there is a significant difference in the male proportion between the training set and the test set.

The p-value is 0.4978, so I cannot reject the null hypothesis that gym_train and gym_test have the same male proportion.
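The code for this post runs t.test() on the 0/1 dummy. A two-sample test of proportions with prop.test() is another way to ask the same question. Here is a small sketch with made-up counts (the numbers below are hypothetical, not the real gym data):

```r
# Hypothetical counts of males and group sizes for a training and a test set
males <- c(300, 210)
sizes <- c(580, 390)

res <- prop.test(x = males, n = sizes)
res$estimate   # sample proportion of males in each group
res$p.value    # a large p-value means no evidence that the proportions differ
```

With a sample this large, the two tests should usually lead to the same conclusion.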

All right, let's make a logit model.

I remove unimportant variables with step().

In the resulting model, larger age, height, calories, water, and bmi imply Male.
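To show how the direction of an effect is read from a logit model, here is a minimal sketch using R's built-in mtcars data instead of the gym data (the model am ~ wt is only an illustration). A positive coefficient raises the log-odds of the outcome being 1, and a negative one lowers it:

```r
# Built-in mtcars data: am is 1 for manual, 0 for automatic transmission
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)        # log-odds scale: the sign gives the direction of the effect
exp(coef(fit))   # odds ratios: above 1 raises the odds of am == 1, below 1 lowers them
```

Here the coefficient on wt is negative, so heavier cars are less likely to be manual. In the gym model, the variables listed above would be the ones with positive coefficients.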

Let's predict.

Let's make a confusion matrix using the caret package to see how well the logit model predicts.

The logit model predicts correctly with 97.44% accuracy.

Among the 390 test observations, only 10 were misclassified.
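Accuracy is simply the share of correct predictions, so (390 - 10) / 390 = 0.9744. The sketch below shows the same calculation from a confusion table built with base R's table(); the label vectors are toy values, not the real predictions:

```r
# Toy actual and predicted labels (hypothetical, 1 = Male, 0 = Female)
actual    <- c(1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 0, 1, 0, 1, 1)

# Rows are predictions, columns are the true labels
conf_tab <- table(predicted = predicted, actual = actual)
conf_tab

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
accuracy

# The same arithmetic for the result reported in this post
(390 - 10) / 390
```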

Next, let's make a decision tree model.

Find the best CP (complexity parameter), that is, the one with the smallest cross-validated error.

Prune the tree model with the best CP.
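As a self-contained illustration of these two steps, the sketch below grows a large tree on the kyphosis data that ships with rpart (not the gym data), picks the CP with the smallest xerror from the cptable, and prunes. The control values mirror the ones used in this post, and the object names are only for this example:

```r
library(rpart)

set.seed(333)  # the cross-validation folds are random
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             control = rpart.control(minsplit = 3, cp = 1e-8))

# cptable has one row per candidate tree size; xerror is the cross-validated error
cp_tab  <- fit$cptable
min_row <- which.min(cp_tab[, "xerror"])
best_cp <- cp_tab[min_row, "CP"]

# Cut the tree back to the size chosen by cross-validation
pruned <- prune(fit, cp = best_cp)

# The pruned tree is never larger than the full tree
nrow(pruned$frame) <= nrow(fit$frame)
```

Growing with a tiny cp first and pruning afterwards lets cross-validation choose the tree size, rather than stopping the growth early.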

Draw a decision tree graph.

The decision tree uses water, weight, height, fatpct and hours.

Predict with the test set.

Let's see the confusion matrix.

The accuracy is 96.92%, slightly below the logit model's 97.44%.

That's it. Thank you!

To read this series from the first post, see:

www.crosshyou.info

 

The code for this post is below.

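#
# packages this code needs
library(dplyr)       # mutate, select, if_else, glimpse
library(rpart)       # rpart, prune
library(rpart.plot)  # rpart.plot
#
# make dummy variable for Male (1 = Male, 0 = otherwise)
# gym_raw is not created in this post
gym_gender <- gym_raw |>
  mutate(male = if_else(gender == "Male", 1, 0)) |>
  select(-gender)
glimpse(gym_gender)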
#
# divide training set and test set
# divide_index (row numbers of the training set) is not defined in this post
gym_train <- gym_gender[divide_index, ]
gym_test <- gym_gender[-divide_index, ]
#
# compare male proportion in train and test (Welch two-sample t-test on the 0/1 dummy)
t.test(gym_train$male, gym_test$male)
#
# Logit model
logit_mod <- glm(male ~ ., data = gym_train,
                 family = binomial)
summary(logit_mod)
#
# remove unimportant variables by stepwise selection (step() uses AIC by default)
logit_mod <- step(logit_mod, trace = FALSE)
summary(logit_mod)
#
# predict
logit_pred <- predict(logit_mod, newdata = gym_test,
                      type = "response")
logit_pred <- if_else(logit_pred > 0.5, 1, 0)
#
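# Confusion Matrix (predictions first, true labels second)
# note: by default caret treats the first level, "0", as the positive class
caret::confusionMatrix(factor(logit_pred),
                       factor(gym_test$male))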
#
# make decision tree
set.seed(333)
dec_tree <- rpart(male ~ .,
                  data = gym_train,
                  method = "class",
                  minsplit = 3,
                  cp = 1e-8)
#
# best CP: the row of cptable with the smallest cross-validated error (xerror)
min_row <- which.min(dec_tree$cptable[ , "xerror"])
best_CP <- dec_tree$cptable[min_row, "CP"]
best_CP
#
# prune the tree
dec_tree <- prune(dec_tree, cp = best_CP)
#
# Decision Tree Graph
rpart.plot(dec_tree,
           type = 1)
#
# prediction
tree_pred <- predict(dec_tree, newdata = gym_test,
                     type = "class")
#
# Confusion Matrix
caret::confusionMatrix(factor(tree_pred),
                       factor(gym_test$male))
#