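A blog about doing things with R and reading books

I practice R using data from e-Stat (the portal site of official Japanese government statistics), the OECD, UCI, and other sources. I occasionally post reading notes as well.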

Kaggle's Gym Members Exercise Dataset Analysis with R 4 - Statistical Inference with infer package: 2-Sample mean t-Test

Generated by Bing Image Creator: Light blue forest landscape, yellow and green flowers, photo
www.crosshyou.info

This post is a continuation of the post linked above.

In this post, I will do statistical inference with the infer package.

I refer to Tidy Statistical Inference • infer.

In the previous post, I made the graph below.

The graph shows that bmi differs by gender, but the two distributions overlap considerably.

So, let's do statistical inference.

First, I load the infer package.

Then, I will follow the 2-Sample t-Test section of the Tidy t-Tests with infer • infer article.

Let's begin by calculating the observed statistic.

The difference in mean bmi by gender (Male minus Female) is 4.16. This can easily be confirmed as shown below.


Next, I will generate the null distribution.
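To illustrate the idea behind a permutation-based null distribution, here is a minimal sketch in base R. It does not use the gym data: the `bmi` and `gender` vectors are simulated stand-ins with made-up means and sizes, and `diff_in_means()` is a hypothetical helper written only for this illustration. Shuffling the gender labels breaks any link between bmi and gender, so the recomputed differences show what to expect if gender did not matter.

```r
# Simulated stand-in data (NOT the Kaggle gym data); all values are made up.
set.seed(1)
gender <- rep(c("Male", "Female"), times = c(60, 50))
bmi <- c(rnorm(60, mean = 27, sd = 6), rnorm(50, mean = 23, sd = 5))

# Hypothetical helper: difference in group means, Male minus Female
diff_in_means <- function(x, g) {
  mean(x[g == "Male"]) - mean(x[g == "Female"])
}

# Statistic on the data as observed
observed <- diff_in_means(bmi, gender)

# Shuffle the gender labels and recompute the statistic, 1000 times.
# The resulting values approximate the null distribution.
null_stats <- replicate(1000, diff_in_means(bmi, sample(gender)))

observed
summary(null_stats)
```

The shuffled differences should be centered near zero, which is why an observed difference far from zero counts as evidence against the null hypothesis.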

Then, I visualize the null distribution and the observed statistic.

You'll see that our observed statistic is far away from the null distribution.

Let's calculate the p-value.

Given the p-value, I can say that mean bmi differs by gender.
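As a rough sketch of how a two-sided p-value can be read off a simulated null distribution, the base R function below doubles the smaller tail proportion and caps the result at 1. This is one common convention and I am assuming it here for illustration; `two_sided_p_value()` is a hypothetical helper, not a function from infer, and the numbers in `toy_null` are made up.

```r
# Two-sided p-value from a vector of simulated null statistics.
# Assumed convention: double the smaller tail proportion, capped at 1.
two_sided_p_value <- function(null_stats, obs_stat) {
  left  <- mean(null_stats <= obs_stat)  # share of null values at or below obs
  right <- mean(null_stats >= obs_stat)  # share of null values at or above obs
  min(1, 2 * min(left, right))
}

# Toy null distribution (made-up numbers, for illustration only)
toy_null <- c(-2, -1, 0, 1, 2, 3)

p_inside  <- two_sided_p_value(toy_null, obs_stat = 2)   # 2 * (2/6)
p_outside <- two_sided_p_value(toy_null, obs_stat = 10)  # beyond every null value

p_inside
p_outside
```

When the observed statistic lies beyond every simulated value, the estimate comes out as 0. That should be read as "smaller than roughly 1 divided by the number of reps", not as an exact zero.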

Next, let's calculate a confidence interval. First, I generate a bootstrap distribution.

Then, I calculate the 99% confidence interval.
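For readers curious about what a percentile bootstrap interval is, here is a minimal base R sketch. It uses simulated stand-in data (not the gym data), and it assumes that each bootstrap replicate resamples whole rows with replacement. The 99% percentile interval is then simply the 0.5% and 99.5% quantiles of the bootstrap statistics.

```r
# Simulated stand-in data (NOT the Kaggle gym data); all values are made up.
set.seed(1)
gender <- rep(c("Male", "Female"), times = c(60, 50))
bmi <- c(rnorm(60, mean = 27, sd = 6), rnorm(50, mean = 23, sd = 5))
n <- length(bmi)

# Each replicate: resample row indices with replacement, then recompute
# the difference in means (Male minus Female) on the resampled rows.
boot_stats <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  b <- bmi[idx]
  g <- gender[idx]
  mean(b[g == "Male"]) - mean(b[g == "Female"])
})

# 99% percentile interval: cut off 0.5% in each tail
level <- 0.99
ci <- quantile(boot_stats, probs = c((1 - level) / 2, 1 - (1 - level) / 2))
ci
```

A higher confidence level cuts off less in each tail, so a 99% interval is wider than a 95% interval built from the same bootstrap distribution.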

Finally, I visualize the bootstrap distribution and the confidence interval.

Based on the statistical inference above, I am 99% confident that the difference in mean bmi by gender is roughly 3.10 to 5.10, and I am almost 100% sure that mean bmi differs by gender.

That's it. Thank you!

The next post is

www.crosshyou.info

To read from the first post,

www.crosshyou.info

I used the code below in this post.

#
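# load infer package
# (dplyr is also needed for group_by() and summarize() below;
#  gym_raw is assumed to have been created in the earlier posts of this series)
library(infer)
library(dplyr)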
#
# Calculate the observed statistic
observed_statistic <- gym_raw |> 
  specify(bmi ~ gender) |> 
  calculate(stat = "diff in means", order = c("Male", "Female"))
observed_statistic
#
# confirm difference
gym_raw |> 
  group_by(gender) |> 
  summarize(mean_bmi = mean(bmi))
26.9 - 22.7  # difference of the rounded group means, about 4.2
#
# Next, I will generate the null distribution with randomization
set.seed(123)
null_dist_2_sample <- gym_raw |> 
  specify(bmi ~ gender) |> 
  hypothesize(null = "independence") |> 
  generate(reps = 1000, type = "permute") |> 
  calculate(stat = "diff in means", order = c("Male", "Female"))
#
# Visualize the null distribution and test statistic
null_dist_2_sample |> 
  visualize() +
  shade_p_value(observed_statistic,
                direction = "two-sided")
#
# Calculate p-value
p_value_2_sample <- null_dist_2_sample |> 
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")
p_value_2_sample
#
# generate bootstrap distribution
set.seed(123)
boot_dist <- gym_raw |> 
  specify(bmi ~ gender) |> 
  generate(reps = 1000, type = "bootstrap") |> 
  calculate(stat = "diff in means", order = c("Male", "Female"))
#
# calculate confidence interval 99% level
percentile_ci <- get_confidence_interval(boot_dist,
                                         level = 0.99)
percentile_ci
#
# visualize bootstrap distribution with confidence interval
boot_dist |> 
  visualize() +
  shade_confidence_interval(endpoints = percentile_ci)
#