

World Bank Income share held by lowest 20% data analysis 5 - theory-based regression analysis vs. simulation-based regression analysis.

This post follows the previous post.

In this post, I will do regression analysis in both the traditional (theory-based) way and the modern (simulation-based) way.

Before doing the analysis, I load the "infer" package.

Let's start with the traditional (theory-based) linear regression analysis.

I use the lm() function.
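
The call itself is not shown here, so below is a minimal sketch of how such a model could be fitted. The data frame df_sim is made-up stand-in data, not the World Bank data; only the column names Y2006 and Chg_Net come from this post, and the choice of Chg_Net as the response is my assumption.

```r
# Hypothetical stand-in data: Y2006 is the 2006 share, Chg_Net the net change.
set.seed(123)
df_sim <- data.frame(Y2006 = runif(50, min = 2, max = 10))
df_sim$Chg_Net <- 1.9 - 0.22 * df_sim$Y2006 + rnorm(50, sd = 0.8)

# Fit Chg_Net as a linear function of Y2006 and print the coefficient table.
model <- lm(Chg_Net ~ Y2006, data = df_sim)
summary(model)
```

Because the data are simulated, the estimates will not match the numbers reported in this post.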

In the traditional way, we need to check for heteroskedasticity. I use the bptest() function from the lmtest package.

The p-value is greater than 0.05, so we cannot reject the null hypothesis of constant variance; there is no evidence of heteroskedasticity.
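
As a rough illustration of this check, here is a sketch that runs the Breusch-Pagan test on a model fitted to made-up data. df_sim and its values are assumptions for illustration only.

```r
library(lmtest)  # provides bptest()

# Hypothetical stand-in data, not the World Bank data.
set.seed(123)
df_sim <- data.frame(Y2006 = runif(50, min = 2, max = 10))
df_sim$Chg_Net <- 1.9 - 0.22 * df_sim$Y2006 + rnorm(50, sd = 0.8)
model <- lm(Chg_Net ~ Y2006, data = df_sim)

# Breusch-Pagan test: the null hypothesis is constant error variance.
bp <- bptest(model)
bp
```

A p-value above the chosen threshold (0.05 in this post) means the test does not detect heteroskedasticity.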

Let's get the confidence interval.

The 95% confidence interval is -0.322 to -0.119, which does not include 0, so Y2006 has a negative relationship with Chg_Net.
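
One common way to get this theory-based interval is confint(). The sketch below uses made-up data (df_sim is an assumption), so its interval will differ from the one above.

```r
# Hypothetical stand-in data, not the World Bank data.
set.seed(123)
df_sim <- data.frame(Y2006 = runif(50, min = 2, max = 10))
df_sim$Chg_Net <- 1.9 - 0.22 * df_sim$Y2006 + rnorm(50, sd = 0.8)
model <- lm(Chg_Net ~ Y2006, data = df_sim)

# t-based 95% confidence intervals for the intercept and the slope.
ci <- confint(model, level = 0.95)
ci
```

The row named Y2006 holds the lower and upper bounds for the slope.
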
Let's make a plot.

Next, I will do the modern (simulation-based) regression analysis.

First, I get the observed intercept and slope.

The intercept is 1.91 and the slope is -0.221, which are the same as in the traditional way.
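
With the infer package, the observed fit can be obtained with specify() and fit(). This is a sketch under my assumptions: df_sim is made-up data, and I do not know whether the original code used fit() or another route.

```r
library(infer)

# Hypothetical stand-in data, not the World Bank data.
set.seed(123)
df_sim <- data.frame(Y2006 = runif(50, min = 2, max = 10))
df_sim$Chg_Net <- 1.9 - 0.22 * df_sim$Y2006 + rnorm(50, sd = 0.8)

# Declare the model, then fit it to the observed data.
spec <- specify(df_sim, Chg_Net ~ Y2006)
observed_fit <- fit(spec)
observed_fit  # one row per term, with an estimate column
```

The estimates should agree with coef(lm(Chg_Net ~ Y2006, data = df_sim)), since both fit the same linear model.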

Since the simulation-based analysis does not rely on the constant-variance assumption, I skip the heteroskedasticity check.

Next, I run a bootstrap simulation.

Then, I calculate the confidence interval.

The 95% confidence interval for Y2006 is -0.303 to -0.140, which is slightly different from the traditional confidence interval.
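
Here is a sketch of a bootstrap confidence interval for the slope with infer. The data (df_sim), the number of replicates (1000), and the use of calculate(stat = "slope") with a percentile interval are my assumptions, not necessarily what was used for the numbers above.

```r
library(infer)

# Hypothetical stand-in data, not the World Bank data.
set.seed(123)
df_sim <- data.frame(Y2006 = runif(50, min = 2, max = 10))
df_sim$Chg_Net <- 1.9 - 0.22 * df_sim$Y2006 + rnorm(50, sd = 0.8)

# Resample rows with replacement 1000 times and compute the slope each time.
spec <- specify(df_sim, Chg_Net ~ Y2006)
boot_samples <- generate(spec, reps = 1000, type = "bootstrap")
boot_dist <- calculate(boot_samples, stat = "slope")

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap slopes.
boot_ci <- get_confidence_interval(boot_dist, level = 0.95, type = "percentile")
boot_ci
```

The interval changes a little from run to run unless the seed is fixed.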

Let's visualize both confidence intervals on the bootstrap distribution.

The red vertical lines are the traditional (theory-based) confidence interval, and the green lines are the modern (simulation-based) confidence interval.
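
A plot like this can be sketched with ggplot2 alone. To keep the example short, I use a manual bootstrap with replicate() instead of infer; df_sim is made-up data and the colors simply follow the description above.

```r
library(ggplot2)

# Hypothetical stand-in data, not the World Bank data.
set.seed(123)
df_sim <- data.frame(Y2006 = runif(50, min = 2, max = 10))
df_sim$Chg_Net <- 1.9 - 0.22 * df_sim$Y2006 + rnorm(50, sd = 0.8)

# Theory-based interval for the slope.
model <- lm(Chg_Net ~ Y2006, data = df_sim)
theory_ci <- unname(confint(model)["Y2006", ])

# Manual bootstrap: resample rows with replacement and refit the slope.
boot_slopes <- replicate(1000, {
  idx <- sample(nrow(df_sim), replace = TRUE)
  coef(lm(Chg_Net ~ Y2006, data = df_sim[idx, ]))[["Y2006"]]
})
boot_ci <- unname(quantile(boot_slopes, probs = c(0.025, 0.975)))

# Histogram of bootstrap slopes with both intervals as vertical lines.
p <- ggplot(data.frame(slope = boot_slopes), aes(x = slope)) +
  geom_histogram(bins = 30, fill = "grey70") +
  geom_vline(xintercept = theory_ci, color = "red") +
  geom_vline(xintercept = boot_ci, color = "green")
p
```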

We see that the modern (simulation-based) confidence interval is narrower than the traditional (theory-based) confidence interval.

That's it. Thank you!



World Bank Income share held by lowest 20% data analysis 4 - Visualize change net and change percent by region, by income group.

This post follows the previous post in this series.

In the previous post, I calculated change net and change percent between 2006 and 2018.

Let's visualize them.

First, let's make a histogram for change net.

I used the stat_bin() function to add the number of observations to the histogram.

Next, let's make a histogram for change percent.

Both change net and change percent look roughly like a normal distribution with some right skew.
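
Here is a small sketch of a histogram with bin counts printed above the bars. The data (df_sim) and the number of bins are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data, not the World Bank data.
set.seed(1)
df_sim <- data.frame(Chg_Net = rnorm(60, mean = 0.5, sd = 1))

# geom_histogram() draws the bars; stat_bin() with geom = "text" reuses the
# same binning and prints each bin's count just above its bar.
p <- ggplot(df_sim, aes(x = Chg_Net)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  stat_bin(bins = 10, geom = "text",
           aes(label = after_stat(count)), vjust = -0.5)
p
```

Both layers need the same bins setting, otherwise the labels will not line up with the bars.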

Next, let's see regional characteristics.

First, I use the group_by() and summarize() functions to calculate average values by region. Then, I use the geom_point() and geom_text() functions to draw the points and their labels.

Latin America & Caribbean is the region that changed the most.
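
The steps above can be sketched as follows. The regions and values in df_sim are made up, and the choice of axes (average net change against average percent change) is my assumption.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data with placeholder region names.
df_sim <- data.frame(
  region  = c("Region A", "Region A", "Region B", "Region B", "Region C", "Region C"),
  Chg_Net = c(0.2, 0.4, 1.0, 1.4, -0.1, 0.3),
  Chg_Pct = c(3, 5, 20, 30, -1, 4)
)

# Average change net and change percent per region.
by_region <- df_sim %>%
  group_by(region) %>%
  summarize(avg_net = mean(Chg_Net), avg_pct = mean(Chg_Pct))

# One point per region, labelled with the region name.
p <- ggplot(by_region, aes(x = avg_net, y = avg_pct)) +
  geom_point() +
  geom_text(aes(label = region), vjust = -1)
p
```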

How about by income?

Lower middle income is the income group that changed the most, and High income is the group that changed the least.

Let's see which country has the largest change.

Moldova has the largest change net.

Bolivia has the largest change percent.
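
One way to pick out the rows with the largest or smallest change is dplyr's slice_max() and slice_min(). This is only a sketch with made-up countries and values; the original code may have used arrange() or something else.

```r
library(dplyr)

# Hypothetical stand-in data with placeholder country names.
df_sim <- data.frame(
  name    = c("Country A", "Country B", "Country C"),
  Chg_Net = c(1.5, -0.7, 0.3),
  Chg_Pct = c(20, -9, 35)
)

# Row with the largest net change, and row with the largest percent change.
top_net <- df_sim %>% slice_max(Chg_Net, n = 1)
top_pct <- df_sim %>% slice_max(Chg_Pct, n = 1)

# Row with the most negative net change.
bottom_net <- df_sim %>% slice_min(Chg_Net, n = 1)
```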

Which country has the largest negative change?

Luxembourg has the largest negative change net.

Luxembourg has the largest negative change percent.

Which country has the highest share held by the lowest 20%?

Czechia and Moldova have the largest share in 2018.

Slovenia has the highest share in 2006.

Lastly, which country has the lowest share?

Brazil has the lowest share of lowest 20% in 2018.

Honduras has the lowest share in 2006.

That's it. Thank you!






World Bank Income share held by lowest 20% data analysis 3 - Using pivot_wider() to see net change from 2006 to 2018.

This post follows the previous post in this series.

Let's see which year has the most observations. I use the count(), arrange(), and desc() functions.

Year 2018 has the most observations; 2012 and 2015 are tied for second.

Let's see boxplots for those three years.
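
A minimal sketch of this counting step, using a tiny made-up data frame (df_sim is an assumption, not the real data):

```r
library(dplyr)

# Hypothetical stand-in data: one row per country-year observation.
df_sim <- data.frame(
  year  = c(2018, 2018, 2018, 2012, 2012, 2006),
  share = c(6.1, 7.3, 5.2, 6.8, 4.9, 5.5)
)

# count() adds a column n with the number of rows per year;
# arrange(desc(n)) puts the year with the most observations first.
year_counts <- df_sim %>%
  count(year) %>%
  arrange(desc(n))
year_counts
```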

I don't see any large differences.

Next, let's see which country has the most observations.

USA has the most, GBR has the second most, and CAN has the third most.
Let's make a chart for those three countries.

These three countries have different patterns. USA is almost always the lowest, and its level stays between 5 and 6.

Canada and GBR trend in opposite directions before 1990, then they move together.


In the above results of observations by year, the oldest year is 2006 and the newest year is 2018. So, I use these two years.

I delete the NA rows.

Let's calculate change net from Y2006 to Y2018.
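
The reshaping and the change calculation can be sketched like this. The data are made up, and the formula for Chg_Pct (net change divided by the 2006 value, times 100) is my assumption since the post does not spell it out.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in data in long format with placeholder country codes.
df_sim <- data.frame(
  code  = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  year  = c(2006, 2018, 2006, 2018, 2006),
  share = c(5.0, 6.0, 8.0, 7.0, 4.0)
)

df_wide <- df_sim %>%
  filter(year %in% c(2006, 2018)) %>%
  # One column per year: Y2006 and Y2018.
  pivot_wider(names_from = year, values_from = share, names_prefix = "Y") %>%
  # Drop countries that are missing either year (CCC here).
  na.omit() %>%
  mutate(Chg_Net = Y2018 - Y2006,
         Chg_Pct = Chg_Net / Y2006 * 100)
df_wide
```

After this step, summary(df_wide) gives the summary statistics for Chg_Net and Chg_Pct.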

Let's see summary statistics.

I see that the mean and median of both Chg_Net and Chg_Pct are positive.

So from 2006 to 2018, the share held by the lowest 20% increased overall.

That's it. Thank you!






World Bank Income share held by lowest 20% data analysis 2 - Some data visualization with R

This post follows the previous post in this series.
In the previous post, I imported the data into R and made a tidy data frame that can be analyzed.

So, let's analyze!

First, I use the summary() function.

For year, the oldest year is 1963 and the newest year is 2023.

For share, the smallest is 0.8, the largest is 11.7, and the average is 6.616.

Let's make a histogram for share. I use the ggplot2 package.

I added a red vertical line with geom_vline() at 6.616, the average value.

Next, let's make a line chart.

Oh? There are some observations where income is NA.
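
A sketch of a histogram with a vertical line at the mean is below. The data are simulated (df_sim is an assumption), so the mean will not be exactly 6.616.

```r
library(ggplot2)

# Hypothetical stand-in data, not the World Bank data.
set.seed(1)
df_sim <- data.frame(share = rnorm(200, mean = 6.6, sd = 1.8))

# Histogram of share with a red vertical line at the average value.
p <- ggplot(df_sim, aes(x = share)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = mean(df_sim$share), color = "red")
p
```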

Let's check it.

I see that Venezuela, RB has NA for income.
So, I delete it.

Again, let's see summary().
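
Checking for and removing such rows could look like the sketch below. The data are made up, and filter() with is.na() is just one way to do it.

```r
library(dplyr)

# Hypothetical stand-in data with placeholder country names.
df_sim <- data.frame(
  name   = c("Country A", "Country B", "Country C"),
  share  = c(5.1, 6.2, 4.3),
  income = c("High income", NA, "Low income")
)

# Look at the rows where income is missing.
df_sim %>% filter(is.na(income))

# Keep only the rows where income is present.
df_clean <- df_sim %>% filter(!is.na(income))
```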

Now, I have 2094 observations, the average share is 6.634.

Let's make boxplots for share by year.

I don't see a clear trend.

That's it. Thank you!






World Bank Income share held by lowest 20% data analysis 1 - import data into R from CSV file and make "tidy" data frame.

In this post, I will analyze World Bank's "Income share held by lowest 20%" data with R.

From the website, Income share held by lowest 20% | Data (worldbank.org)

I got the two CSV files below.

The above is the data file.

The above is the metadata file.

I will import those data into R.

First, I load the "tidyverse" package, which is one of the greatest packages in R.

Then, I use the read_csv() function to import the CSV file data into R.

df_raw is the raw data frame, and df_meta is the metadata data frame.

Since df_raw is not a tidy data frame, I will use the pivot_longer() function to convert it into a tidy data frame.

Next, I delete `Indicator Name` and `Indicator Code` because they have the same value for all observations. I use the select() function.
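
To show the idea, here is a sketch of pivot_longer() on a tiny wide table shaped like the World Bank file. df_raw_sim and its values are made up, and the output column names year and share are my guess based on later posts.

```r
library(tidyr)

# Hypothetical stand-in for df_raw: one row per country, one column per year.
df_raw_sim <- data.frame(
  `Country Name` = c("Country A", "Country B"),
  `Country Code` = c("AAA", "BBB"),
  `2006` = c(5.1, NA),
  `2018` = c(5.6, 7.2),
  check.names = FALSE
)

# Stack the year columns into two columns: year (name) and share (value).
df_long <- pivot_longer(
  df_raw_sim,
  cols = c(`2006`, `2018`),
  names_to = "year",
  values_to = "share"
)
df_long
```

Each country-year pair becomes its own row, which is the tidy form used in the rest of the analysis.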

Next, I will rename `Country Name` and `Country Code` to name and code. I use the rename() function.

Next, I will change year to a numeric value. I use the parse_number() function inside the mutate() function.

Then, I delete the NA observations. I use the na.omit() function.

Then, I will merge df and df_meta. I use the inner_join() function.
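
The cleaning steps above can be chained into one pipeline. This is a sketch with made-up data: df_long_sim and df_meta_sim are stand-ins, the indicator code is a placeholder, and the metadata columns (region, income) and the join key (code) are my assumptions.

```r
library(dplyr)
library(readr)  # parse_number()

# Hypothetical stand-in for the long-format data.
df_long_sim <- data.frame(
  `Country Name`   = c("Country A", "Country A", "Country B"),
  `Country Code`   = c("AAA", "AAA", "BBB"),
  `Indicator Name` = "Income share held by lowest 20%",
  `Indicator Code` = "PLACEHOLDER.CODE",
  year  = c("2006", "2018", "2018"),
  share = c(5.1, 5.6, NA),
  check.names = FALSE
)

# Hypothetical stand-in for the metadata.
df_meta_sim <- data.frame(
  code   = c("AAA", "BBB"),
  region = c("Region A", "Region B"),
  income = c("High income", "Low income")
)

df <- df_long_sim %>%
  select(-`Indicator Name`, -`Indicator Code`) %>%          # constant columns
  rename(name = `Country Name`, code = `Country Code`) %>%  # shorter names
  mutate(year = parse_number(year)) %>%                     # "2006" -> 2006
  na.omit() %>%                                             # drop missing share
  inner_join(df_meta_sim, by = "code")                      # add region, income
df
```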

All right!

Finally, I got a data frame, which has 2107 observations with 6 variables.

That's it. Thank you!




Reading log - "零の発見: 数学の生い立ち" (The Discovery of Zero: The Origins of Mathematics) by 吉田 洋一 (岩波新書)





Raisin data analysis from the UCI Machine Learning Repository 6 - Classification with a neural network, accuracy is 86.3%

I set size = 3, which means the hidden layer of the neural network has 3 units.





I also create a confusion matrix.

With this post, I will finish the classification analysis of the Raisin data from the UCI Machine Learning Repository.













