A blog where I do things with R and read books

I practice R using data from e-Stat (the portal site of official statistics of Japan), the OECD, the UCI Machine Learning Repository, and other sources. I also post reading notes from time to time.

World Bank Income share held by lowest 20% data analysis 5 - theory-based regression analysis vs. simulation-based regression analysis.

Generated by Bing Image Creator: Large tree, tiny flowers, green grass, blue sky and white clouds, morning, photo

www.crosshyou.info

This post is a follow-up to the previous post.

In this post, I will do regression analysis in two ways: the traditional (theory-based) way and the modern (simulation-based) way.

Before the analysis, I load the "infer" package.
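Here is a minimal sketch of the packages used in this post (lmtest appears a bit later for the heteroskedasticity check):

```r
library(infer)    # simulation-based inference (specify, generate, fit, ...)
library(lmtest)   # bptest() for the Breusch-Pagan heteroskedasticity test
```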

Let's start with the traditional (theory-based) linear regression analysis.

I use the lm() function.
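A minimal sketch of the model fit; df_change is a placeholder name for the data frame that holds the Y2006 and Chg_Net columns from the previous post:

```r
fit_lm <- lm(Chg_Net ~ Y2006, data = df_change)
summary(fit_lm)
```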

In the traditional way, we need to check for heteroskedasticity. I use the bptest() function from the lmtest package.
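Something like this, applied to the lm object from the sketch above:

```r
bptest(fit_lm)   # Breusch-Pagan test; H0: homoskedastic errors
```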

The p-value is greater than 0.05, so there is no evidence of heteroskedasticity.

Let's get the confidence interval.
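With the fitted lm object, confint() gives the theory-based interval:

```r
confint(fit_lm, level = 0.95)
```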

The 95% confidence interval is -0.322 to -0.119, which does not include 0, so Y2006 has a negative relationship with Chg_Net.
Let's make a plot.
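For example, a scatter plot with the fitted regression line (assuming ggplot2 is already loaded via the tidyverse):

```r
ggplot(df_change, aes(x = Y2006, y = Chg_Net)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
```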

Next, I will do the modern (simulation-based) regression analysis.

First, I get the observed intercept and slope.
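With infer, the observed fit can be obtained like this (a sketch, using the placeholder data frame name from above):

```r
obs_fit <- df_change %>%
  specify(Chg_Net ~ Y2006) %>%
  fit()
obs_fit
```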

The intercept is 1.91 and the slope is -0.221, the same as in the traditional way.

Since the simulation-based analysis does not require a heteroskedasticity check, I skip it.

Next, I run a bootstrap simulation.
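A sketch of the bootstrap step, refitting the model to each resample:

```r
set.seed(123)   # the actual seed used is an assumption
boot_fits <- df_change %>%
  specify(Chg_Net ~ Y2006) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  fit()
```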

Then, I calculate the confidence interval.
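For example, a percentile interval from the bootstrap fits:

```r
boot_fits %>%
  get_confidence_interval(point_estimate = obs_fit,
                          level = 0.95,
                          type = "percentile")
```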

The 95% confidence interval for Y2006 is -0.303 to -0.140, slightly different from the traditional confidence interval.

Let's visualize both confidence intervals together with the bootstrap distribution.

The red vertical lines are the traditional (theory-based) confidence interval, and the green lines are the modern (simulation-based) confidence interval.
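One way to draw this picture is to filter the bootstrap fits to the Y2006 term and mark both intervals (the exact plotting approach in the original post may differ):

```r
boot_fits %>%
  filter(term == "Y2006") %>%
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = c(-0.322, -0.119), color = "red") +    # theory-based CI
  geom_vline(xintercept = c(-0.303, -0.140), color = "green")    # simulation-based CI
```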

We can see that the modern (simulation-based) confidence interval is narrower than the traditional (theory-based) one.

That's it. Thank you!

To read from the first post,

www.crosshyou.info

World Bank Income share held by lowest 20% data analysis 4 - Visualize change net and change percent by region, by income group.

Generated by Bing Image Creator: Joyful landscape view of green grass field, photo

 

www.crosshyou.info

This post is a follow-up to the post above.

In the previous post, I calculated the net change and the percent change between 2006 and 2018.

Let's visualize them.

First, let's make a histogram for change net.

I used the stat_bin() function to add the number of observations to the histogram.
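A sketch of the histogram with count labels, assuming ggplot2 and the df_change placeholder name from before:

```r
ggplot(df_change, aes(x = Chg_Net)) +
  geom_histogram(bins = 20) +
  stat_bin(bins = 20, geom = "text",
           aes(label = after_stat(count)), vjust = -0.5)
```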

Next, let's make a histogram for change percent.

Both change net and change percent look roughly like a normal distribution with a right skew.

Next, let's see regional characteristics.

First, I use the group_by() and summarize() functions to calculate average values by region, and then I use geom_point() and geom_text() to draw the points and labels.
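Something along these lines; the region column name is an assumption based on the joined metadata:

```r
df_change %>%
  group_by(region) %>%
  summarize(avg_chg_net = mean(Chg_Net),
            avg_chg_pct = mean(Chg_Pct)) %>%
  ggplot(aes(x = avg_chg_net, y = avg_chg_pct)) +
  geom_point() +
  geom_text(aes(label = region), vjust = -0.7)
```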

The Latin America & Caribbean region changed the most.

How about by income?

Lower middle income is the income group with the largest change, and High income is the group with the smallest change.

Let's see which country has the largest change.
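For example, sorting by Chg_Net in descending order:

```r
df_change %>%
  arrange(desc(Chg_Net)) %>%
  select(name, Chg_Net, Chg_Pct) %>%
  head(5)
```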

Moldova has the largest change net.

Bolivia has the largest change percent.

Which country has the largest negative change?

Luxembourg has the largest negative change net.

Luxembourg also has the largest negative change percent.

Which country has the highest share held by the lowest 20%?

Czechia and Moldova have the largest share in 2018.

Slovenia has the highest share in 2006.

Lastly, which country has the lowest share?

Brazil has the lowest share held by the lowest 20% in 2018.

Honduras has the lowest share in 2006.

That's it. Thank you!

Next post is

www.crosshyou.info

 

To read from the first post,

www.crosshyou.info

World Bank Income share held by lowest 20% data analysis 3 - Using pivot_wider() to see net change from 2006 to 2018.

Generated by Bing Image Creator: Long scale view of green forest, blue sky, cosmos flowers, photo

 

www.crosshyou.info

This post is a follow-up to the post above.

Let's see which year has the most observations. I use the count(), arrange(), and desc() functions.
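A minimal sketch, with df as a placeholder name for the tidy data frame from post 1:

```r
df %>%
  count(year) %>%
  arrange(desc(n))
```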

Year 2018 has the most observations; 2012 and 2015 are tied for second.

Let's look at boxplots for those three years.
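For example, using the year values from the count above:

```r
df %>%
  filter(year %in% c(2012, 2015, 2018)) %>%
  ggplot(aes(x = factor(year), y = share)) +
  geom_boxplot()
```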

I don't see any large differences.

Next, let's see which country has the most observations.

USA has the most, GBR the second most, and CAN the third.
Let's make a chart for those three countries.
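A sketch of the line chart, using the country codes mentioned above:

```r
df %>%
  filter(code %in% c("USA", "GBR", "CAN")) %>%
  ggplot(aes(x = year, y = share, color = code)) +
  geom_line()
```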

These three countries show different patterns. The USA is almost always the lowest, and its level stays between 5 and 6.

Canada and GBR trend in opposite directions before 1990; after that they move together.

 

In the above counts of observations by year, the oldest year is 2006 and the newest year is 2018, so I use these two years.

I delete NA rows.

Let's calculate the net change from Y2006 to Y2018.
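Putting the pivot_wider() step, the NA removal, and the change calculations together, a sketch might look like this (df is the placeholder name for the tidy data frame):

```r
df_change <- df %>%
  filter(year %in% c(2006, 2018)) %>%
  pivot_wider(names_from = year, values_from = share,
              names_prefix = "Y") %>%
  na.omit() %>%
  mutate(Chg_Net = Y2018 - Y2006,
         Chg_Pct = Chg_Net / Y2006 * 100)
```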

Let's see summary statistics.

I see that the mean and median of both Chg_Net and Chg_Pct are positive.

So from 2006 to 2018, the share held by the lowest 20% increased overall.

That's it. Thank you!

Next post is

www.crosshyou.info

 

To read from the first post,

www.crosshyou.info

World Bank Income share held by lowest 20% data analysis 2 - Some data visualization with R

Generated by Bing Image Creator: Wide view of clear small creek, water flowers, photo

www.crosshyou.info

This post is a follow-up to the post above.
In the previous post, I imported the data into R and made a tidy data frame that can be analyzed.

So, let's analyze!

First, I use the summary() function.
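For example, with df as the placeholder name for the tidy data frame:

```r
summary(df)
```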

For year, the oldest year is 1963 and the newest year is 2023.

For share, the smallest value is 0.8, the largest is 11.7, and the average is 6.616.

Let's make a histogram for share. I use the ggplot2 package.

I added a red vertical line with geom_vline() at 6.616, the average value.
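A sketch of the histogram, assuming the share column name from the summary output:

```r
ggplot(df, aes(x = share)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = mean(df$share, na.rm = TRUE), color = "red")
```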

Next, let's make a line chart.

Oh? There are some observations with NA for income.

Let's check it.

I see that Venezuela, RB has NA for income, so I delete it.
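For example, checking which observations have NA for income and then dropping them:

```r
df %>% filter(is.na(income)) %>% count(name)   # which country has NA income?

df <- df %>% filter(!is.na(income))            # drop those rows
```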

Let's look at summary() again.

Now I have 2094 observations, and the average share is 6.634.

Let's make boxplots for share by year.
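Something like this:

```r
ggplot(df, aes(x = factor(year), y = share)) +
  geom_boxplot()
```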

I don't see a clear trend.

That's it. Thank you!

Next post is

www.crosshyou.info

 

To read from the first post,

www.crosshyou.info

World Bank Income share held by lowest 20% data analysis 1 - import data into R from CSV file and make "tidy" data frame.

Generated by Bing Image Creator: Wide-shot of mountain flower gardens, photo

In this post, I will analyze the World Bank's "Income share held by lowest 20%" data with R.

From the website, Income share held by lowest 20% | Data (worldbank.org)

I got the two CSV files below.

Above is the data file.

Above is the metadata file.

I will import those data into R.

First, I load the "tidyverse" package, which is one of the greatest packages in R.

Then, I use the read_csv() function to import the CSV data into R.
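A minimal sketch; the file names here are placeholders, and skip = 4 reflects the four header lines that World Bank CSV downloads usually have:

```r
library(tidyverse)

df_raw  <- read_csv("income_share_lowest20_data.csv", skip = 4)
df_meta <- read_csv("income_share_lowest20_metadata_country.csv")
```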

df_raw is the raw data frame, and df_meta is the metadata data frame.

Since df_raw is not a tidy data frame, I will use the pivot_longer() function to convert it into one.

Next, I delete `Indicator Name` and `Indicator Code` because they have the same value for all observations. I use the select() function.

Next, I rename `Country Name` and `Country Code` to name and code. I use the rename() function.

Next, I convert year to a numeric value. I use the parse_number() function inside mutate().

Then, I delete the NA observations. I use the na.omit() function.

Then, I merge df and df_meta. I use the inner_join() function.
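Putting these steps together, the pipeline might look like the sketch below (the join key is an assumption based on the usual World Bank metadata layout):

```r
df <- df_raw %>%
  pivot_longer(cols = matches("^\\d{4}$"),        # the year columns "1960" ... "2023"
               names_to = "year",
               values_to = "share") %>%
  select(-`Indicator Name`, -`Indicator Code`) %>%
  rename(name = `Country Name`, code = `Country Code`) %>%
  mutate(year = parse_number(year)) %>%
  na.omit() %>%
  inner_join(df_meta, by = c("code" = "Country Code"))
```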

All right!

Finally, I get a data frame with 2107 observations and 6 variables.

That's it. Thank you!

Next post is

www.crosshyou.info

 

Reading notes - "零の発見: 数学の生い立ち" (The Discovery of Zero: How Mathematics Grew Up) by Yoichi Yoshida (Iwanami Shinsho)

It was published before I was born and is still being reprinted today; it is a long-seller in the world of shinsho paperbacks.

The book contains two essays: one titled "The Discovery of Zero" and one titled "Cutting the Straight Line."

The Arabs (or maybe it was the Egyptians) valued practicality, so they made do with a square whose side is nine-tenths of the circle's radius as the square with the same area as the circle, whereas the Greeks prized the world of thought, were not satisfied with that level of precision, and tried to compute the ratio of the circumference to the diameter. The book is full of interesting stories like this.

UCI Machine Learning Repository Raisin data analysis 6 - classification with a neural network, 86.3% accuracy

Generated by Bing Image Creator: Wideshot view of Japanese forest, photo

www.crosshyou.info

This post is a follow-up to the post above.

This time, I will try classification with a neural network using R's nnet package.

First, I load the nnet package.

I fit a neural network model with the nnet() function.

I set size = 3, which means the (single) hidden layer has 3 units.

I make predictions with the predict() function.

Let's look at the results.

I calculate the accuracy.

It was 86.3%, the highest accuracy so far.

I also create a confusion matrix.
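Here is a minimal sketch of the whole workflow, assuming the Raisin data is already split into df_train and df_test with a factor column Class and numeric feature columns:

```r
library(nnet)

set.seed(123)   # the actual seed used is an assumption
fit_nn <- nnet(Class ~ ., data = df_train, size = 3, maxit = 500)

pred <- predict(fit_nn, newdata = df_test, type = "class")

mean(pred == df_test$Class)                        # accuracy
table(predicted = pred, actual = df_test$Class)    # confusion matrix
```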

I think I will wrap up the classification work on the UCI Machine Learning Repository Raisin data with this post.

I tried out a variety of methods:

Random guessing: 56%

Decision tree model: 83%

Support vector machine: 84%

LASSO regression: 83.7%

Neural network: 86.3%

With the decision tree model and LASSO regression you can see how the classification was made, but with the support vector machine and the neural network you cannot.

The decision tree model and LASSO regression have lower accuracy than the support vector machine and the neural network.

Those were the results.

The decision tree model and LASSO regression have lower accuracy, but not by much; with a difference as small as this one, I think the two methods whose classifications can be interpreted might be the better choice.

To read from the first post,

www.crosshyou.info
