www.crosshyou.info

I practice R with data from e-Stat (the Japanese government statistics portal), the OECD, the UCI repository, and so on. I also post reading notes from time to time.

OECD Young self-employed data analysis 4 - data visualization again and combining GDP data.

Photo by Fatih Yürür on Unsplash 

www.crosshyou.info

This post is a continuation of the post above.

In the previous post, I made a new data frame, df_new, which has a men variable and a women variable.

Let's visualize those data.

First, the men data by LOCATION:
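The chart itself is an image in the original post; the following is a minimal sketch of how such a box plot can be drawn with ggplot2. The data frame below is a toy stand-in for df_new, since the real data is not reproduced here.

```r
library(ggplot2)

# toy stand-in for df_new, since the real data frame is not reproduced here
set.seed(1)
df_new <- data.frame(
  LOCATION = rep(c("JPN", "MEX", "ZAF"), each = 5),
  men   = c(rnorm(5, 2, 0.5), rnorm(5, 15, 2), rnorm(5, 25, 3)),
  women = c(rnorm(5, 1, 0.3), rnorm(5, 18, 2), rnorm(5, 15, 2))
)

# box plot of men by LOCATION, ordered by the median value
p <- ggplot(df_new, aes(x = reorder(LOCATION, men, FUN = median), y = men)) +
  geom_boxplot() +
  labs(x = "LOCATION", y = "men (%)")
p
```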

JPN has the lowest value and ZAF has the highest.

Next, the women data by LOCATION:

JPN has the lowest and MEX has the highest. JPN has the lowest value for both men and women.

Then, let's see the men data by TIME.

I don't see a clear trend.

Next, the women data by TIME.

For the women's young self-employed percentage, I don't see a clear upward or downward trend either.

Let's check which TIME has the most observations.

I see 2021, 2013 and 2015 have the most observations: 26 each.
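A quick way to do that tally is base R's table(). This is a sketch with toy data, since the real df_new is not shown in text form:

```r
# table() counts how many rows fall in each TIME
df_new <- data.frame(
  TIME = c(2013, 2013, 2015, 2015, 2015, 2021),
  men  = c(8, 9, 10, 7, 6, 11)
)
obs_by_time <- sort(table(df_new$TIME), decreasing = TRUE)
obs_by_time
```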

By the way, I have another CSV data file, which has GDP and per-capita GDP data, like below.

Let's add this data to our data analysis.

Then I use the 2012 and 2015 data only and combine the GDP data with the young self-employed data.

This df_comb has 52 observations and 6 variables.
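The combining step can be sketched with base R's merge(). The toy frames and the GDP column names (gdp, gdp_pc) below are assumptions, since the real file layout is not shown:

```r
# toy stand-ins for the two data frames being combined
df_new <- data.frame(
  LOCATION = c("JPN", "JPN", "MEX", "MEX"),
  TIME     = c(2012, 2015, 2012, 2015),
  men      = c(2.1, 2.3, 15.5, 16.0),
  women    = c(1.0, 1.1, 17.5, 18.0)
)
df_gdp <- data.frame(
  LOCATION = c("JPN", "JPN", "MEX", "MEX"),
  TIME     = c(2012, 2015, 2012, 2015),
  gdp      = c(6.2e12, 4.4e12, 1.2e12, 1.2e12),
  gdp_pc   = c(48600, 34500, 10200, 9600)
)

# keep 2012 and 2015 only, then join on LOCATION and TIME
df_comb <- merge(
  subset(df_new, TIME %in% c(2012, 2015)),
  df_gdp,
  by = c("LOCATION", "TIME")
)
ncol(df_comb)  # 6 variables, as in the post
```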

Let's analyze the relationship between the GDP data and the young self-employed data in the next post.

That's it. Thank you!

Next post is

www.crosshyou.info

To read the 1st post,

www.crosshyou.info

OECD Young self-employed data analysis 3 - Comparing the men's and women's data: men have a higher ratio than women.

Photo by RoonZ nl on Unsplash 

www.crosshyou.info

This post is a continuation of the post above.

In this post, let's compare men's and women's self-employment.
First, I make two vectors, "men" and "women".
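The extraction is not shown in text form; here is a sketch, assuming the long-format data frame df from the earlier posts with SUBJECT and Value columns:

```r
# pull the men and women values out of the long-format data frame
df <- data.frame(
  SUBJECT = c("men", "women", "men", "women"),
  Value   = c(8.0, 5.5, 8.8, 5.9)
)
men   <- df$Value[df$SUBJECT == "men"]
women <- df$Value[df$SUBJECT == "women"]
summary(men)
summary(women)
```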

Let's see the summary statistics for both.

The men's average is 8.4 and the women's average is 5.7.

Let's check whether the two variances are the same or not. I use the var.test() function.

The p-value is 1.134e-08, which is almost 0, so the variances are not the same.
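var.test() is R's F test for equality of two variances. A self-contained sketch with simulated data (the means, standard deviations, and sample sizes are made up for illustration):

```r
# simulate two groups whose variances clearly differ
set.seed(1)
men   <- rnorm(100, mean = 8.4, sd = 4)
women <- rnorm(100, mean = 5.7, sd = 2)

vt <- var.test(men, women)   # H0: the ratio of the variances is 1
vt$p.value                   # a tiny p-value rejects equal variances
```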

Next, let's check whether the distribution locations are the same. I use the wilcox.test() function.

The p-value is less than 2.2e-16, so the men's and women's locations are not equal.
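wilcox.test() with two samples runs the Wilcoxon rank-sum test, whose null hypothesis is that the two distributions have the same location. A sketch with the same simulated data as above:

```r
# Wilcoxon rank-sum test: H0 is that the two locations are equal
set.seed(1)
men   <- rnorm(100, mean = 8.4, sd = 4)
women <- rnorm(100, mean = 5.7, sd = 2)

wt <- wilcox.test(men, women)
wt$p.value   # a tiny p-value means the locations differ
```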

So far, I have seen that men and women have different characteristics, so it is better to analyze men and women separately.

I make a new data frame which has both men and women variables.
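One way to build that wide data frame, sketched in base R with reshape() (the blog may well use tidyr's pivot_wider() instead; the toy data below is an assumption):

```r
# long-format toy data: one row per LOCATION/TIME/SUBJECT
df <- data.frame(
  LOCATION = c("JPN", "JPN", "MEX", "MEX"),
  TIME     = c(2015, 2015, 2015, 2015),
  SUBJECT  = c("men", "women", "men", "women"),
  Value    = c(2.3, 1.1, 16.0, 18.0)
)

# spread SUBJECT into separate men / women columns
df_wide <- reshape(df, idvar = c("LOCATION", "TIME"),
                   timevar = "SUBJECT", direction = "wide")
names(df_wide) <- sub("^Value\\.", "", names(df_wide))  # Value.men -> men
df_wide
```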

Let's see the summary() results.

Now we see the men's average is 9.21 and the women's average is 5.73.

Let's use var.test() to check whether the variances are the same or not.

We see a small enough p-value, so the men's and women's variances are not the same.

How about location?

The p-value is almost 0, so men and women have different location shifts.

I can say that, in general, the men's young self-employed ratio is higher than the women's young self-employed ratio.

Let's see a scatter plot.
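A base-R sketch of the scatter plot, with cor() to quantify the relationship (the simulated, positively related data is an assumption for illustration):

```r
# scatter plot of men against women, plus the correlation coefficient
set.seed(1)
women <- rnorm(50, mean = 5.7, sd = 2)
men   <- 1.2 * women + rnorm(50, mean = 2, sd = 1)  # toy positively related data

plot(women, men, xlab = "women (%)", ylab = "men (%)")
cor(men, women)   # positive
```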

men and women have a positive correlation.

That's it. Thank you!

Next post is

www.crosshyou.info

To read the first post,

www.crosshyou.info

OECD Young self-employed data analysis 2 - data visualization with the ggplot2 package: geom_point(), geom_line(), geom_boxplot(), geom_histogram() and geom_bar().

Photo by CHUTTERSNAP on Unsplash 

www.crosshyou.info

This post is a continuation of the post above.
Let's see the data in some graphs.

I use the ggplot2 package, which is included in the tidyverse package.

geom_point() makes a scatter plot. I see that men have a higher self-employed percentage than women in general.
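A minimal geom_point() sketch, using toy long-format data as a stand-in for the real data frame:

```r
library(ggplot2)

# toy long-format data standing in for the real data frame
df <- data.frame(
  TIME    = rep(2010:2014, times = 2),
  SUBJECT = rep(c("men", "women"), each = 5),
  Value   = c(8, 9, 8.5, 9.2, 8.8, 5.5, 5.9, 5.7, 6.0, 5.8)
)

# scatter plot, one color per SUBJECT
p <- ggplot(df, aes(x = TIME, y = Value, color = SUBJECT)) +
  geom_point()
p
```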

Next, let's see which LOCATION has the highest/lowest value.

JPN has an extremely low percentage and ZAF has the highest percentage.

Next, let's see the trend lines by LOCATION.

Next, let's see the women's trend.

I don't see a clear time trend.

Next, let's see histograms.

The men's values have a wider spread.

Next, let's see which LOCATION has the most observations.

NLD, ITA, GRC and ESP have the most observations.
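That kind of count chart is what geom_bar() does out of the box. A sketch with toy data:

```r
library(ggplot2)

# toy data: one row per observation
df <- data.frame(LOCATION = c("NLD", "NLD", "NLD", "ITA", "ITA", "JPN"))

# geom_bar() counts the rows per category by default (stat = "count")
p <- ggplot(df, aes(x = LOCATION)) +
  geom_bar()
p
```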

In this post, I used the ggplot2 package for visualization:
geom_point() for scatter plots,

geom_line() for line chart,

geom_boxplot() for box plot,

geom_histogram() for histogram and

geom_bar() for bar chart.

That's it. Thank you!

Next post is

www.crosshyou.info

To read the first post,

www.crosshyou.info

OECD Young self-employed data analysis 1 - Read CSV file using R

Photo by Slawek K on Unsplash 

In this post, I will analyze OECD young self-employed data. This is the share of the self-employed aged 20-29 among all employed workers aged 20-29.

The CSV file which I downloaded from the OECD website looks like below.

Let's analyze this data with R.

First, I load the tidyverse package.

Then, I use the read_csv() function to read the CSV file.
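A self-contained sketch of the read step. The tiny CSV written below is a stand-in for the real OECD download, with the same column layout the post describes:

```r
library(readr)

# write a tiny stand-in CSV so the example runs on its own
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value",
  "JPN,YOUNGSELF,20_29_MEN,PC_TOTEMP,A,2015,2.3",
  "JPN,YOUNGSELF,20_29_WOMEN,PC_TOTEMP,A,2015,1.1"
), csv_path)

df_raw <- read_csv(csv_path, show_col_types = FALSE)
df_raw
```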

In the CSV file, there are LOCATION, INDICATOR, SUBJECT, MEASURE, FREQUENCY, TIME and Value. Let's check each variable.

There are no NAs for LOCATION, and CAN is the most frequent value, with 56 observations.

There are no NAs for INDICATOR, and INDICATOR has only one value, YOUNGSELF, so I can ignore INDICATOR.

There are no NAs for SUBJECT, and there are two kinds of values, 20_29_WOMEN and 20_29_MEN. I will change them to women and men later.

There are no NAs for MEASURE, and there is only one value, PC_TOTEMP, so I can ignore this.

There are no NAs for FREQUENCY, and there is only one value, A, so I can ignore this.

There are no NAs for TIME. 2013 and 2015 have the most observations, 58 each.

There are no NAs for Value, and the Value distribution is right-skewed.

Now I will make a data frame for analysis. I will delete INDICATOR, MEASURE and FREQUENCY, keep only the TIMEs which have more than 30 LOCATIONs (that means from 1996 on), and change the SUBJECT values to women and men.
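A base-R sketch of those clean-up steps (the TIME filter is omitted for brevity; the toy rows follow the column descriptions above):

```r
# raw data as described above: constant columns plus the interesting ones
df_raw <- data.frame(
  LOCATION  = c("JPN", "JPN", "MEX", "MEX"),
  INDICATOR = "YOUNGSELF",
  SUBJECT   = c("20_29_MEN", "20_29_WOMEN", "20_29_MEN", "20_29_WOMEN"),
  MEASURE   = "PC_TOTEMP",
  FREQUENCY = "A",
  TIME      = 2015,
  Value     = c(2.3, 1.1, 16.0, 18.0)
)

# drop the constant columns and recode SUBJECT to men / women
df <- df_raw[, c("LOCATION", "SUBJECT", "TIME", "Value")]
df$SUBJECT <- ifelse(df$SUBJECT == "20_29_MEN", "men", "women")
df
```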

That's it for today.
Thank you!

The next post is

www.crosshyou.info

Reading notes - "SDGs --- 危機の時代の羅針盤" (SDGs: A Compass for an Age of Crisis) by 南博 & 稲葉雅紀, Iwanami Shinsho

 

What I thought after reading this book is that

the SDGs are like a catalog of the crises the human world faces today.

The 17 goals set out targets that must be achieved by 2030; put the other way around, these are things we cannot do yet.

Current human society reportedly consumes the resources of 1.69 Earths.

 

 

Nationwide uniform Retail Price Statistics data analysis 6 - checking for serial correlation: an AR(1) serial correlation test and the Durbin-Watson test

Photo by Al Pangestu on Unsplash 

www.crosshyou.info

Last time, I ran regressions on time-series data with the dynlm() function from the dynlm package.

In time-series regression, serial correlation prevents a proper analysis, so this time I check whether the previous regressions suffer from serial correlation.

First, I save the residuals with the resid() function.

Let's plot them.

All three sets of residuals look similar.

Next is the AR(1) serial correlation test. For all three models, the p-value is greater than 0.05 and not significant, so there is no serial correlation.
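The test itself appears only as an image in the original; as a sketch, the AR(1) serial correlation test regresses the residuals on their own first lag and checks whether the lag coefficient is significant. The white-noise residuals below are toy data:

```r
# AR(1) serial correlation test on toy residuals
set.seed(1)
u <- rnorm(100)                  # white-noise residuals: no serial correlation
u_lag <- c(NA, u[-length(u)])    # first lag of the residuals

ar1 <- lm(u ~ u_lag)
p_value <- summary(ar1)$coefficients["u_lag", "Pr(>|t|)"]
p_value   # a p-value above 0.05 means no evidence of serial correlation
```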

I also run the Durbin-Watson test, with the dwtest() function from the lmtest package.

For all three models, the p-value is greater than 0.05, so we cannot reject the null hypothesis that the serial correlation is 0.

dwtest() is easy to use: you simply pass it the model fitted with dynlm().
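The statistic dwtest() reports can also be computed by hand from the residuals, as this package-free sketch with toy residuals shows:

```r
# Durbin-Watson statistic: DW = sum((u_t - u_{t-1})^2) / sum(u_t^2)
# values near 2 indicate no first-order serial correlation
set.seed(1)
u <- rnorm(100)   # toy residuals
dw <- sum(diff(u)^2) / sum(u^2)
dw
```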

That's it for this time.

To read from the first post,

www.crosshyou.info

 

 

Nationwide uniform Retail Price Statistics data analysis 5 - time-series regression with the dynlm() function from the dynlm package

Photo by Allyson Beaucourt on Unsplash 

www.crosshyou.info

This post is a continuation of the post above.

In the previous analysis, we found that year and month have no statistically significant effect on price.

This time I narrow the data down to the highest-priced item, imported cars, and run a time-series analysis.

First, I make a dataset of imported cars only.

name_code and name are not needed, so I delete them.

I convert the data to a time series with the ts() function.

I sorted time_code in ascending order with the arrange() function, and then made the time-series data with the ts() function.
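A base-R sketch of that sort-then-convert step (order() plays the role of arrange(); the prices, start date, and frequency are toy values):

```r
# monthly prices, deliberately out of order
price <- data.frame(
  time_code = c(202003, 202001, 202002),
  value     = c(3500000, 3400000, 3450000)
)

price <- price[order(price$time_code), ]  # ascending time_code, like arrange()

# convert to a monthly time series and plot it
price_ts <- ts(price$value, start = c(2020, 1), frequency = 12)
plot(price_ts)
price_ts
```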

Running the plot() function on the time-series data gives a time-series chart this easily.

I load the dynlm package, which is for regression analysis of time-series data.

For a start, let's fit the model: this month's price = b0 + b1 * last month's price + u.

The dynlm() function in the dynlm package runs an OLS regression; within it, L() gives the previous observation (the first lag).
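A minimal sketch of that call, assuming the dynlm package is installed; the price series below is simulated, not the real imported-car data:

```r
library(dynlm)

# toy monthly price series standing in for the imported-car prices
set.seed(1)
price_ts <- ts(90000 + cumsum(rnorm(60, mean = 100, sd = 50)),
               start = c(2015, 1), frequency = 12)

# this month's price = b0 + b1 * last month's price + u
# L(price_ts) is the first lag of the series
model1 <- dynlm(price_ts ~ L(price_ts))
summary(model1)
```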

The coefficients are displayed with e+04 and the like and are hard to read, so let's show them in a more readable form.

The intercept is 90486 and the slope is 0.97875, so the model is

this month's price = 90486 + 0.97875 * last month's price + error term.

With dynlm(), you can use trend() to take a time trend into account.

The p-value of the trend is 0.609, so it does not seem to be a significant coefficient.

Let's look at each coefficient's value without the e+03 notation.

Using season() takes seasonality into account.
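A sketch combining the lag, trend(), and season() terms in one dynlm() call, again on a simulated toy series (not the real data):

```r
library(dynlm)

# toy monthly price series
set.seed(1)
price_ts <- ts(90000 + cumsum(rnorm(60, mean = 100, sd = 50)),
               start = c(2015, 1), frequency = 12)

# trend(price_ts) adds a linear time trend,
# season(price_ts) adds monthly dummy variables
model3 <- dynlm(price_ts ~ L(price_ts) + trend(price_ts) + season(price_ts))
summary(model3)
```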

The p-values for the seasonal terms, the individual months, are greater than 0.05, so the months are not significant for price either.

This is consistent with the previous analysis (year and month are not significant).

I display the intercept and slope coefficients without the e+04 notation.

Finally, I display the slope coefficients of the three models side by side.

That's it for this time.

The next post is

www.crosshyou.info

To read from the first post,

www.crosshyou.info