A blog about doing things in R and reading books

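I practice R using data from e-Stat (the portal site for official Japanese government statistics), the OECD, UCI, and other sources. From time to time I also post reading notes.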

OECD Gender wage gap data analysis 1 - Load CSV file data into R

f:id:cross_hyou:20210829083908j:plain

Photo by Trevor McKinnon on Unsplash

f:id:cross_hyou:20210829083620p:plain

In this post, I will analyze the OECD gender wage gap data.

I downloaded the CSV data file from the OECD website, as shown below.

f:id:cross_hyou:20210829084202p:plain

I will use R to analyze this data.

First, I load the tidyverse packages.

f:id:cross_hyou:20210829084408p:plain

Then, I use the read_csv() function to load the CSV data into R.

f:id:cross_hyou:20210829084603p:plain
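The actual file name is only visible in the screenshot, so here is a minimal sketch of the loading step under stated assumptions. It writes a few made-up rows to a temporary CSV (the column names follow the variables checked in this post; the location codes, the MEASURE code and the numbers are hypothetical) and reads them back with read_csv().

```r
library(tidyverse)

# Hypothetical stand-in for the downloaded OECD file: a few made-up rows
# written to a temporary CSV so that the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes",
  "AAA,WAGEGAP,EMPLOYEE,M,A,2018,10.5,",
  "AAA,WAGEGAP,SELFEMPLOYED,M,A,2018,20.1,",
  "BBB,WAGEGAP,EMPLOYEE,M,A,2018,5.2,B"
), csv_path)

# read_csv() guesses the column types; empty fields become NA.
df <- read_csv(csv_path, show_col_types = FALSE)
df
```

With the real download, csv_path would simply be the path to the file saved from the OECD site.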

Let's check each variable.

First, LOCATION

f:id:cross_hyou:20210829084757p:plain

There are many locations. GBR has the most observations (65), and HRV has the fewest (3).
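One way to get counts like these is dplyr's count(). The sketch below uses a small made-up tibble, so the location codes and counts are illustrative only and are not the OECD figures.

```r
library(tidyverse)

# Made-up toy data; codes and values are illustrative only.
df <- tibble(
  LOCATION = c("AAA", "AAA", "AAA", "BBB", "CCC", "CCC"),
  Value    = c(10.5, 11.0, 9.8, 5.2, 7.7, 8.1)
)

# Count rows per location, largest group first.
loc_counts <- df %>% count(LOCATION, sort = TRUE)
loc_counts
```

The first row of loc_counts is the location with the most observations and the last row is the one with the fewest.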

INDICATOR

f:id:cross_hyou:20210829085034p:plain

INDICATOR has only one value, WAGEGAP, so I drop this variable from df.

f:id:cross_hyou:20210829085203p:plain
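As a rough sketch of this check-and-drop step (the exact code is in the screenshot, so this is only one possible way to write it), n_distinct() confirms that a column holds a single value and select() with a minus sign removes it. The toy rows are made up.

```r
library(tidyverse)

# Made-up toy data with a constant INDICATOR column.
df <- tibble(
  LOCATION  = c("AAA", "AAA", "BBB"),
  INDICATOR = c("WAGEGAP", "WAGEGAP", "WAGEGAP"),
  Value     = c(10.5, 20.1, 5.2)
)

# A column with one distinct value carries no information.
n_indicator <- n_distinct(df$INDICATOR)
n_indicator

# Drop the column by name.
df <- df %>% select(-INDICATOR)
names(df)
```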

SUBJECT

f:id:cross_hyou:20210829085404p:plain

SUBJECT has two values: EMPLOYEE and SELFEMPLOYED.

MEASURE

f:id:cross_hyou:20210829085705p:plain

There is only one value in MEASURE, so I will drop MEASURE.

f:id:cross_hyou:20210829085840p:plain

FREQUENCY

f:id:cross_hyou:20210829090011p:plain

FREQUENCY has only one value, A, so I will drop it from df.

f:id:cross_hyou:20210829090142p:plain

TIME

f:id:cross_hyou:20210829090234p:plain

TIME is numerical data. The minimum is 1970, the maximum is 2020, and the mean is 2007.
There are no NA values.
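For a numeric column, summary() gives the minimum, mean and maximum in one call, and sum(is.na()) counts missing values. A minimal sketch with made-up years (not the real TIME column):

```r
library(tidyverse)

# Made-up toy data; only the idea of the check matters here.
df <- tibble(TIME = c(1970, 1995, 2020))

# Five-number summary plus the mean.
summary(df$TIME)

# Number of missing values in the column.
n_missing <- sum(is.na(df$TIME))
n_missing
```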

Value

f:id:cross_hyou:20210829090421p:plain

Value is numerical data. There are no NA values. The minimum is -30.38 and the maximum is 63.20.

Flag Codes

f:id:cross_hyou:20210829090700p:plain

Flag Codes has only one value, B, so I will drop it.

f:id:cross_hyou:20210829090854p:plain
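In this post the constant columns are dropped one at a time. As an alternative sketch (not the method used in the screenshots), select() with where() can drop every column that has at most one distinct non-missing value in a single step. The toy data is made up, and I am assuming here that the entries of Flag Codes other than B are missing.

```r
library(tidyverse)

# Made-up toy data mirroring the columns discussed above.
df <- tibble(
  LOCATION     = c("AAA", "AAA", "BBB"),
  INDICATOR    = rep("WAGEGAP", 3),
  SUBJECT      = c("EMPLOYEE", "SELFEMPLOYED", "EMPLOYEE"),
  FREQUENCY    = rep("A", 3),
  TIME         = c(2018, 2018, 2019),
  Value        = c(10.5, 20.1, 5.2),
  `Flag Codes` = c(NA, NA, "B")   # assumption: blanks are NA
)

# Keep only columns with more than one distinct non-missing value.
df_small <- df %>%
  select(where(~ n_distinct(.x, na.rm = TRUE) > 1))
names(df_small)
```

Dropping columns by name, as done in the post, is more explicit; the where() version is handy when there are many columns to check.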

All right, let's look at df with the glimpse() function.

f:id:cross_hyou:20210829091027p:plain

Now we know that SUBJECT contains EMPLOYEE and SELFEMPLOYED.

Let's make two subset data frames, one for EMPLOYEE only and the other for SELFEMPLOYED only.

f:id:cross_hyou:20210829091510p:plain

Let's merge these two data frames with the inner_join() function.

f:id:cross_hyou:20210829091812p:plain
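The subsetting and merging steps can be sketched as follows. The names df_emp and df_self are hypothetical, the rows are made up, and joining by LOCATION and TIME is my assumption, inferred from the Value.x / Value.y and SUBJECT.x / SUBJECT.y columns that appear after the join.

```r
library(tidyverse)

# Made-up toy data in the shape of df after the constant columns are dropped.
df <- tibble(
  LOCATION = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  SUBJECT  = c("EMPLOYEE", "SELFEMPLOYED", "EMPLOYEE", "SELFEMPLOYED", "EMPLOYEE"),
  TIME     = c(2018, 2018, 2018, 2018, 2018),
  Value    = c(10.5, 20.1, 5.2, 15.3, 7.7)
)

# One subset per SUBJECT value (hypothetical object names).
df_emp  <- df %>% filter(SUBJECT == "EMPLOYEE")
df_self <- df %>% filter(SUBJECT == "SELFEMPLOYED")

# inner_join() keeps only location-year pairs present in both subsets.
# Non-key columns with the same name get the suffixes .x and .y.
df2 <- inner_join(df_emp, df_self, by = c("LOCATION", "TIME"))
df2
```

In this toy example CCC has no SELFEMPLOYED row, so it does not appear in df2.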

Let's rename Value.x to emp and Value.y to self. Let's also rename the other variables: LOCATION to country, SUBJECT.x to x, TIME to year, and SUBJECT.y to y.

f:id:cross_hyou:20210829092155p:plain

I will drop x and y.

f:id:cross_hyou:20210829092310p:plain
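A minimal sketch of the rename-then-drop steps, starting from a made-up data frame shaped like the join result. rename() takes new_name = old_name pairs, and select() with minus signs removes columns.

```r
library(tidyverse)

# Made-up toy data shaped like the result of the inner join.
df2 <- tibble(
  LOCATION  = c("AAA", "BBB"),
  SUBJECT.x = c("EMPLOYEE", "EMPLOYEE"),
  TIME      = c(2018, 2018),
  Value.x   = c(10.5, 5.2),
  SUBJECT.y = c("SELFEMPLOYED", "SELFEMPLOYED"),
  Value.y   = c(20.1, 15.3)
)

df2 <- df2 %>%
  # rename() uses new_name = old_name.
  rename(country = LOCATION, x = SUBJECT.x, year = TIME,
         emp = Value.x, y = SUBJECT.y, self = Value.y) %>%
  # x and y are constant within each subset, so they can go.
  select(-x, -y)
names(df2)
```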

Let's convert country to a factor.

f:id:cross_hyou:20210829092645p:plain
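The conversion can be written with mutate() and factor(), as in this sketch on made-up rows. One benefit is that summary() then reports a count per country instead of just saying the column is character.

```r
library(tidyverse)

# Made-up toy data; values are illustrative only.
df2 <- tibble(
  country = c("AAA", "AAA", "BBB"),
  year    = c(2018, 2019, 2018),
  emp     = c(10.5, 11.0, 5.2),
  self    = c(20.1, 19.4, 15.3)
)

# Convert the character column to a factor.
df2 <- df2 %>% mutate(country = factor(country))

# For a factor, summary() shows the number of rows per level.
summary(df2)
```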

 

All right.
Let's look at the summary of df2.

f:id:cross_hyou:20210829092745p:plain

We see that NZL has the most observations, and year ranges from 1998 to 2019. The minimum emp is -3.13 and the maximum emp is 23.5. The minimum self is -30.38 and the maximum self is 63.20.

That's it. Thank you!

The next post is...

 

www.crosshyou.info