
Photo by Trevor McKinnon on Unsplash

In this post, I will analyze OECD Gender wage gap data.
From the OECD web site, I downloaded the CSV data file like below.

I will use R to analyze this data.
First, I load tidyverse packages

Then, I use read_csv() function to load CDV data into R.

Let's check each variables.
First, LOCATION

There are many locations, GBR has the most observations, 65. HRV has the least observations, 3.
INDICATOR

INDICATORS ha only one value, WAGEGAP. so I drop this variable from df.

SUBJECT

For SUBJECT, there are two subjects, one is employee and the other is selfemployed.
MEASURE,

There is only one value in MEASURE, so I will drop MEASURE.

FREQUENCY

There is only one value:A in FREQUENCY, so I will drop FREQUENCY from df.

TIME

TIME is numerical data. The minimum is 1970, the maximum is 2020. Mean is 2007.
There is no NA.
Value

Value is numerical data. There is no NA. The minimum is -30.38, The maximum is 63.20.
Flag Codes

Flag Codes has only one value, B. So I will drop it.

All right, let's see df with glimpse() function.

Now, we know there are EMPLOYEE and SELFEMPLOYED in subject.
Let's make two subset data frame, one is for EMPLOYEE only and the other is SELFENPLOYED only.

Let's merge these two data fram with inner_join() function.

Let's change Value.x to emp, Value.y to self. Also, let's change other variables, LOCATION to country, SUBJECT.X to x, TIME to year, SUBJECT.y to y.

I will drop x and y.

Let's change country to factor type.

All right.
Let's see summary of df2.

We see NZL has the most observations. year starts from 1998 to 2019. The minimum emp is -3.13 and the maximum emp us 23.5. The minimum self is -30.38 and the maximum self is 63.20.
That's it. Thank you!
Next post is...