www.crosshyou.info

I practice R using data from the Japanese government statistics portal (e-Stat), the OECD, UCI, and other sources. I also post reading notes from time to time.

OECD Researchers data analysis 5 - Simple Linear Regression with one numerical variable in R, ModernDive way

Photo by Madara Parma on Unsplash

www.crosshyou.info

This post is a follow-up to the post above.
In this post, I will do a linear regression analysis.

To do this, I make a small subset data frame. Let's check which TIME values have the most observations.

2017 and 2015 have 36 observations each, so I filter TIME to keep only 2017 and 2015.
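These two steps can be sketched as follows; I assume the combined data frame from the previous post is named df (the name is my assumption):

```r
library(tidyverse)

# count observations per TIME, most frequent first
df %>% count(TIME, sort = TRUE)

# keep only the two years with the most observations
df_small <- df %>% filter(TIME %in% c(2015, 2017))
```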

Let's see this data frame with the glimpse() function.

I am interested in whether the researchers data is related to USD_CAP, or l_usd_cap.

So, first, let's check the correlations.
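A minimal way to check the correlations, assuming all the log variables share the l_ prefix:

```r
# pairwise correlations among the log variables
df_small %>%
  select(starts_with("l_")) %>%
  cor(use = "pairwise.complete.obs")
```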

I see l_tot_1000employed is the variable most strongly correlated with l_usd_cap.

So let's run a linear regression with l_usd_cap as the dependent variable and l_tot_1000employed as the independent variable.

Before doing the linear regression, let's look at a scatter plot.
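A sketch of the scatter plot:

```r
ggplot(df_small, aes(x = l_tot_1000employed, y = l_usd_cap)) +
  geom_point()
```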

I use the lm() function to make a linear regression object.

Then I would usually use the summary() function to see the regression results, but this time I loaded the moderndive package, so I will use moderndive's get_regression_table() function instead.
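A sketch of these two steps; model1 is my name for the fitted object:

```r
library(moderndive)

model1 <- lm(l_usd_cap ~ l_tot_1000employed, data = df_small)

# moderndive's tidy alternative to summary()
get_regression_table(model1)
```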

I see l_tot_1000employed's estimate is 0.456. This means that when l_tot_1000employed increases by 1, l_usd_cap increases by 0.456. Since both variables are natural logarithms, it also means that when TOT_1000EMPLOYED increases by 1%, USD_CAP increases by about 0.456%.

Let's add the regression line to the previous scatter plot. I use the geom_smooth() function.
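Adding the fitted line to the earlier plot might look like this:

```r
ggplot(df_small, aes(x = l_tot_1000employed, y = l_usd_cap)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight OLS line, no confidence band
```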

I referred to Statistical Inference via Data Science (moderndive.com).

That's it. Thank you!

 

The next post is

www.crosshyou.info

 



To read the first post,

www.crosshyou.info

OECD Researchers data analysis 4 - Sorting dataframe by column in R

Photo by Karsten Würth on Unsplash

www.crosshyou.info

This post is a follow-up to the post above.

In this post, let's sort the data frame by its variables.
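The sorting can be sketched with dplyr's arrange(), assuming the merged data frame is named df (the name is my assumption):

```r
library(tidyverse)

# smallest TOT_1000EMPLOYED observation
df %>%
  arrange(TOT_1000EMPLOYED) %>%
  select(LOCATION, TIME, TOT_1000EMPLOYED) %>%
  head(1)

# largest TOT_1000EMPLOYED observation
df %>%
  arrange(desc(TOT_1000EMPLOYED)) %>%
  select(LOCATION, TIME, TOT_1000EMPLOYED) %>%
  head(1)
```

The same pattern, with the variable name swapped, finds the extremes of the other columns.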

The smallest TOT_1000EMPLOYED observation is CHL 2009.

The largest TOT_1000EMPLOYED observation is FIN 2004.

The smallest WOMEN_PC_RESEARCHER observation is KOR 1997.

The largest WOMEN_PC_RESEARCHER observation is LVA 2008.

The smallest WOMEN_HEADCOUNT observation is LUX 2003.

The largest WOMEN_HEADCOUNT observation is GBR 2019.

The smallest TOT_HEADCOUNT observation is LUX 2003.

The largest TOT_HEADCOUNT observation is JPN 2020.

The smallest MLN_USD observation is ISL 1999.

The largest MLN_USD observation is JPN 2019.

The smallest USD_CAP observation is ROU 1995.

The largest USD_CAP observation is LUX 2019.

So, the LOCATIONs with at least one smallest or largest value are CHL, FIN, KOR, LVA, LUX, GBR, JPN, ISL and ROU.

That's it. Thank you!

Next post is

www.crosshyou.info

 


To read from the first post,

www.crosshyou.info

 

 

OECD Researchers data analysis 3 - 5 Named Graphs in R

Photo by Phong Nguyen on Unsplash

www.crosshyou.info

This post is a follow-up to the post above.
In this post I will create the five named graphs in R.

I refer to Chapter 2 Data Visualization | Statistical Inference via Data Science (moderndive.com)

The five named graphs are

#1 scatterplots

#2 linegraphs

#3 histograms

#4 boxplots 

#5 barplots

Let's start with #1, scatterplots.

Scatterplots show the relationship between two variables. The plot above shows TIME vs. l_tot_1000employed. In general, more recent years have larger l_tot_1000employed values.

The scatterplot above shows l_usd_cap vs. l_tot_1000employed. The larger l_usd_cap is, the larger l_tot_1000employed tends to be.
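Both scatterplots can be sketched like this, assuming the data frame is named df:

```r
library(tidyverse)

# scatterplot of TIME vs. l_tot_1000employed
ggplot(df, aes(x = TIME, y = l_tot_1000employed)) +
  geom_point()

# scatterplot of l_usd_cap vs. l_tot_1000employed
ggplot(df, aes(x = l_usd_cap, y = l_tot_1000employed)) +
  geom_point()
```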

The next of the five named graphs is #2, linegraphs.

The linegraph above shows TIME vs. l_women_pc_researcher. More recent TIME values have larger l_women_pc_researcher.
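A sketch of the linegraph; drawing one line per LOCATION is my assumption about how the lines are grouped:

```r
ggplot(df, aes(x = TIME, y = l_women_pc_researcher, group = LOCATION)) +
  geom_line()
```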

The third of the five named graphs is #3, histograms.

Histograms show a variable's distribution. The histogram above shows the distribution of l_women_headcount. Let's see some more histograms.

I added facet_wrap(~ TIME) so that I can see a histogram for each TIME.
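The faceted histogram might look like this:

```r
ggplot(df, aes(x = l_women_headcount)) +
  geom_histogram() +
  facet_wrap(~ TIME)  # one small histogram panel per TIME value
```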

The fourth of the five named graphs is #4, boxplots.

I see recent TIME has greater l_tot_headcount in general.
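A sketch of the boxplot:

```r
# TIME is numeric, so convert it to a factor to get one box per year
ggplot(df, aes(x = factor(TIME), y = l_tot_headcount)) +
  geom_boxplot()
```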

The last of the five named graphs is #5, barplots.

I see odd-numbered TIME values have more observations than even-numbered ones.

Let's see the number of observations by LOCATION.
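Both count barplots can be sketched like this; geom_bar() counts the rows in each category itself:

```r
# observations per TIME
ggplot(df, aes(x = factor(TIME))) +
  geom_bar()

# observations per LOCATION
ggplot(df, aes(x = LOCATION)) +
  geom_bar()
```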

HUN has the most observations.

That's it. Thank you!

Next post is

www.crosshyou.info

 

To read from the first post,

www.crosshyou.info

 



 

OECD Researchers data analysis 2 - Converting long format dataframe to wide format dataframe and merge two dataframes with R

Photo by Sakura on Unsplash

www.crosshyou.info

This post is a follow-up to the post above.

Let's explore gdp dataframe.

The gdp data frame has more LOCATION values than the researcher data frame.

The gdp data frame's INDICATOR has only one value, GDP, so I can remove it.

gdp SUBJECT has only one value, TOT, so I can remove it.

gdp MEASURE has two values, USD_CAP and MLN_USD, so I cannot remove it.

gdp FREQUENCY has only one value: A. I can remove it.

gdp TIME starts at 1960 and ends at 2021.

So far, I can remove INDICATOR, SUBJECT and FREQUENCY from gdp dataframe.

Let's see new dataframe.

Next, I will merge the two data frames, but before doing that, I convert each data frame into wide format.
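The reshaping might look like this with tidyr's pivot_wider(). That the wide column names such as TOT_1000EMPLOYED come from combining SUBJECT and MEASURE, and that the value column is named Value, are my assumptions about the layout:

```r
library(tidyverse)

researcher_wide <- researcher_v2 %>%
  pivot_wider(names_from = c(SUBJECT, MEASURE),
              values_from = Value)
```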

Let's see this new wide format dataframe.

I see AUT 1998 has 5.107570 for TOT_1000EMPLOYED, 18.79060 for WOMEN_PC_RESEARCHER, 5901 for WOMEN_HEADCOUNT and 31404 for TOT_HEADCOUNT.

Let's convert gdp_v2 to wide format too.

Let's see it.

AUS 1960 has 25073.26 for MLN_USD and 2412.765 for USD_CAP.

Finally, I can merge the two wide format data frames.
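A sketch of the merge; the wide data frame names and the use of an inner join are my assumptions:

```r
# keep only LOCATION/TIME pairs present in both data frames
df <- inner_join(researcher_wide, gdp_wide,
                 by = c("LOCATION", "TIME"))
```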

Let's see it.

Then, I convert LOCATION from character to factor type.

Let's see summary statistics with the summary() function.

I see all numeric variables are greater than zero, so I will make natural logarithm variables.
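The factor conversion and the log variables can be sketched together like this (shown here for two of the numeric columns; the l_ naming follows the variables used in later posts):

```r
df <- df %>%
  mutate(LOCATION = as.factor(LOCATION),
         l_tot_1000employed = log(TOT_1000EMPLOYED),  # natural log
         l_usd_cap = log(USD_CAP))

summary(df)
```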

Let's see log variables summary statistics.

Let's call it a day. Thank you!

 

The next post is

www.crosshyou.info

 



To read from the 1st post,

www.crosshyou.info

 

Reading note - "生物学探偵セオ・クレイ 森の捕食者" by Andrew Mayne, Hayakawa Mystery Bunko

The author is apparently a famous magician. What a wealth of talents.

The protagonist, Theo Cray, is a professor of biological engineering (I haven't checked the original text, but it is probably bioinformatics), a field that combines biology and computer science.

It was interesting, so I am planning to read the second book too.

OECD Researchers data analysis 1 - Load CSV file into R with read_csv() function.

Photo by Marek Piwnicki on Unsplash

In this post, I will analyze OECD Researchers data.

Researchers are professionals engaged in the conception or creation of new knowledge, products, processes, methods and systems, as well as in the management of the projects concerned. This indicator is measured per 1000 people employed and in number of researchers; the data are available as a total and broken down by gender.

From the OECD website, (Research and development (R&D) - Researchers - OECD Data), I get the CSV file like above.

I also get the GDP data from the OECD website, (GDP and spending - Gross domestic product (GDP) - OECD Data).

Let's analyze (play around with) the data in R!

Firstly, I load tidyverse package and moderndive package because I am reading Statistical Inference via Data Science (moderndive.com) and I am planning to follow this book.

Next, load data files.

I use glimpse() function to see data.
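These steps might look like this; the CSV file names are my assumptions:

```r
library(tidyverse)
library(moderndive)

# read the two OECD CSV files (file names assumed)
researcher <- read_csv("researcher.csv")
gdp <- read_csv("gdp.csv")

glimpse(researcher)
glimpse(gdp)
```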

Both data frames have the same variable names.

Let's check each variable.

I start with the researcher data frame's LOCATION.

researcher's INDICATOR

researcher's INDICATOR has only one value: RESEARCHER, so I can remove it.

researcher's FREQUENCY has only one value: A. I can remove it.

TIME runs from 1981 to 2020. 2011 has the most observations, 150.

All right, let's remove INDICATOR and FREQUENCY from the researcher data frame and rename the variables.
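A sketch of this step; the post does not show which names are changed, so the rename below is only an illustrative assumption:

```r
researcher_v2 <- researcher %>%
  select(-INDICATOR, -FREQUENCY) %>%
  rename(value = Value)  # the exact new names are my assumption
```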

Let's see researcher_v2 with glimpse() function.

All right. Let's call it a day. Thank you!

 

Next post is,

www.crosshyou.info

 

Analysis of the Economic Structure Survey data by prefecture 6 - Regression analysis with R's lm() function, and overlaying the regression line on a scatterplot with ggplot() + geom_point() + geom_abline()

Photo by Clement Souchet on Unsplash

www.crosshyou.info

This post is a continuation of the post above.

Last time I did an ANOVA analysis. This time I will try a regression analysis.

I will run a regression with pc_val, sales per person, as the dependent variable.

First, let's use p_male304050, the proportion of men in their 30s, 40s and 50s, as the explanatory variable.

Before running the regression, I draw a scatterplot. I loaded the tidyverse package last time, so this time I draw it with the ggplot() and geom_point() functions instead of the plot() function.

It looks like larger values of p_male304050 go with larger pc_val.

I run the regression with the lm() function and display the results with the summary() function.
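A sketch of this step; df2 is my assumed name for the prefecture data frame, and model_a for the fitted object:

```r
model_a <- lm(pc_val ~ p_male304050, data = df2)
summary(model_a)
```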

The coefficient on p_male304050 is 6.23, so when p_male304050 rises by 0.01, that is, when the proportion of men in their 30s, 40s and 50s rises by 1 percentage point, pc_val increases by 0.0623, i.e. sales per person increase by 6,230 yen.

The mean of pc_val is about 350,000 yen, so this seems like a fairly large effect.

I overlay the regression line from this model on the earlier scatterplot.
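Using geom_abline() as the post title suggests, the overlay might look like this (model_a and df2 are my assumed names):

```r
b <- coef(model_a)  # intercept and slope of the fitted line

ggplot(df2, aes(x = p_male304050, y = pc_val)) +
  geom_point() +
  geom_abline(intercept = b[1], slope = b[2])
```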

With p_male304050 alone, it is hard to explain pc_val. R2 is 0.009372, so p_male304050 explains only 0.9% of the variation in pc_val.

Let's add industry as an explanatory variable.

R2 became 0.8014. With industry included, about 80% of the variation in pc_val is explained.

The coefficients on hospi and selling are highly statistically significant.

I overlay the regression lines for hospi and selling.

The regression line for selling looks like it should have a steeper slope.

So I create a new industry classification with three categories: selling, hospi, and other.

Using the case_when() function, I map hospi to hospi, selling to selling, and everything else to other; I create a new variable named sector with the mutate() function and convert it to factor type with the as.factor() function.

I use this sector and p_male304050 as explanatory variables, and this time I also add interaction terms.
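These two steps can be sketched like this; df2 and the industry column name are my assumptions:

```r
library(tidyverse)

df2 <- df2 %>%
  mutate(sector = as.factor(case_when(
    industry == "hospi"   ~ "hospi",
    industry == "selling" ~ "selling",
    TRUE                  ~ "other"   # everything else
  )))

# sector * p_male304050 adds the main effects and the interaction terms
model_b <- lm(pc_val ~ sector * p_male304050, data = df2)
summary(model_b)
```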

When sector is hospi, the regression line is

pc_val = 1.730 - 4.2478 * p_male304050

Surprisingly, a higher proportion of men in their 30s, 40s and 50s means a lower pc_val.

 

When sector is other, the regression line is

pc_val = 1.730 - 2.2286 + (-4.2478 + 7.449) * p_male304050

A higher proportion of men in their 30s, 40s and 50s means a higher pc_val.

 

When sector is selling, the regression line is

pc_val = 1.730 - 8.8731 + (-4.2478 + 57.3695) * p_male304050

When the proportion of men in their 30s, 40s and 50s rises by 0.01, that is, by 1 percentage point, pc_val increases by as much as 530,000 yen.

 

I overlay these three regression lines on the scatterplot.

You can see clearly that selling has a steep slope.

That's it for this time.

To read from the first post,

www.crosshyou.info