www.crosshyou.info

I practice R using data from the Portal Site of Official Statistics of Japan (e-Stat), the OECD, and UCI. I also post reading notes from time to time.

Analysis of Industrial Water Usage Data by Prefecture 1 - Importing the Data into R.

Data Analysis

f:id:cross_hyou:20211103115124j:plain

Photo by Joanna Huang on Unsplash

In this post, I analyze industrial water usage data by prefecture.

I obtained the data from the Portal Site of Official Statistics of Japan, e-Stat (www.e-stat.go.jp).

f:id:cross_hyou:20211103115251p:plain

I select all 47 prefectures.

f:id:cross_hyou:20211103115314p:plain

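In addition to industrial water usage, I selected total population, gross prefectural product, manufacturing value added, number of manufacturing establishments, and number of manufacturing employees.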

f:id:cross_hyou:20211103115438p:plain

A CSV file in this format can be downloaded.

I read it into R and analyze it.

First, I load tidyverse, a handy collection of packages.

f:id:cross_hyou:20211103120526p:plain

Next, I read the data in the CSV file with the read_csv function.
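As a minimal sketch of this step, the following writes a tiny stand-in CSV to a temporary file and reads it back with read_csv. The file contents, the single header row, and the column names are assumptions for illustration; the real e-Stat download has its own layout and file name, which only the screenshot shows.

```r
library(tidyverse)

# Stand-in for the downloaded file: made-up values, NOT real e-Stat data.
# Column names follow the variable list given later in the post.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,code,pop,gdp,kachi,num,man,water",
  "2019100000,1000,5250000,19000000,1700000,5000,170000,900000",
  "2019100000,2000,1250000,4500000,500000,1300,57000,"
), csv_path)

# Read the file; an empty field is parsed as NA by default.
df_raw <- read_csv(csv_path)

print(df_raw)
```

The empty last field in the second data row shows how a missing water value would arrive as NA.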

f:id:cross_hyou:20211103120837p:plain

I inspect the loaded data with the str() function.

f:id:cross_hyou:20211103121028p:plain

The year1 and pref columns came out garbled (mojibake), so I delete them.
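Dropping the two columns could look roughly like this. The toy tibble and the name df_raw are hypothetical; only the column names year1 and pref come from the post.

```r
library(tidyverse)

# Toy data with made-up values; year1 and pref stand in for the garbled columns.
df_raw <- tibble(
  year1 = c("???", "???"),
  year  = c(2019100000, 2019100000),
  code  = c(1000, 2000),
  pref  = c("???", "???"),
  water = c(900000, 350000)
)

# Drop the two unreadable text columns and keep everything else.
df <- df_raw %>% select(-year1, -pref)

print(names(df))
```

If the Japanese labels are needed later, it may be worth trying the argument `locale = locale(encoding = "CP932")` in read_csv, on the assumption that the file is Shift-JIS encoded; the post does not state the encoding.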

f:id:cross_hyou:20211103121221p:plain

The year values look like 2019100000, so I subtract 100000 and then divide by 1000000 to get back the four-digit calendar year.
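The arithmetic can be checked on a couple of values. This is a sketch using mutate from dplyr; the post does not show which function the author actually used.

```r
library(tidyverse)

# Toy data: year codes of the form YYYY100000, as described above.
df <- tibble(year = c(2019100000, 2018100000))

# Subtract 100000, then divide by 1000000 to recover the four-digit year.
df <- df %>% mutate(year = (year - 100000) / 1000000)

print(df$year)  # 2019 2018
```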

f:id:cross_hyou:20211103121605p:plain

Let's look at the data with the str() function again.

f:id:cross_hyou:20211103121823p:plain

I remove the unneeded attributes.
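read_csv stores the column specification on the result as an attribute named spec, which is what str() prints at the end of its output. One way to remove it, sketched here with a temporary stand-in file, is to set that attribute to NULL. Whether the author used exactly this call is not visible in the post.

```r
library(tidyverse)

# Tiny stand-in file with made-up values.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,water", "2019,900000", "2018,350000"), csv_path)
df <- read_csv(csv_path)

# Attributes present right after reading.
print(names(attributes(df)))

# Remove the column specification. Some readr versions also attach a
# "problems" attribute; setting it to NULL is harmless if it is absent.
attr(df, "spec") <- NULL
attr(df, "problems") <- NULL

str(df)
```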

f:id:cross_hyou:20211103122050p:plain

Once more, I check the result with str().

f:id:cross_hyou:20211103122207p:plain

The extra attributes such as year = col_double() are gone.

I delete the rows where water is NA.
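A minimal sketch of the row filter, assuming filter from dplyr and a toy tibble with made-up values:

```r
library(tidyverse)

# Toy data: the second row has no water value.
df <- tibble(
  code  = c(1000, 2000, 3000),
  water = c(900000, NA, 350000)
)

# Keep only the rows where water was actually recorded.
df <- df %>% filter(!is.na(water))

print(df)
```

tidyr's drop_na(water) would be an equivalent alternative.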

f:id:cross_hyou:20211103122452p:plain

I look at df with the summary() function.

f:id:cross_hyou:20211103122801p:plain

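Here is what each variable means:

year: survey year

code: area code (identifies each prefecture)

pop: total population (persons)

gdp: gross prefectural product (2005 [Heisei 17] base, million yen)

kachi: manufacturing value added (million yen)

num: number of manufacturing establishments (establishments)

man: number of manufacturing employees (persons)

water: industrial water usage (m3/day)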

Let's look at a histogram of the industrial water usage data.

f:id:cross_hyou:20211103123540p:plain

f:id:cross_hyou:20211103123553p:plain

A log transformation looks like the better choice.
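The before-and-after comparison can be sketched with base R's hist(). The values below are simulated, not the e-Stat data, and the name l_water is hypothetical; the post does not show whether the author used hist() or ggplot2.

```r
# Simulated right-skewed values (made up), standing in for df$water.
set.seed(1)
water <- rlnorm(200, meanlog = 12, sdlog = 1.2)

# Raw scale: most values pile up on the left with a long right tail.
hist(water, main = "water", xlab = "water (m3/day)")

# Natural-log scale: the same values are spread far more evenly.
l_water <- log(water)
hist(l_water, main = "log(water)", xlab = "log(water)")
```

The simulated values have a single peak on the log scale, so this sketch does not reproduce the shape of the real data.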

f:id:cross_hyou:20211103124014p:plain

f:id:cross_hyou:20211103124023p:plain

The distribution appears to have two peaks.

Let's also draw histograms for pop and the other variables.

f:id:cross_hyou:20211103124220p:plain

f:id:cross_hyou:20211103124229p:plain

Total population also looks better after a log transformation.

f:id:cross_hyou:20211103124405p:plain

f:id:cross_hyou:20211103124415p:plain

Here is the histogram of gdp.

f:id:cross_hyou:20211103124536p:plain

f:id:cross_hyou:20211103124550p:plain

gdp also looks better after a log transformation.

f:id:cross_hyou:20211103124743p:plain

f:id:cross_hyou:20211103124756p:plain

Here is the histogram of kachi.

f:id:cross_hyou:20211103124932p:plain

f:id:cross_hyou:20211103124942p:plain

kachi is better log-transformed as well.

f:id:cross_hyou:20211103125136p:plain

f:id:cross_hyou:20211103125146p:plain

Here is the histogram of num.

f:id:cross_hyou:20211103125257p:plain

f:id:cross_hyou:20211103125306p:plain

As expected, num also looks better after a log transformation.

f:id:cross_hyou:20211103125447p:plain

f:id:cross_hyou:20211103125456p:plain

In the end, every variable examined here comes closer to a normal distribution after a log transformation.
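Rather than plotting each variable by hand, the raw and log histograms could be drawn side by side in a loop. This is only a sketch on simulated columns; the values are made up and the layout choices are assumptions.

```r
# Toy data frame with simulated, right-skewed columns (not the real data).
set.seed(2)
df <- data.frame(
  pop   = rlnorm(100, meanlog = 14, sdlog = 0.8),
  gdp   = rlnorm(100, meanlog = 15, sdlog = 0.9),
  kachi = rlnorm(100, meanlog = 13, sdlog = 1.0),
  num   = rlnorm(100, meanlog = 8,  sdlog = 0.9)
)

# One row per variable: raw histogram on the left, log histogram on the right.
old_par <- par(mfrow = c(4, 2), mar = c(3, 3, 2, 1))
for (v in names(df)) {
  hist(df[[v]], main = v, xlab = "")
  hist(log(df[[v]]), main = paste0("log(", v, ")"), xlab = "")
}
par(old_par)
```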

The original variables presumably followed a log-normal distribution.
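A variable is log-normal when its logarithm is normally distributed. A small simulation illustrates the idea: values drawn with rlnorm are strongly right-skewed, while their logs are roughly symmetric. The helper skewness below is a hypothetical function written for this sketch, not part of base R.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

set.seed(3)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)  # simulated log-normal sample

print(skewness(x))       # clearly positive: long right tail
print(skewness(log(x)))  # close to zero: roughly symmetric
```

This only illustrates the definition; it is not a formal test that the prefecture data are log-normal.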

The log-normal distribution is explained in detail in the following book:

統計分布を知れば世界が分かる-身長・体重から格差問題まで (中公新書)

Author: 松下貢
Publisher: 中央公論新社

That's all for this post.

The next post is:

www.crosshyou.info