OECD Total official and private flows data analysis 4 - Classification using R's tree package.

Generated by Bing Image Creator: Picture with lots of red maple leaves

This post follows on from the previous post in this series.

In the previous post, I made a dummy variable 'neg', which indicates whether a LOCATION has a negative value or not.

In this post, let's do classification.
First, I make a scatter plot of TIME and Value, colored by 'neg'.
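
A minimal sketch of such a scatter plot, using base R graphics. The df_main built here is a synthetic stand-in (an assumption for illustration, not the OECD data from the earlier posts); only the column names TIME, Value and neg come from this series, and the colors are my own choice.

```r
# Hypothetical stand-in for df_main; the real data comes from the earlier posts.
set.seed(42)
n <- 400
Value <- rnorm(n, mean = 1000, sd = 500)
df_main <- data.frame(
  TIME  = sample(1960:2020, n, replace = TRUE),
  Value = Value,
  # neg is made more likely for small Value so the plot shows some structure
  neg   = rbinom(n, size = 1, prob = ifelse(Value < 800, 0.7, 0.2))
)

# Scatter plot of Value against TIME, colored by the 'neg' dummy
plot(df_main$TIME, df_main$Value,
     col = ifelse(df_main$neg == 1, "red", "blue"),
     pch = 19, xlab = "TIME", ylab = "Value")
legend("topleft", legend = c("neg = 0", "neg = 1"),
       col = c("blue", "red"), pch = 19)
```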

I divided df_main into two data frames: one for training and the other for testing.

Let's see the summary statistics of the two data frames.

I see that the mean of 'neg' is 0.3889 in both, and TIME and Value have similar statistics. So I think df_main is well divided into two data frames.
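
The split and the check can be sketched as follows. This assumes a random 50/50 split with sample(), which may differ from the split actually used here, and it runs on a synthetic stand-in for df_main, so the numbers it prints will not match 0.3889.

```r
# Hypothetical stand-in for df_main (same column names as in this series)
set.seed(42)
n <- 400
Value <- rnorm(n, mean = 1000, sd = 500)
df_main <- data.frame(
  TIME  = sample(1960:2020, n, replace = TRUE),
  Value = Value,
  neg   = rbinom(n, size = 1, prob = ifelse(Value < 800, 0.7, 0.2))
)

# Assumed approach: pick half of the row indices at random for training
train_idx   <- sample(seq_len(n), size = floor(n / 2))
df_training <- df_main[train_idx, ]
df_testing  <- df_main[-train_idx, ]

# Compare the summary statistics of the two halves
summary(df_training)
summary(df_testing)

# The mean of a 0/1 dummy is the share of rows with neg = 1
neg_share <- c(training = mean(df_training$neg),
               testing  = mean(df_testing$neg))
print(round(neg_share, 4))
```

If the two shares differ a lot, the split is unbalanced and the accuracy on the testing data becomes harder to interpret.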

All right, let's do classification. According to the scatter plot above, a linear model does not look like a good fit, so I use a tree model from the 'tree' package.
I referred to R Tree Package | How does the Tree Package work? (educba.com)

First, I load the 'tree' package.

Then, I use the tree() function.

Then, I plot the results.
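
The three steps above (load, fit, plot) might look like the sketch below. The formula neg ~ TIME + Value is my assumption about which predictors are used, and df_training is again a synthetic stand-in. Note that tree() fits a classification tree only when the response is a factor, so 'neg' is converted first.

```r
library(tree)

# Hypothetical stand-in for df_training
set.seed(42)
n <- 300
Value <- rnorm(n, mean = 1000, sd = 500)
df_training <- data.frame(
  TIME  = sample(1960:2020, n, replace = TRUE),
  Value = Value,
  neg   = rbinom(n, size = 1, prob = ifelse(Value < 800, 0.7, 0.2))
)

# A factor response makes tree() build a classification tree
df_training$neg <- factor(df_training$neg)

# Assumed formula: predict neg from TIME and Value
fit <- tree(neg ~ TIME + Value, data = df_training)
summary(fit)

# Draw the tree, then add the split rules and leaf labels
plot(fit)
text(fit, pretty = 0)
```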

Let's predict with the df_testing data.

So, let's see how good or bad the prediction is.
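
A sketch of the prediction step and the contingency table, under the same assumptions as before: synthetic data, a random 50/50 split, and the formula neg ~ TIME + Value. With type = "class", predict() returns the predicted class label for each row instead of class probabilities. The counts printed here come from the synthetic data, not from the OECD data.

```r
library(tree)

# Hypothetical stand-in for df_main, with neg already stored as a factor
set.seed(42)
n <- 400
Value <- rnorm(n, mean = 1000, sd = 500)
df_main <- data.frame(
  TIME  = sample(1960:2020, n, replace = TRUE),
  Value = Value,
  neg   = factor(rbinom(n, size = 1, prob = ifelse(Value < 800, 0.7, 0.2)))
)
train_idx   <- sample(seq_len(n), size = floor(n / 2))
df_training <- df_main[train_idx, ]
df_testing  <- df_main[-train_idx, ]

# Fit on the training half, predict class labels for the testing half
fit  <- tree(neg ~ TIME + Value, data = df_training)
pred <- predict(fit, newdata = df_testing, type = "class")

# Rows: predicted class, columns: actual class
conf_mat <- table(predicted = pred, actual = df_testing$neg)
print(conf_mat)
```

The diagonal cells of the table count the correct predictions, and the off-diagonal cells count the wrong ones.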

The contingency table above shows that the tree model predicts 108 + 77 = 185 observations correctly and 28 + 57 = 85 wrongly.
So the accuracy is 185 / (185 + 85) = 68.5%.

If I predict neg = 0 for every observation in df_testing, the accuracy is 1 - 0.3889 = 61.1%.
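
The arithmetic can be checked in R with the counts quoted above. The variable names are my own; the numbers are taken from this post.

```r
# Counts from the contingency table reported in this post
correct <- 108 + 77   # correctly predicted observations
wrong   <- 28 + 57    # wrongly predicted observations

# Accuracy of the tree model on df_testing
accuracy <- correct / (correct + wrong)

# Baseline: always predict neg = 0, which is right whenever neg is 0
baseline <- 1 - 0.3889

# Both as percentages, rounded to one decimal: 68.5 and 61.1
round(100 * c(tree = accuracy, baseline = baseline), 1)
```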

So, the tree model is better than always predicting neg = 0.

That's it. Thank you!