M2M – May – Update #3

This week I continued with my machine learning in R coursework, specifically the Supervised Learning module. The topics were decision trees and random forests.

Decision trees function like chains of “if-else” conditions used to make classification decisions. They are a good choice when the creator wants transparency about what went into a decision, such as why a loan was approved or denied. They work by repeatedly splitting the observations into groups so that each group in a split is as homogeneous as possible (a property called “purity”). The splits are axis-parallel, so unfortunately a single split cannot use a combination of variables.
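To make “purity” concrete, here is a small sketch of my own (not from the course) of the Gini impurity, which rpart uses by default for classification splits: a group with identical labels scores 0, and a 50/50 mix of two classes scores 0.5.

# Gini impurity: 1 - sum(p^2), where p are the class proportions in a group
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions within the group
  1 - sum(p^2)
}

gini(c("repaid", "repaid", "repaid", "repaid"))    # 0   -> perfectly pure
gini(c("repaid", "default", "repaid", "default"))  # 0.5 -> as impure as two classes get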

Over-fitting occurs when a model fits the training data much better than it fits the test data. This happens when the model learns the quirks of the training data so well that what it learned adds no value on unseen data. That’s why it is so important to split the data into training and test sets, then measure the model’s accuracy on both.

Pruning is the idea of reducing the number of branches in a decision tree. This can be done either beforehand (pre-pruning) or afterwards (post-pruning). Pre-pruning means either: a) stop splitting at a certain depth, or b) stop splitting when a branch has fewer than X observations. Post-pruning removes branches based on the complexity parameter (cp) of the fitted model.
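Both pre-pruning rules map directly onto rpart.control arguments (maxdepth and minsplit are real rpart options; the specific values here, and reusing the loans data from the examples below, are just my illustration):

library(rpart)

# Pre-pruning: stop at depth 5, and don't attempt to split nodes with fewer than 100 observations
loan_model_prepruned <- rpart(outcome ~ ., data = loans, method = "class",
                              control = rpart.control(maxdepth = 5, minsplit = 100))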

Random forests take decision trees and vary the features and observations used to build each tree. Doing this many times creates teamwork among the trees, called an ensemble (there is a sketch of this at the end of the code below).

Here is some R to do what I describe above:

  • Fit model

library(rpart)
# Classification tree; cp = 0 lets the tree grow fully, with no complexity penalty
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans,
                    method = "class", control = rpart.control(cp = 0))

  • Predict outcomes for sample observation

predict(loan_model, good_credit, type = "class")

  • Graph tree

library(rpart.plot)
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

  • Get train and test datasets

# Determine the number of rows for training (75% of the data)
nrow(loans) * 0.75

# Create a random sample of row IDs for the training set
sample_rows <- sample(nrow(loans), 0.75 * nrow(loans))  # loans has 11,312 rows

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset
loans_test <- loans[-sample_rows, ]

  • Grow a full tree on the training data and get its accuracy on the test data

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class",
                    control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)

# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
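As a quick over-fitting check (my own addition), the same accuracy can be computed on the training data; a large gap between the two numbers is the warning sign described earlier.

# Compute the accuracy on the training dataset for comparison
train_pred <- predict(loan_model, loans_train, type = "class")
mean(train_pred == loans_train$outcome)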

  • Plot complexity

# Plot cross-validated error for each value of the complexity parameter
plotcp(loan_model)

  • Prune tree

# Snip off branches whose splits aren't worth at least cp = 0.0014
loan_model_pruned <- prune(loan_model, cp = 0.0014)
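  • Fit a random forest

This last one is a sketch of my own using the randomForest package (not code from the course), assuming the same loans_train/loans_test split as above, that outcome is a factor, and that the data has no missing values (randomForest, unlike rpart, won't handle NAs by default):

library(randomForest)

# Each tree is grown on a bootstrap sample of rows, and each split
# considers only a random subset of the features
loan_forest <- randomForest(outcome ~ ., data = loans_train, ntree = 500)

# Score the ensemble on the test data, just like the single tree
forest_pred <- predict(loan_forest, loans_test)
mean(forest_pred == loans_test$outcome)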

That’s all for this week. Happy Memorial Day.
