Decision Trees and Random Forests
In this tutorial we build a machine learning model based on decision trees, and then another based on random forests.
First, import the libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Next, load the data with the pandas read_csv method and get a sense of what it is about. The data comes from lendingclub.com, where it is publicly available.
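As a minimal sketch of the loading step: the snippet below reads a tiny in-memory stand-in instead of the real file, so it runs without the download. The three rows are made up, only a few of the columns are included, and both the dotted target name `not.fully.paid` and the filename in the comment are assumptions.

```python
import io

import pandas as pd

# Stand-in for the real file: three made-up rows and a subset of the columns.
# With the actual download you would pass its path instead, for example
# pd.read_csv("loan_data.csv")  (this filename is an assumption).
sample_csv = io.StringIO(
    "credit.policy,purpose,int.rate,fico,not.fully.paid\n"
    "1,debt_consolidation,0.1189,737,0\n"
    "1,credit_card,0.1071,707,0\n"
    "0,small_business,0.1357,682,1\n"
)

df = pd.read_csv(sample_csv)

print(df.head())   # first rows of the dataframe
print(df.shape)    # (number of rows, number of columns)
```

Calling df.head() right after loading is a quick way to confirm that the column names and value types match the descriptions.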
There are 14 columns in the data. The table below describes 13 of them; the remaining one is the target variable, which we come to further down.
col_name | description |
credit.policy | 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise. |
purpose | The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”). |
int.rate | The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates |
installment | The monthly installments owed by the borrower if the loan is funded. |
log.annual.inc | The natural log of the self-reported annual income of the borrower. |
dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income) |
fico | The FICO credit score of the borrower |
days.with.cr.line | The number of days the borrower has had a credit line. |
revol.bal | The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle). |
revol.util | The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available). |
inq.last.6mths | The borrower’s number of inquiries by creditors in the last 6 months. |
delinq.2yrs | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
pub.rec | The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments). |
Next, check the data for missing values with df.info().
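A small sketch of the missing-value check, run on a made-up three-row dataframe rather than the real data. df.info() prints the non-null count per column, and df.isnull().sum() gives the same information as numbers you can inspect.

```python
import pandas as pd

# Made-up rows standing in for the loan data (illustration only).
df = pd.DataFrame({
    "int.rate": [0.1189, 0.1071, 0.1357],
    "fico": [737, 707, 682],
    "purpose": ["debt_consolidation", "credit_card", "small_business"],
})

# Prints each column's dtype and how many of its values are non-null.
df.info()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```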
Yay! This dataframe has zero null values. Now let us divide the data into the target variable (y) and the independent variables (X); here the target variable is “not fully paid”. One column, purpose, contains strings, which the model cannot be trained on directly, so it has to be converted to numbers first.
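One common way to handle the string column is one-hot encoding with pd.get_dummies; this is a sketch of that approach, not necessarily the exact step used for the results here. The rows are made up and the dotted target name `not.fully.paid` is an assumption.

```python
import pandas as pd

# Made-up rows standing in for the loan data (illustration only).
df = pd.DataFrame({
    "purpose": ["credit_card", "educational", "credit_card", "all_other"],
    "fico": [737, 707, 682, 712],
    "not.fully.paid": [0, 0, 1, 0],
})

# Replace the string column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["purpose"])

# X holds the independent variables, y the target.
X = encoded.drop("not.fully.paid", axis=1)
y = encoded["not.fully.paid"]

print(X.columns.tolist())
```

After this step every column in X is numeric, so it can be passed to a scikit-learn model.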
Now split the data so that 70% of it is used for training and the remaining 30% is held back for testing.
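A sketch of the 70/30 split using scikit-learn's train_test_split. The arrays are random placeholders for X and y, and the random_state value is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data: 100 rows, 4 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# test_size=0.30 keeps 30% for testing, so 70% is left for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

print(X_train.shape, X_test.shape)
```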
Now let us import the decision tree classifier and train the model on the training set.
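A minimal sketch of the training step with scikit-learn's DecisionTreeClassifier. It uses synthetic data from make_classification in place of the loan data so that it runs on its own; the sample count, feature count, and random_state values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# Create the tree with default settings and fit it on the training set only.
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)

print("depth:", dtree.get_depth(), "leaves:", dtree.get_n_leaves())
```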
Now let us predict on the test set and evaluate the results.
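The evaluation can be sketched with classification_report and confusion_matrix from sklearn.metrics. As before, synthetic data stands in for the loan data, so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a fitted tree (illustration only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)
dtree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict on rows the model has not seen during training.
predictions = dtree.predict(X_test)

# Precision, recall, and f1-score per class, plus overall accuracy.
report = classification_report(y_test, predictions)
# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_test, predictions)

print(report)
print(matrix)
```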
We have successfully built a model based on a decision tree, but its accuracy is only average. Now let us build a model based on random forests.
Here we pass an integer to RandomForestClassifier (the n_estimators parameter). It sets the number of decision trees in the random forest, as we discussed in the definition of random forests.
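A sketch of the random forest step. The value 100 for n_estimators is an assumption chosen for illustration, since the number used for the results here is not stated, and the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustration only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# n_estimators is the number of decision trees in the forest.
# 100 is an assumed value, not necessarily the one used in this tutorial.
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)

# After fitting, the individual trees are available in rfc.estimators_.
print(len(rfc.estimators_))
```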
Now let us evaluate the random forest model on the same test set.
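One way to put the two models side by side is to compute accuracy and F1 score for each on the same test set. This sketch uses synthetic data and assumed settings (100 trees, fixed random_state), so its numbers are not the ones from the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same training set and score it on the same test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }
    print(name, scores[name])
```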
Comparing the two models, we can see a good improvement in the f1 score and accuracy with the random forest.
This is the end of this tutorial about Decision Trees and Random Forests.
In the upcoming tutorials, we will learn about Support Vector Machines.