Decision Trees and Random Forests

In this tutorial we build machine learning models based on decision trees and random forests.

First of all, import the libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The next step is to load our data with the pandas read_csv method and then understand what the data is about. We got this data from lendingclub.com, where it is publicly available.
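As a rough sketch of the loading step, the snippet below reads a tiny CSV and takes a first look at it. The rows are made up purely for illustration (they are not real LendingClub records), the file name in the comment is hypothetical, and the spelling of the target column `not.fully.paid` is an assumption.

```python
import io
import pandas as pd

# Made-up rows for illustration only; NOT real LendingClub data.
# The target column spelling "not.fully.paid" is an assumption.
sample_csv = io.StringIO(
    "credit.policy,purpose,int.rate,fico,not.fully.paid\n"
    "1,debt_consolidation,0.11,720,0\n"
    "0,credit_card,0.15,660,1\n"
    "1,all_other,0.09,750,0\n"
)

# With the real file you would pass its path instead,
# e.g. pd.read_csv("loan_data.csv") (hypothetical file name).
df = pd.read_csv(sample_csv)

print(df.head())   # first few rows
print(df.shape)    # (number of rows, number of columns)
```

Calling head() and shape right after loading is a quick way to confirm that the columns were parsed the way you expect.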

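There are 14 columns in the data. Let us see what they mean. The table below describes 13 of them; the remaining column is the target variable, which we come to later.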

col_name	description
credit.policy	1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose	The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate	The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment	The monthly installments owed by the borrower if the loan is funded.
log.annual.inc	The natural log of the self-reported annual income of the borrower.
dti	The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico	The FICO credit score of the borrower.
days.with.cr.line	The number of days the borrower has had a credit line.
revol.bal	The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util	The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths	The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs	The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec	The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Next, check the data for missing values with df.info().
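A minimal sketch of that check, assuming a small stand-in dataframe with made-up values: df.info() prints the non-null count and dtype of every column, and isnull().sum() gives the number of missing values per column directly.

```python
import pandas as pd

# Small stand-in dataframe with made-up values, used only to show the calls.
df = pd.DataFrame({
    "purpose": ["credit_card", "educational", "all_other"],
    "int.rate": [0.11, 0.13, 0.09],
    "fico": [720, 680, 750],
})

# Prints one line per column with its non-null count and dtype.
df.info()

# Explicit count of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

total_missing = int(missing_per_column.sum())
print("total missing values:", total_missing)
```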

Yay! This dataframe has zero null values. Now let us divide the data into the target variable (y) and the other, independent variables. Here our target variable is “not fully paid”. But there is one column, purpose, that contains strings, and the model cannot be trained on strings directly, so that column has to be converted to numbers first.
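One common way to handle the string column is one-hot encoding with pd.get_dummies, followed by splitting off the target. The sketch below assumes this approach (the original post does not say which encoding it used), uses made-up rows, and assumes the target column is spelled `not.fully.paid`.

```python
import pandas as pd

# Made-up rows; the column name "not.fully.paid" is an assumption.
df = pd.DataFrame({
    "purpose": ["credit_card", "debt_consolidation", "educational", "credit_card"],
    "int.rate": [0.11, 0.13, 0.09, 0.15],
    "fico": [720, 680, 750, 660],
    "not.fully.paid": [0, 1, 0, 1],
})

# Replace the string column with 0/1 indicator columns.
# drop_first=True drops one category to avoid a redundant column.
encoded = pd.get_dummies(df, columns=["purpose"], drop_first=True)

# Independent variables (X) and target variable (y).
X = encoded.drop("not.fully.paid", axis=1)
y = encoded["not.fully.paid"]

print(X.columns.tolist())
```

After this step every column in X is numeric, so it can be passed to a scikit-learn estimator.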

Now we split the data, keeping 70% of it for training and holding back the remaining 30% for testing.
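The 70/30 split can be sketched with scikit-learn's train_test_split. The arrays here are placeholders and the random_state value is an arbitrary choice so that the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 rows with 2 features each, and a 0/1 target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.3 holds back 30% of the rows, leaving 70% for training.
# random_state is an arbitrary seed, chosen only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(len(X_train), len(X_test))
```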

Now let us import the decision tree class and train the model.
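A minimal sketch of the training step, assuming scikit-learn's DecisionTreeClassifier with default settings. Synthetic data from make_classification stands in for the loan data so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (NOT the real dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Fit a single decision tree on the training portion only.
dtree = DecisionTreeClassifier(random_state=101)
dtree.fit(X_train, y_train)

# Predict labels for the held-back test rows.
predictions = dtree.predict(X_test)
print(predictions[:10])
```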

Now let us make predictions on the test data and evaluate the results.
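For the evaluation, a typical choice is scikit-learn's classification_report together with confusion_matrix. The labels below are made up by hand so the numbers are easy to follow; in the tutorial they would be the true test labels and the model's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made example labels (NOT results from the loan data).
y_test = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 1, 0, 1]

# Precision, recall, and F1 score per class, plus overall accuracy.
report = classification_report(y_test, predictions)
print(report)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

Here three of the four true 0s and one of the two true 1s are predicted correctly, so the confusion matrix is [[3, 1], [1, 1]].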

We have successfully built a decision tree model, though its accuracy is only average. Now let us build a model based on random forests.

Here we pass an integer to RandomForestClassifier as its n_estimators parameter. It sets the number of decision trees in the random forest, as we discussed in the definition of random forests.
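To make the role of that integer concrete, here is a small sketch on synthetic data. The value 100 is only an example, not necessarily the number used in the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (NOT the real dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# n_estimators is the number of trees in the forest; 100 is an example value.
rfc = RandomForestClassifier(n_estimators=100, random_state=101)
rfc.fit(X_train, y_train)

# After fitting, the individual trees are available in rfc.estimators_.
print(len(rfc.estimators_))

rfc_predictions = rfc.predict(X_test)
```

More trees usually give more stable predictions at the cost of longer training time.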

Now let us evaluate the random forest model in the same way.

Comparing the two models, we can see that the random forest gives a good improvement in F1 score and accuracy over the single decision tree.
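One way to put the two models side by side is to compute the same metrics for both on the same test set. The sketch below does this on synthetic data, so the numbers it prints say nothing about the loan data, and on other data the forest is not guaranteed to come out ahead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (NOT the real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=101),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=101),
}

# Train each model on the same split and score it on the same test rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }
    print(name, scores[name])
```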

This is the end of this tutorial about Decision Trees and Random Forests.

In the upcoming tutorials, we will learn about Support Vector Machines.
