
Import tools

We are importing all the necessary libraries.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Constructing a model (algorithm)

A DTC (Decision Tree Classifier) is built from two kinds of nodes:

  1. decision nodes, which contain:

    1. a condition defined by feature_index and a threshold value for that feature

      feature_index is the index of the feature used for the split; the threshold is the value of that feature against which the other feature values are compared.

    2. left and right, references to the left and right child.

    3. info_gain - a variable that stores the information gained by the split.
  2. leaf nodes, which contain: value - the majority class of the leaf node. It lets us determine the class of a data point if it ends up in this particular leaf node.

First we will define a Node class, and then a Tree class that holds all the methods we can perform on our tree. The Tree class essentially lets us build the tree from the splits of the data that we perform; each split and its results (left and right child) take the form of the Node we have defined.

Node class

In [26]:
class Node():
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None, curr_depth = 0):
        ''' constructor ''' 
        
        #for decision node
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain #this variable stores the information gained by the split denoted
                                   #by this particular decision node
        
        self.curr_depth = curr_depth #this variable holds the current depth of a node. It applies to both types of nodes.
        
        #for leaf node
        self.value = value #majority class of the leaf node...
                           #it will help us determine the class of a new data point
                           #if the data point ends up in this particular leaf node
        

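To make the two kinds of nodes concrete, here is a small illustrative cell (just a sketch, not needed by the algorithm itself) showing how a leaf node and a decision node would be built by hand:

In [ ]:
# a leaf node only needs a value - the majority class it predicts
leaf_left = Node(value=0, curr_depth=1)
leaf_right = Node(value=1, curr_depth=1)

# a decision node stores the split condition (feature_index, threshold),
# the information gain of that split, and references to its two children
decision = Node(feature_index=2, threshold=1.9, left=leaf_left, right=leaf_right, info_gain=0.33, curr_depth=0)
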
Tree class

In [27]:
class DecisionTreeClassifier():
    def __init__(self, min_samples_split=2, max_depth=2):
        ''' constructor '''
        
        # initialize the root of the tree 
        self.root = None
        
        # stopping conditions
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        
        #if in a particular node the number of samples becomes less than 
        #min_samples_split we won't split that node any further, we will
        #treat that node as a leaf node. Same goes for max_depth.

                    
    #MOST IMPORTANT FUNCTION - recursive function for building the binary tree.
    #It takes the dataset as input and performs the best split of it, creating a left and a right child,
    #each of which is either a pure leaf node (a node containing data points of only one class)
    #or a node with the remaining data and the condition that performs further splits of the data in that node.

    def build_tree(self, dataset, curr_depth=0):
        ''' recursive function to build the tree ''' 
        
        #splitting the dataset into two separate variables, one containing the features and the other containing the classes.
        X, Y = dataset[:,:-1], dataset[:,-1]
        
        #extracting the number of samples and the number of features 
        num_samples, num_features = np.shape(X)
        
        # split until stopping conditions are met
        if num_samples>=self.min_samples_split and curr_depth<=self.max_depth:
            # find the best split
            best_split = self.get_best_split(dataset, num_samples, num_features)
            # check if information gain is positive
            if best_split["info_gain"]>0:
                # recur left
                left_subtree = self.build_tree(best_split["dataset_left"], curr_depth+1)
                # recur right
                right_subtree = self.build_tree(best_split["dataset_right"], curr_depth+1)
                # return decision node
                return Node(best_split["feature_index"], best_split["threshold"], 
                            left_subtree, right_subtree, best_split["info_gain"], curr_depth=curr_depth)
        
        # compute leaf node
        leaf_value = self.calculate_leaf_value(Y)
        # return leaf node
        return Node(value=leaf_value, curr_depth=curr_depth)
    
    
    #SECOND MOST IMPORTANT FUNCTION
    def get_best_split(self, dataset, num_samples, num_features):
        ''' function to find the best split '''
        
        # dictionary to store the best split
        best_split = {}
        max_info_gain = -float("inf")
        
        # loop over all the features
        for feature_index in range(num_features):
            feature_values = dataset[:, feature_index]
            
            possible_thresholds = np.unique(feature_values)
            # loop over all the unique feature values present in the data
            for threshold in possible_thresholds:
                # get current split
                dataset_left, dataset_right = self.split(dataset, feature_index, threshold)
                # check that the children are not empty
                if len(dataset_left)>0 and len(dataset_right)>0:
                    
                    #extracting the classes of the dataset before the split, as well as the classes of the
                    #left and right child after the split (we use these arrays for computing information gain)
                    y, left_y, right_y = dataset[:, -1], dataset_left[:, -1], dataset_right[:, -1]
                    
                    #compute information gain
                    curr_info_gain = self.information_gain(y, left_y, right_y, "gini")
                    
                    #update the best split (dictionary) if needed (we will update it if the current information gain 
                    #is greater than the previous one.)
                    if curr_info_gain>max_info_gain:
                        best_split["feature_index"] = feature_index
                        best_split["threshold"] = threshold
                        best_split["dataset_left"] = dataset_left
                        best_split["dataset_right"] = dataset_right
                        best_split["info_gain"] = curr_info_gain
                        max_info_gain = curr_info_gain
                        
        # return best split
        return best_split
    
    def split(self, dataset, feature_index, threshold):
        ''' function to split the data '''
        
        dataset_left = np.array([row for row in dataset if row[feature_index]<=threshold])
        #the left side contains the data points that meet our threshold condition, i.e. all the rows
        #for which the feature value is less than or equal to the threshold.
        
        dataset_right = np.array([row for row in dataset if row[feature_index]>threshold])
        #the right side contains the rows for which the feature value is greater than the threshold.

        return dataset_left, dataset_right
    
    def information_gain(self, parent, l_child, r_child, mode="entropy"):
        ''' function to compute information gain '''
        
        weight_l = len(l_child) / len(parent)
        weight_r = len(r_child) / len(parent)
        
        if mode=="gini":
            gain = self.gini_index(parent) - (weight_l*self.gini_index(l_child) + weight_r*self.gini_index(r_child))
        else:
            gain = self.entropy(parent) - (weight_l*self.entropy(l_child) + weight_r*self.entropy(r_child))
            
            #here we can see two ways of measuring the information contained in a system: gini and entropy.
            #entropy = -∑ p_i*log2(p_i)
            #gini_index = 1 - ∑ p_i**2,
                #where p_i = probability of class i
                #Why would we use the gini function? Unlike entropy, gini has no logarithmic part,
                #so by choosing gini we save computation time
                #(it is cheaper to square a quantity than to take its logarithm).
            
        return gain
    
    def entropy(self, y):
        ''' function to compute entropy '''
        
        class_labels = np.unique(y)
        entropy = 0
        for cls in class_labels:
            p_cls = len(y[y == cls]) / len(y)
            entropy += -p_cls * np.log2(p_cls)
        return entropy
    
    def gini_index(self, y):
        ''' function to compute gini index '''
        
        class_labels = np.unique(y)
        gini = 0
        for cls in class_labels:
            p_cls = len(y[y == cls]) / len(y)
            gini += p_cls**2
        return 1 - gini
    
    
    def calculate_leaf_value(self, Y):
        ''' function to compute leaf node '''
        #the value of a leaf node is the majority class present in the node
        #so...we just need to find the most frequently occurring element in Y!
        
        Y = list(Y)
        return max(Y, key=Y.count)
    
    
    
    #THIS FUNCTION HELPS US TO VISUALIZE DECISION TREE
    def print_tree(self, tree=None, indent=" "):
        ''' function to print the tree '''
        
        if not tree:
            tree = self.root

        if tree.value is not None:
            print(tree.value)

        else:
            print("X_"+str(tree.feature_index), "≤", tree.threshold, "?", np.round(tree.info_gain,3))
            print(tree.curr_depth + 1,":","%sleft: " % (indent), end="")
            self.print_tree(tree.left, indent + "   ")
            print(tree.curr_depth + 1 ,":","%sright: " % (indent), end="")
            self.print_tree(tree.right, indent + "   ")
    
    def fit(self, X, Y):
        ''' function to train the tree '''
        
        dataset = np.concatenate((X, Y), axis=1)
        self.root = self.build_tree(dataset)
    
    def predict(self, X):
        ''' function to predict new dataset '''
        
        predictions = [self.make_prediction(x, self.root) for x in X]
        return predictions
    
    def make_prediction(self, x, tree):
        ''' function to predict a single data point '''
        
        if tree.value is not None: return tree.value
        feature_val = x[tree.feature_index]
        if feature_val<=tree.threshold:
            return self.make_prediction(x, tree.left)
        else:
            return self.make_prediction(x, tree.right)

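As a small check of the two impurity measures used in information_gain, we can call the entropy and gini_index helpers directly on toy label arrays: a pure node should score 0 under both measures, while a perfect 50/50 mix gives an entropy of 1 bit and a Gini index of 0.5.

In [ ]:
toy = DecisionTreeClassifier()
pure = np.array([0, 0, 0, 0])     # a single class -> no impurity
mixed = np.array([0, 0, 1, 1])    # two classes, 50/50 -> maximum impurity
print(toy.entropy(pure), toy.gini_index(pure))     # expected: 0.0 0.0
print(toy.entropy(mixed), toy.gini_index(mixed))   # expected: 1.0 0.5
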
Analysis of the first dataset


About Dataset

Context

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris Setosa, Iris Virginica, and Iris Versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

Reference

R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188.

https://en.wikipedia.org/wiki/Iris_flower_data_set

Content

The dataset contains a set of 150 records under 5 attributes - Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species).


So, our objective here is to predict the class, that is, the species of the iris flower, given its features, which are:

  1. sepal_length
  2. sepal_width
  3. petal_length
  4. petal_width

The classes are species of flowering plant in the genus Iris of the family Iridaceae. Here we have three classes:

  1. Setosa
  2. Versicolor
  3. Virginica

Get the data

In [28]:
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'type']
data = pd.read_csv("iris_data.csv", names=col_names)
data
Out[28]:
sepal_length sepal_width petal_length petal_width type
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns


Let's now implement our algorithm on this dataset.

Train-Test split

Here I am creating the train and test datasets. We will train our model on the train dataset and test it on the test dataset.

In [29]:
X = data.iloc[:, :-1].values
Y = data.iloc[:, -1].values.reshape(-1,1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=41)
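
As a quick check of the split sizes: with 150 samples and test_size=0.2 we expect 120 training rows and 30 test rows.

In [ ]:
len(X_train), len(X_test)   # expected: (120, 30)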

Fit the model

We first instantiate the DecisionTreeClassifier class and then build the model by fitting it to the train dataset.

In [30]:
classifier = DecisionTreeClassifier(min_samples_split=3, max_depth=3)
classifier.fit(X_train,Y_train)

Model visualization

We will use the method print_tree() to visualize our tree.

In [31]:
classifier.print_tree()
X_2 ≤ 1.9 ? 0.337
1 :  left: Iris-setosa
1 :  right: X_3 ≤ 1.5 ? 0.427
2 :     left: X_2 ≤ 4.9 ? 0.051
3 :        left: Iris-versicolor
3 :        right: Iris-virginica
2 :     right: X_2 ≤ 5.0 ? 0.02
3 :        left: X_1 ≤ 2.8 ? 0.208
4 :           left: Iris-virginica
4 :           right: Iris-versicolor
3 :        right: Iris-virginica

Testing the model

We use the defined method predict() to determine the classes of the test dataset. The predictions are stored in Y_pred, which we then compare to Y_test with the help of the sklearn function accuracy_score.

In [41]:
Y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred)
Out[41]:
0.9333333333333333
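
Beyond the single accuracy number, we could also look at which classes get confused with one another, for example with sklearn's confusion_matrix (an optional extra check, sketched below):

In [ ]:
from sklearn.metrics import confusion_matrix
# rows = true classes, columns = predicted classes (in sorted label order)
confusion_matrix(Y_test.ravel(), Y_pred)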

Analysis of the second dataset


Our objective here is to predict whether the customer will purchase the iPhone, given their gender, age and salary.

About the data

Despite all my effort I couldn't find the origin of this data, so it shouldn't be used for any other purposes. The dataset contains a set of 400 records under 4 attributes - Gender, Age, Salary and Class (whether the person made a purchase or not).

Get the data

In [33]:
dataset = pd.read_csv("iphone_purchase_records.csv")
In [34]:
dataset
Out[34]:
Gender Age Salary Purchase Iphone
0 Male 19 19000 0
1 Male 35 20000 0
2 Female 26 43000 0
3 Female 27 57000 0
4 Male 19 76000 0
... ... ... ... ...
395 Female 46 41000 1
396 Male 51 23000 1
397 Female 50 20000 1
398 Male 36 33000 0
399 Female 49 36000 1

400 rows × 4 columns

Converting gender to numbers

In [35]:
#Convert the gender variable into dummy/indicator variables (binary variables), essentially 1's and 0's.
#I chose the variable name one_hot_data because in ML "one-hot" refers to a group of bits among which the
#legal combinations of values are only those with a single high (1) bit and all the others low (0).

one_hot_data = pd.get_dummies(dataset)
new_cols = ["Gender_Female", "Gender_Male", "Age", "Salary","Purchase Iphone"]
data2 = one_hot_data[new_cols]
data2
Out[35]:
Gender_Female Gender_Male Age Salary Purchase Iphone
0 0 1 19 19000 0
1 0 1 35 20000 0
2 1 0 26 43000 0
3 1 0 27 57000 0
4 0 1 19 76000 0
... ... ... ... ... ...
395 1 0 46 41000 1
396 0 1 51 23000 1
397 1 0 50 20000 1
398 0 1 36 33000 0
399 1 0 49 36000 1

400 rows × 5 columns
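
Since Gender_Female and Gender_Male always carry opposite values, one of the two columns is redundant for the split search. A possible variation (not used below) keeps a single indicator column via the drop_first parameter of get_dummies:

In [ ]:
# hypothetical alternative encoding: keep only Gender_Male (1 = male, 0 = female);
# the target column still has to end up in the last position before training
alt_data = pd.get_dummies(dataset, drop_first=True)
alt_data = alt_data[["Gender_Male", "Age", "Salary", "Purchase Iphone"]]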

Train-Test split

In [36]:
X2 = data2.iloc[:, :-1].values
Y2 = data2.iloc[:, -1].values.reshape(-1,1)

from sklearn.model_selection import train_test_split
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X2, Y2, test_size=.2, random_state=41)
In [37]:
len(X_train2), len(X_test2)
Out[37]:
(320, 80)

Fit the model

In [38]:
classifier2 = DecisionTreeClassifier(min_samples_split=3, max_depth=10)
classifier2.fit(X_train2,Y_train2)

Visualizing the model

In [39]:
classifier2.print_tree()
X_2 ≤ 42 ? 0.176
1 :  left: X_3 ≤ 90000 ? 0.185
2 :     left: X_2 ≤ 36 ? 0.005
3 :        left: 0
3 :        right: X_3 ≤ 80000 ? 0.049
4 :           left: X_3 ≤ 74000 ? 0.018
5 :              left: X_2 ≤ 37 ? 0.003
6 :                 left: X_3 ≤ 62000 ? 0.03
7 :                    left: 0
7 :                    right: X_3 ≤ 70000 ? 0.375
8 :                       left: 1
8 :                       right: 0
6 :                 right: 0
5 :              right: X_2 ≤ 39 ? 0.022
6 :                 left: X_3 ≤ 79000 ? 0.111
7 :                    left: X_0 ≤ 0 ? 0.056
8 :                       left: X_3 ≤ 77000 ? 0.444
9 :                          left: 0
9 :                          right: 1
8 :                       right: X_3 ≤ 78000 ? 0.444
9 :                          left: 1
9 :                          right: 0
7 :                    right: 0
6 :                 right: X_2 ≤ 41 ? 0.052
7 :                    left: 0
7 :                    right: X_0 ≤ 0 ? 0.111
8 :                       left: 0
8 :                       right: 1
4 :           right: 1
2 :     right: X_2 ≤ 34 ? 0.043
3 :        left: X_3 ≤ 118000 ? 0.121
4 :           left: X_3 ≤ 112000 ? 0.214
5 :              left: 1
5 :              right: X_3 ≤ 116000 ? 0.122
6 :                 left: 0
6 :                 right: X_2 ≤ 26 ? 0.167
7 :                    left: 0
7 :                    right: X_2 ≤ 31 ? 0.111
8 :                       left: 1
8 :                       right: 0
4 :           right: 1
3 :        right: 1
1 :  right: X_3 ≤ 38000 ? 0.014
2 :     left: X_3 ≤ 22000 ? 0.016
3 :        left: X_2 ≤ 46 ? 0.125
4 :           left: 0
4 :           right: 1
3 :        right: 1
2 :     right: X_2 ≤ 50 ? 0.027
3 :        left: X_3 ≤ 82000 ? 0.052
4 :           left: X_2 ≤ 48 ? 0.042
5 :              left: X_2 ≤ 47 ? 0.069
6 :                 left: X_3 ≤ 43000 ? 0.125
7 :                    left: 0
7 :                    right: X_3 ≤ 79000 ? 0.112
8 :                       left: X_3 ≤ 47000 ? 0.037
9 :                          left: 1
9 :                          right: X_2 ≤ 46 ? 0.12
10 :                             left: 0
10 :                             right: 1
8 :                       right: 0
6 :                 right: 1
5 :              right: X_3 ≤ 39000 ? 0.375
6 :                 left: 1
6 :                 right: 0
4 :           right: X_3 ≤ 139000 ? 0.028
5 :              left: X_2 ≤ 46 ? 0.041
6 :                 left: X_0 ≤ 0 ? 0.122
7 :                    left: 1
7 :                    right: X_2 ≤ 44 ? 0.167
8 :                       left: X_2 ≤ 43 ? 0.111
9 :                          left: 0
9 :                          right: 1
8 :                       right: 0
6 :                 right: 1
5 :              right: 1
3 :        right: X_0 ≤ 0 ? 0.015
4 :           left: 1
4 :           right: X_3 ≤ 42000 ? 0.085
5 :              left: 0
5 :              right: X_2 ≤ 58 ? 0.02
6 :                 left: X_2 ≤ 52 ? 0.027
7 :                    left: X_3 ≤ 114000 ? 0.125
8 :                       left: 1
8 :                       right: 1
7 :                    right: 1
6 :                 right: X_3 ≤ 76000 ? 0.444
7 :                    left: 1
7 :                    right: 0

Testing the model

In [40]:
Y_pred2 = classifier2.predict(X_test2)

from sklearn.metrics import accuracy_score
accuracy_score(Y_test2, Y_pred2)
Out[40]:
0.8625