Analyzing Iris Dataset

Prepared by Mahsa Sadi on 2020-06-22



In this notebook, we perform three steps:

  1. Reading the iris dataset.
  2. Visualizing the iris dataset.
  3. Building different models over the dataset, then evaluating and comparing their accuracy.

The iris dataset describes instances of three categories of iris flowers: setosa, versicolor, and virginica. For each instance it records four features (attributes), namely the length and width of the sepal and the length and width of the petal, and identifies the category the instance belongs to.

In [1]:
# pandas is a Python library for manipulating and analyzing numerical tables and time series
import pandas
print ("pandas {}".format (pandas.__version__))

from pandas.plotting import scatter_matrix
pandas 0.24.2
In [2]:
# matplotlib is a Python plotting library for numerical data
import matplotlib
print ("matplotlib {}".format (matplotlib.__version__))

import matplotlib.pyplot
matplotlib 3.1.3
In [3]:
# sklearn is a Python library for machine learning.
# sklearn contains various datasets and ready-to-use implementations of various machine learning algorithms.
import sklearn
print ("sklearn {}".format (sklearn.__version__))

from sklearn import model_selection
sklearn 0.21.3
In [4]:
from sklearn.metrics import classification_report
In [5]:
from sklearn.metrics import confusion_matrix
In [6]:
from sklearn.metrics import accuracy_score
In [7]:
from sklearn.linear_model import LogisticRegression
In [8]:
from sklearn.tree import DecisionTreeClassifier
In [9]:
from sklearn.neighbors import KNeighborsClassifier
In [10]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
In [11]:
from sklearn.naive_bayes import GaussianNB
In [12]:
from sklearn.svm import SVC

Reading Data

In [13]:
#  The url of the iris dataset.

url  = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
In [14]:
# The iris dataset has five columns (four independent variables and one dependent variable).

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
In [15]:
# Read the iris dataset from the defined url; the column names of the dataset are given by the names list

data_set = pandas.read_csv (url, names = names)
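The file at the url has no header row, which is why the column names are passed in explicitly. The following is a minimal sketch of the same call on an in-memory copy of three rows, so it runs without network access; `raw_text`, `column_names`, and `sample` are illustrative names that do not appear in the original notebook.

```python
import io
import pandas

# Three rows in the same comma-separated, header-less layout as iris.data
# (the values are the first three rows of the dataset).
raw_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header = None is already implied when names is given; it is spelled out here for clarity.
sample = pandas.read_csv (io.StringIO (raw_text), header = None, names = column_names)

print (sample.shape)
print (sample.dtypes)
```

The four measurement columns are parsed as floating-point numbers, while the class column is kept as text.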
In [16]:
# Display how many rows and columns the iris dataset has

print (data_set.shape)
(150, 5)
In [17]:
# Display the first 20 rows of the iris dataset

print (data_set.head (20))
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

Summarizing Data

In [18]:
# Provide a summary of the iris dataset

print (data_set.describe())
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
In [19]:
# See how many instances of each class exist

print (data_set.groupby('class').size())
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

Visualizing Data

In [21]:
# Draw a box plot for the iris dataset without subplots (all features on one set of axes)

# A box plot helps gain insight into the distribution of each feature (i.e., attribute)

data_set.plot (kind = 'box', subplots = False, sharex = False, sharey = False)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f32ad2dbdd0>
In [22]:
# Gain summarized insight into the distribution of each feature in the dataset

# Draw a box plot for the iris dataset with subplots (one panel per feature)

# A box plot helps gain insight into the distribution of each feature (i.e., attribute)

data_set.plot (kind = 'box', subplots = True, layout = (2,2), sharex = False, sharey = False)
Out[22]:
sepal_length       AxesSubplot(0.125,0.536818;0.352273x0.343182)
sepal_width     AxesSubplot(0.547727,0.536818;0.352273x0.343182)
petal_length          AxesSubplot(0.125,0.125;0.352273x0.343182)
petal_width        AxesSubplot(0.547727,0.125;0.352273x0.343182)
dtype: object
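The numbers behind a box plot can also be computed directly. The sketch below assumes matplotlib's default whisker rule (1.5 times the interquartile range) and uses only the first ten sepal_length values printed earlier, so the figures are illustrative rather than a summary of the whole dataset. The variable names are hypothetical.

```python
import pandas

# First ten sepal_length values from the table printed by data_set.head (20).
sepal_length = pandas.Series ([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])

# The box spans the first to the third quartile; the line inside it is the median.
q1 = sepal_length.quantile (0.25)
median = sepal_length.quantile (0.50)
q3 = sepal_length.quantile (0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = sepal_length [(sepal_length < lower_limit) | (sepal_length > upper_limit)]

print (q1, median, q3, iqr)
print (len (outliers))
```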
In [23]:
# Gain detailed insight into the distribution of each feature in the dataset, i.e., look carefully at the distribution of each feature

# Draw the histograms of the dataset.

# The histogram of a feature helps us understand whether its distribution is normal (Gaussian) or not.

data_set.hist ()

# In the resulting histograms, the sepal width (and, more roughly, the sepal length) looks approximately Gaussian, while the petal measurements are clearly bimodal.
Out[23]:
[2 x 2 array of AxesSubplot objects: one histogram panel per feature]
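A visual reading of the histograms can be cross-checked with a formal normality test. The sketch below applies the Shapiro-Wilk test from SciPy to each feature. It is an assumption-laden illustration: it loads the copy of the iris data bundled with scikit-learn (which differs in a few values from the UCI file read above), and the names `features` and `p_values` are not part of the original notebook.

```python
import pandas
from scipy import stats
from sklearn.datasets import load_iris

# Bundled copy of the iris data; columns are ordered sepal length, sepal width,
# petal length, petal width.
iris = load_iris ()
features = pandas.DataFrame (iris.data, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

# Shapiro-Wilk: a small p-value is evidence against a normal distribution.
p_values = {}
for column in features.columns:
    statistic, p_value = stats.shapiro (features [column].values)
    p_values [column] = p_value
    print ("%s: W = %.3f, p = %.4g" % (column, statistic, p_value))
```

A feature whose p-value falls below a chosen threshold (0.05 is a common choice) is unlikely to be normally distributed.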
In [24]:
# Visualize the data

# Gain insight about the relationships between features
scatter_matrix (data_set)
Out[24]:
[4 x 4 array of AxesSubplot objects: one panel per pair of features]
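The relationships that the scatter matrix shows visually can also be summarized as numbers. As a rough sketch, the Pearson correlation matrix below is computed on the copy of the iris data bundled with scikit-learn (an assumption made to keep the example self-contained; it differs in a few values from the UCI file). `features` and `correlations` are illustrative names.

```python
import pandas
from sklearn.datasets import load_iris

iris = load_iris ()
features = pandas.DataFrame (iris.data, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

# Pairwise Pearson correlation between the four features.
correlations = features.corr ()
print (correlations.round (2))
```

Values close to 1 or -1 correspond to panels in which the points lie near a straight line; values close to 0 correspond to panels with no clear linear trend.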
In [25]:
data_table = data_set.values
print (data_table [0:20])
[[5.1 3.5 1.4 0.2 'Iris-setosa']
 [4.9 3.0 1.4 0.2 'Iris-setosa']
 [4.7 3.2 1.3 0.2 'Iris-setosa']
 [4.6 3.1 1.5 0.2 'Iris-setosa']
 [5.0 3.6 1.4 0.2 'Iris-setosa']
 [5.4 3.9 1.7 0.4 'Iris-setosa']
 [4.6 3.4 1.4 0.3 'Iris-setosa']
 [5.0 3.4 1.5 0.2 'Iris-setosa']
 [4.4 2.9 1.4 0.2 'Iris-setosa']
 [4.9 3.1 1.5 0.1 'Iris-setosa']
 [5.4 3.7 1.5 0.2 'Iris-setosa']
 [4.8 3.4 1.6 0.2 'Iris-setosa']
 [4.8 3.0 1.4 0.1 'Iris-setosa']
 [4.3 3.0 1.1 0.1 'Iris-setosa']
 [5.8 4.0 1.2 0.2 'Iris-setosa']
 [5.7 4.4 1.5 0.4 'Iris-setosa']
 [5.4 3.9 1.3 0.4 'Iris-setosa']
 [5.1 3.5 1.4 0.3 'Iris-setosa']
 [5.7 3.8 1.7 0.3 'Iris-setosa']
 [5.1 3.8 1.5 0.3 'Iris-setosa']]

Modeling and Analyzing Data

In [26]:
# Independent variables of the iris flowers (the features)
X = data_table [:, 0:4]

# Dependent variable (the category of the iris flowers)
Y = data_table [:, 4]
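Because the table mixes numbers and text, the array returned by `.values` has dtype `object`, and so do the slices taken from it. scikit-learn converts such feature arrays itself, but the conversion can also be made explicit. A minimal sketch on a two-row stand-in table (`sample`, `X_sample`, and `Y_sample` are hypothetical names):

```python
import pandas

# Two-row stand-in with the same column layout as the iris dataset.
sample = pandas.DataFrame ({
    'sepal_length': [5.1, 4.9],
    'sepal_width': [3.5, 3.0],
    'petal_length': [1.4, 1.4],
    'petal_width': [0.2, 0.2],
    'class': ['Iris-setosa', 'Iris-setosa'],
})

table = sample.values
print (table.dtype)

# Convert the four feature columns to floating point; keep the labels as they are.
X_sample = table [:, 0:4].astype (float)
Y_sample = table [:, 4]
print (X_sample.dtype, X_sample.shape)
```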
In [27]:
# Split the iris dataset into two sets: a training set and a test set.
# 80% of the iris dataset is used for training and 20% for testing.
# The size of the test set, as a fraction of the whole dataset
test_set_size = 0.2
# Seed for the random number generator, so that the random split is reproducible
seed = 6
In [28]:
# Randomly split the iris dataset into training and test sets, using the test set size and seed defined above.
X_train, X_test, Y_train, Y_test = model_selection.train_test_split (X,Y, test_size = test_set_size, random_state = seed)
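With 150 rows and a test fraction of 0.2, the split yields 120 training rows and 30 test rows. The sketch below checks this on stand-in arrays of the same shape, and also shows the optional `stratify` argument, which is not used in this notebook but keeps the class proportions equal in both sets. All `_demo` names are hypothetical.

```python
import numpy
from sklearn import model_selection

# Stand-in data with the same shape as the iris features and labels.
X_demo = numpy.arange (150 * 4).reshape (150, 4)
Y_demo = numpy.repeat (['setosa', 'versicolor', 'virginica'], 50)

X_tr, X_te, Y_tr, Y_te = model_selection.train_test_split (X_demo, Y_demo, test_size = 0.2, random_state = 6)
print (X_tr.shape, X_te.shape)

# Optional variation: a stratified split draws the same share from every class.
_, _, _, Y_te_strat = model_selection.train_test_split (X_demo, Y_demo, test_size = 0.2, random_state = 6, stratify = Y_demo)
labels, label_counts = numpy.unique (Y_te_strat, return_counts = True)
counts = dict (zip (labels, label_counts))
print (counts)
```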
In [29]:
# Use accuracy as the measure of the performance of the built model
scoring = 'accuracy'
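Accuracy is the fraction of instances whose predicted category matches the true category. A small sketch with made-up labels (the two lists are hypothetical and are not model output from this notebook):

```python
from sklearn.metrics import accuracy_score

true_labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica']
predicted_labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica']

# Three of the four predictions match the true labels.
accuracy = accuracy_score (true_labels, predicted_labels)
print (accuracy)
```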
In [30]:
# Build different models for the iris data

models = []

models.append (('Logistic Regression', LogisticRegression(solver ='lbfgs',  multi_class = 'ovr')))
models.append (('Linear Discriminant Analysis', LinearDiscriminantAnalysis()))
models.append (('K Nearest Neighbors', KNeighborsClassifier()))
models.append (('CART', DecisionTreeClassifier()))
models.append (('Support Vector Machine', SVC(gamma ='scale')))
models.append (('Gaussian Naive Bayes', GaussianNB()))
In [31]:
results = []
names = []
In [32]:
# Evaluate each model with 10-fold cross-validation on the training set: the training data is divided into 10 folds, and in each round 9 folds are used for fitting and the remaining fold for validation.
# The folds are taken in order (no shuffling), so KFold needs no random_state; recent scikit-learn versions raise an error if random_state is set while shuffle is False.
for name, model in models:
    names.append (name)
    K_Fold = model_selection.KFold (n_splits = 10)
    cv_results = model_selection.cross_val_score (model, X_train, Y_train, cv = K_Fold, scoring = scoring)
    results.append (cv_results)
    # Report the mean accuracy over the 10 folds and, in parentheses, its standard deviation
    message =  "%s:  %f  (%f)" % (name, cv_results.mean (), cv_results.std())
    print (message)
Logistic Regression:  0.933333  (0.050000)
Linear Discriminant Analysis:  0.975000  (0.038188)
K Nearest Neighbors:  0.958333  (0.055902)
CART:  0.950000  (0.040825)
Support Vector Machine:  0.950000  (0.076376)
Gaussian Naive Bayes:  0.966667  (0.055277)
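To see what each cross-validation round works with, the fold sizes can be listed directly. The sketch below assumes a training set of 120 rows (80% of 150) and uses a placeholder array in its place; `placeholder` and `fold_sizes` are illustrative names.

```python
import numpy
from sklearn import model_selection

# Placeholder with as many rows as the training set (120 = 80% of 150).
placeholder = numpy.zeros ((120, 4))

k_fold = model_selection.KFold (n_splits = 10)

# For every round, record how many rows are used for fitting and for validation.
fold_sizes = [(len (train_index), len (validation_index)) for train_index, validation_index in k_fold.split (placeholder)]
print (fold_sizes)
```

Each of the 10 rounds fits on 108 rows and validates on the remaining 12, so a single misclassified validation row changes that round's accuracy by about 0.083. This partly explains the fairly large standard deviations reported next to the mean accuracies.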