
Scikit-Learn is a Python library that provides many machine learning algorithms that are easy to apply and tune for classification and other machine learning tasks.

One of the most appealing things about the Scikit-Learn library is that it has a 4-step modelling pattern that makes it easy to code a machine learning classifier:

1. Import the model you want to use. In Scikit-Learn, all machine learning models are implemented as Python classes.

2. Make an instance of the model.

3. Train the model on the data, storing the information learned from the data.

4. Predict the labels of new data, using the information the model learned during the training process.

IMPLEMENTATION

Loading the Dataset:

The Scikit-learn library provides numerous datasets, among which we will be using a data set of images called Digits. This data set consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale.

We first load the dataset, then check its shape to confirm how many images and labels it contains.
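A minimal sketch of the loading step, assuming the `load_digits` helper from `sklearn.datasets` is the loader meant here. The variable name `digits` is chosen for illustration.

```python
from sklearn.datasets import load_digits

# Load the bundled Digits dataset (variable name `digits` is illustrative).
digits = load_digits()

# Each 8x8 image is flattened into a row of 64 pixel values.
print("Image data shape:", digits.data.shape)    # (1797, 64)
print("Label data shape:", digits.target.shape)  # (1797,)
```

The unflattened 8x8 arrays are also available as `digits.images`, which is convenient for plotting.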



Visualizing the images and labels in our dataset: We can display the grayscale images, together with their labels, using the matplotlib library.
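One possible way to plot a few digits is sketched below. The choice of five images, the figure size, and the title format are assumptions made for illustration, not details taken from the article.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Show the first five images, each titled with its label.
fig, axes = plt.subplots(1, 5, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap="gray")  # 8x8 array rendered in grayscale
    ax.set_title(f"Label: {label}")
    ax.axis("off")
plt.show()
```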



Splitting our data set into training and testing sets: Now let's split our dataset into training and test sets, so that after we train our model we can check how well it generalizes to data it has not seen.
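The split can be sketched with `train_test_split` from `sklearn.model_selection`. The `test_size=0.25` and `random_state=0` values below are assumptions, since the article does not state which settings it used.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out 25% of the samples for testing (assumed ratio);
# random_state fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

print("Training set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)
```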

The Scikit-Learn 4-Step Modelling Pattern:
Step 1. Importing the model we want to use.
Here we will be using Logistic Regression. Logistic regression is a linear classifier, so it works best when the classes can be separated reasonably well by linear decision boundaries in the feature space.
Step 2. Making an instance of the model.
Step 3. Training the model: Here the model learns the relationship between the digits (x_train) and their labels (y_train).
Step 4. Predicting the labels of new data, using the information the model learned during the training process.
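The four steps might look like the sketch below. The split settings and `max_iter=5000` are assumptions: the larger iteration limit is there only to give the default solver room to converge on the unscaled pixel values, and the article does not say which parameters it used.

```python
# Step 1: import the model class.
from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed settings
)

# Step 2: make an instance of the model (max_iter is an assumed setting).
model = LogisticRegression(max_iter=5000)

# Step 3: train the model on the training images and labels.
model.fit(x_train, y_train)

# Step 4: predict labels for the unseen test images.
predictions = model.predict(x_test)
print(predictions[:10])
```

Passing a single image works too, as long as it keeps a 2D shape, for example `model.predict(x_test[0].reshape(1, -1))`.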


Measuring the performance of our Model: To measure the accuracy of our predictions we can use accuracy_score from sklearn.metrics.

This number is the fraction of digits in the test sample that were classified in the right category, meaning that we get 95.33% of the digits correct.
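A sketch of the accuracy check follows, reusing the assumed split and model settings (`test_size=0.25`, `random_state=0`, `max_iter=5000`). Because those settings are assumptions, the value printed may differ slightly from the figure quoted in this article.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed settings
)
model = LogisticRegression(max_iter=5000).fit(x_train, y_train)
predictions = model.predict(x_test)

# Fraction of test labels that were predicted correctly.
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.4f}")
```

The same number is returned by `model.score(x_test, y_test)`, which for classifiers reports mean accuracy.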

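CONFUSION MATRIX: A confusion matrix is a table that is often used to evaluate the performance of a classification model. It counts, for each actual class, how many samples were predicted as each class, so correct predictions fall on the diagonal. We can use Seaborn or Matplotlib to plot the confusion matrix. We will be using Seaborn for our confusion matrix.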
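One way to build and plot the matrix is sketched below, using `confusion_matrix` from `sklearn.metrics` and Seaborn's `heatmap`. The split, model settings, colour map, and axis labels are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed settings
)
model = LogisticRegression(max_iter=5000).fit(x_train, y_train)
predictions = model.predict(x_test)

# Rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_test, predictions)

plt.figure(figsize=(8, 8))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", square=True)
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.title("Confusion matrix")
plt.show()
```

Off-diagonal cells show which digits the model confuses with each other, which the single accuracy figure cannot reveal.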



The above code displays the confusion matrix as shown below.

CONCLUSION: In this article, we have seen how to import a dataset, build a model using scikit-learn, train the model, make predictions with it, and measure the accuracy of those predictions, which in our case is 95.33%.
