Binary Classification from Scratch using Numpy

Hello friends and welcome to MLDawn!
So what sometimes concerns me is the fact that magnificent packages such as PyTorch (indeed amazing), TensorFlow (indeed a nightmare), and Keras (indeed great) have prevented machine learning enthusiasts from really learning the science behind how the machine learns. To be more accurate, there is nothing wrong with having these packages, but I suppose we, as humans, have always wanted to learn the fast and easy way! I am arguing that while we enjoy the speed and ease of coding that such packages bless us with, we need to know what is happening behind the scenes!
For example, when we want to build a complicated multi-class classifier with a sophisticated Artificial Neural Network (ANN) architecture, full of convolutional layers and recurrent units, we need to ask ourselves:
Do I know the logic behind this?! Why convolutional layers? Why recurrent layers? Why do I use a softmax() function as the output function? And why do people always use cross-entropy as the appropriate error function in such scenarios? Is there a reason why cross-entropy pairs well with softmax?
So, one way we could understand the answer to some of these questions, is to see whether we can implement a simple binary classifier on some synthetic 1-dimensional data using the simplest ANN possible, from scratch!
In this post we will code this simple neural network from scratch using numpy! We will also use matplotlib for some nice visualisations.
So, if we consider our synthetic data to be a bunch of scalars, and 1-Dimensional, this is the simple ANN structure that we could be interested in building from scratch!
Right! So, now it is time to do the coding bit. First and foremost, let’s import the necessary packages! The almighty numpy and matplotlib’s pyplot are both needed.
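A minimal import block, assuming we stick to just these two libraries, might look like this:

```python
# The only two packages we need: numpy for the math, matplotlib for the plots
import numpy as np
import matplotlib.pyplot as plt
```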
I am sure that, as a Neural Network enthusiast, you are familiar with the idea of the sigmoid() function and the binary cross-entropy function. We need to use them during the forward pass. Before showing you the code, let me refresh your memory on the math: sigmoid(z) = \frac{1}{1 + e^{-z}} and as for the binary cross-entropy:
E(y, y_{hat}) = -y\ln(y_{hat}) - (1-y)\ln(1-y_{hat})
Note that the base of the logarithm is e, meaning that what we have here is the Natural Logarithm!
Remember that this error function is nothing but a measure of difference between the output of the ANN, y_{hat}, and the ground-truth, y. So, the lower the better!
Now let’s see the code. We can code these two functions as two separate Python functions:
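Here is one way those two functions could look. The function names, and the small eps used to avoid taking log(0), are my own choices rather than anything forced on us:

```python
def sigmoid(z):
    # Squashes any real-valued input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Natural-log cross-entropy between the ground truth y and the prediction y_hat.
    # A tiny eps keeps us from taking log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)
```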
Now, we have to think ahead, right? So, during the back-propagation phase, we will need 2 things!
  1. The derivative of the Error w.r.t. the output y_{hat}
  2. The derivative of the output y_{hat} w.r.t. z (i.e., the derivative of the output of the sigmoid() function w.r.t. its input, z)
Now, as a reminder:
\frac{dE}{dy_{hat}} = -\frac{y}{y_{hat}}+\frac{1-y}{1-y_{hat}}
And as for the sigmoid(z), or for short, sig(z):
\frac{dsig(z)}{dz} = sig(z)(1-sig(z))
This is not a neural network course, so I am not going to derive these mathematically here. Now, let us see the two Python functions whose job is to compute these derivatives:
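A possible sketch of those two helpers, written directly from the formulas above (again, the names are just my own):

```python
def dE_dy_hat(y, y_hat, eps=1e-12):
    # Derivative of the binary cross-entropy w.r.t. the network output y_hat
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -y / y_hat + (1.0 - y) / (1.0 - y_hat)

def dsig_dz(z):
    # Derivative of the sigmoid w.r.t. its input z: sig(z) * (1 - sig(z))
    s = sigmoid(z)
    return s * (1.0 - s)
```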
Next, we need to generate our data points. We could use two random normal distributions to generate our 1-dimensional data, and we will make sure that the two classes are somewhat linearly separable, for simplicity. We can control this by tweaking the parameters of the two Gaussians. For example, in our code below, we have made sure that the mean (i.e., the center) of the second Gaussian is 5 units away from the first Gaussian, whose mean is 0. Also, we have made sure that the standard deviation of the second Gaussian is half that of the first Gaussian, whose standard deviation is 1. By randomly drawing 500 samples from the first Gaussian and 500 samples from the second Gaussian, we have generated the data for class 0 and class 1 in our newly born dataset. Below you can see the code for doing this:
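A sketch of that data-generation step could look like the following. The random seed and the quick histogram at the end are my own additions, there for reproducibility and a sanity check; the original figure may have been produced differently:

```python
np.random.seed(0)  # my own choice, just for reproducibility

N = 500  # samples per class

# Class 0: mean 0, standard deviation 1
x0 = np.random.normal(loc=0.0, scale=1.0, size=N)

# Class 1: mean 5 (5 units away from the first), standard deviation 0.5 (half of the first)
x1 = np.random.normal(loc=5.0, scale=0.5, size=N)

# A quick look at the two clouds of 1-D points
plt.hist(x0, bins=30, alpha=0.6, label='class 0')
plt.hist(x1, bins=30, alpha=0.6, label='class 1')
plt.legend()
plt.show()
```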
And here you can see our generated data from both Gaussians:

So now we have these data points as our training data. However, remember that our binary classifier is a supervised algorithm, and just like any other supervised machine learning algorithm, we need the ground truth for our training data. More specifically, since we have 2 classes, we would consider 500 zeros for the data points belonging to class 0, and 500 ones for the data points belonging to class 1.
You must remember that, for a binary classification problem, we tend to use the sigmoid() output function. A sigmoid() generates values between 0 and 1, and we would like to learn the weights in our ANN in such a way that the generated sigmoid() function, would output 1 for all the instances in class 1, and 0 for all the instances in class 0.
So, the code below concatenates all the data points into one big numpy array, X, which is the entire training set. Then we generate the ground truth (i.e., the labels), namely 500 zeros and 500 ones, as one big ground-truth numpy array called Y.
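Building on the arrays x0 and x1 from the previous snippet, this could be as simple as:

```python
# The full training set: 1000 scalars, class 0 first, then class 1
X = np.concatenate([x0, x1])

# The ground truth: 500 zeros followed by 500 ones
Y = np.concatenate([np.zeros(N), np.ones(N)])
```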
The next step is, of course, to generate our weights randomly. They need to be rather small, as we would like the input of our sigmoid() function to be fairly close to 0 at the beginning of our training. Do you know why? Think about the back-propagation! This way the gradient of the sigmoid() with respect to its input (i.e., at a point close to zero) would be quite high! Think about the slope of the tangent line on a sigmoid function at a point close to 0. This can make learning faster and accelerate our convergence to a good model! I am not going to dig deeper into it, but make sure you understand the concept of back-propagation, as it is a crucial one for understanding ANNs and the way they learn!
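One possible initialisation, where the 0.01 scale is simply my own interpretation of "rather small":

```python
# Small random weights so that z = w1*x + w0 starts out close to 0,
# where the sigmoid is steepest and its gradient is largest
w1 = np.random.randn() * 0.01
w0 = np.random.randn() * 0.01
```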
Here comes the exciting part. This is where we will start the training process. Remember:
The learning task here means finding the correct weights that would force our sigmoid() function to produce 1’s for all the instances belonging to class 1 and 0’s for the others (class 1 and 0 are just names; you could say positive class and negative class). Remember that the input to the sigmoid() unit is nothing but LITERALLY the simple equation of a line: z = w_1x + w_0. The sigmoid() function then literally squashes this line from both ends towards +1 and 0. So, the output of our neural network, y_{hat}, is equal to sigmoid(w_1x + w_0), and is bounded in the range [0, +1].
So, it is clear that it is by the output of the sigmoid(), which happens to be the output of our ANN, that we can decide whether a given x, belongs to class 1 or 0. The training and visualization code is down below:
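Here is a sketch of what that training loop could look like, using the helper functions and arrays defined above. The learning rate, the use of plain gradient descent on the mean gradient, and the plotting details are my own choices and may well differ from the original code; the 120 epochs and the every-20-epochs snapshots match the description that follows:

```python
lr = 0.01        # learning rate (my own choice)
epochs = 120
errors = []

for epoch in range(epochs):
    # Forward pass: a line, squashed by the sigmoid
    z = w1 * X + w0
    y_hat = sigmoid(z)
    errors.append(np.mean(binary_cross_entropy(Y, y_hat)))

    # Backward pass, via the chain rule: dE/dw = dE/dy_hat * dy_hat/dz * dz/dw
    grad_z = dE_dy_hat(Y, y_hat) * dsig_dz(z)
    grad_w1 = np.mean(grad_z * X)   # dz/dw1 = x
    grad_w0 = np.mean(grad_z)       # dz/dw0 = 1

    # Gradient descent update
    w1 -= lr * grad_w1
    w0 -= lr * grad_w0

    # Visualise the current model every 20 epochs
    if epoch % 20 == 0:
        xs = np.linspace(X.min(), X.max(), 200)
        plt.scatter(X, Y, s=5, label='data')
        plt.plot(xs, w1 * xs + w0, 'k', label='z = w1*x + w0')
        plt.plot(xs, sigmoid(w1 * xs + w0), 'r', label='y_hat = sigmoid(z)')
        plt.ylim(-0.5, 1.5)
        plt.title(f'epoch {epoch}, error {errors[-1]:.3f}')
        plt.legend()
        plt.show()

# The trend of the error over all training epochs
plt.plot(errors)
plt.xlabel('epoch')
plt.ylabel('mean binary cross-entropy')
plt.show()
```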
And beautifully enough, over the 120 epochs for which we have trained our network, down below we can see the progress of the trained model at every 20 epochs. Note how the cross-entropy error is decreasing. Also, note how the black line, that is z, gets squashed and turned into a sigmoidal output between 0 and 1, that is y_{hat}. You can see how in the very beginning the separation is terrible, but then it improves gradually!

And here is the trend of the Error as a function of our training epochs. As you can see, it is pleasantly decreasing!
Finally, here is all the code in one place:
On behalf of MLDawn, I do hope that this has been a helpful post 😉
Keep up the good work and good luck!
MLDawn
