So, just as a gentle reminder, for a network with $K$ output neurons the softmax function can be defined as:

$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K$$

where $z_i$ is the input to the $i$-th output neuron and $y_i$ is the corresponding softmax output.
And the interesting property of this function is that the sum of all the outputs of softmax is always equal to 1:

$$\sum_{i=1}^{K} y_i = 1$$
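To make this property concrete, here is a minimal NumPy sketch of the softmax function (the function name, the shift by the maximum, and the example numbers are my own choices, not code from this tutorial):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of raw scores z."""
    # Subtract the max for numerical stability; it does not change the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs (softmax inputs)
y = softmax(z)
print(y)          # approximately [0.659 0.242 0.099]
print(y.sum())    # 1.0 -- the softmax outputs always sum to 1 (up to rounding)
```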
Now, about the ground-truth vector, $\mathbf{t}$, we mentioned that it is a one-hot encoded vector. This means that in every ground-truth vector ONLY one element is equal to 1 and all the other elements are equal to 0!
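As a tiny illustrative sketch (the number of classes and the class index here are hypothetical), building such a one-hot ground-truth vector in NumPy could look like this:

```python
import numpy as np

num_classes = 3
true_class = 0                 # class 1 (0-indexed) for this training example
t = np.zeros(num_classes)
t[true_class] = 1.0
print(t)                       # [1. 0. 0.] -- only one element is 1, the rest are 0
```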
Finally, regarding the cross-entropy error function, the mathematical representation of this function is as follows:

$$E = -\sum_{i=1}^{K} t_i \, \ln(y_i)$$
In this definition of the error, $\ln$ is the natural logarithm. As an example, let’s say we have 3 output neurons. For a given training example, the output vector of this neural network will then have 3 elements in it. Let’s say the output vector is $\mathbf{y} = (y_1, y_2, y_3)$, whose largest element is $y_2 = 0.6$.
By the property of the softmax function, these three values sum up to 1. And, let’s say the ground-truth vector for the same input training example is $\mathbf{t} = (1, 0, 0)$.
With a simple comparison between the network output vector and the ground-truth vector you can see that the network thinks the given training example belongs to class 2 (as 0.6 is the largest value and corresponds to class 2). However, the ground-truth says that the training example actually belongs to class 1 (as only the value corresponding to the first class is 1 and all the others are 0).
Now, let’s see how we can compute the cross-entropy error function:

$$E = -\sum_{i=1}^{3} t_i \, \ln(y_i)$$

which is equal to:

$$E = -\big(1 \cdot \ln(y_1) + 0 \cdot \ln(0.6) + 0 \cdot \ln(y_3)\big)$$

which is:

$$E = -\ln(y_1)$$

In other words, only the output for the true class contributes to the error, and since $y_1 < 1$, the error $-\ln(y_1)$ is positive and grows as $y_1$ shrinks.
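Putting these pieces together, here is a small NumPy sketch of the same computation; the concrete output values $(0.3, 0.6, 0.1)$ are illustrative numbers of my own choosing (only the 0.6 for class 2 comes from the example above):

```python
import numpy as np

def cross_entropy(y, t):
    """Cross-entropy error E = -sum_i t_i * ln(y_i) for a single training example."""
    return -np.sum(t * np.log(y))

y = np.array([0.3, 0.6, 0.1])   # illustrative softmax outputs (they sum to 1)
t = np.array([1.0, 0.0, 0.0])   # one-hot ground truth: the example belongs to class 1

E = cross_entropy(y, t)
print(E)                         # -ln(0.3), roughly 1.204
```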
Now that we are comfortable with the whole setting, let’s see how we can derive the gradient of this error function with respect to the inputs of the softmax output function and apply back-propagation from scratch!
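As a sanity check to keep in mind while following the derivation, here is a short NumPy sketch (the function names and numbers are my own, not the tutorial’s code) comparing the well-known closed-form gradient of the cross-entropy error with respect to the softmax inputs, $\mathbf{y} - \mathbf{t}$, against a finite-difference estimate:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def cross_entropy_from_logits(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])    # illustrative softmax inputs
t = np.array([1.0, 0.0, 0.0])    # one-hot ground truth

# Closed-form gradient of the cross-entropy error w.r.t. the softmax inputs: y - t
analytical = softmax(z) - t

# Finite-difference (numerical) estimate of the same gradient
eps = 1e-6
numerical = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numerical[i] = (cross_entropy_from_logits(z_plus, t) -
                    cross_entropy_from_logits(z_minus, t)) / (2 * eps)

print(analytical)   # approximately [-0.341  0.242  0.099]
print(numerical)    # matches the analytical gradient to high precision
```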
Great tutorial and explanation! The only thing I would fix is the images (equations) showing the rules of partial derivatives; they are overlapping the text.
Thanks for the feedback. Duly noted!
Great explanation!
I am glad it was helpful 🙂
This is by far the best explanation on the internet. I’ve spent like 3 hours looking for answers, and every website or video I’ve come across has taken incomprehensible shortcuts that convolute the process so much. This was a very by-the-textbook, simple derivation: longer, but way more understandable.
My suffering was identical to yours when I was learning about this topic! That was the main reason why I created this tutorial!
Hi, thanks for the explanation, really clear and easy to follow. However, I don’t understand why we are calculating the derivative of the loss w.r.t. the softmax, instead of calculating the derivative of the loss w.r.t. the weights.
Here is the catch! The only way you can compute the derivative of the loss w.r.t. the weights is to go through the softmax function (according to the chain rule).
I am doing a Data Science course, and apart from the uses of Softmax in classification problems, no one says anything more about it. Having understood back-propagation to a reasonable extent on YT and such, I was very curious about how it works with Softmax, and this article explained it pretty clearly. Apart from Softmax, I am also interested in understanding the mechanics of MaxPool, Dropout, and BatchNormalization, specifically w.r.t. back-propagation. Any pointers?
Cool! Please be more specific about your questions 🙂
Finally, I found a good explanation for this topic. Other articles were way too difficult for a beginner like me to understand. Splendid JOB !!
Perfect 🙂