So, just as a gentle reminder, the softmax function can be defined as:

$$\hat{y}_i = \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

And the interesting property of this function is that the sum of all the outputs of the softmax is always equal to 1:

$$\sum_{i=1}^{K} \mathrm{softmax}(z_i) = 1$$

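This definition can be sketched in a few lines of NumPy (the function name `softmax` and the max-subtraction trick for numerical stability are my additions, not part of the original):

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result (softmax is invariant
    # to shifting all inputs by a constant) but avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y_hat = softmax(z)
print(y_hat)        # every entry lies in (0, 1)
print(y_hat.sum())  # the outputs always sum to 1
```

Running this for any input vector, the printed sum is always 1, which is exactly the property above.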
Now, about the ground-truth vector, $\mathbf{t}$: we mentioned that it is a one-hot encoded vector. This means that in every ground-truth vector **only one** element is equal to 1 and all the other elements are equal to 0!

Finally, regarding the cross-entropy error function, its mathematical representation is as follows:

$$E = -\sum_{i=1}^{K} t_i \ln(\hat{y}_i)$$

In this definition of the error, $\ln$ is the natural logarithm. As an example, let’s say we have 3 output neurons. For a given training example, the output vector of this neural network will then have 3 elements. Let’s say the output vector is as follows:

$$\hat{\mathbf{y}} = (0.3,\; 0.6,\; 0.1)$$

You notice that these sum up to 1, as the property of the softmax function guarantees. And let’s say the ground-truth vector for the same input training example is as follows:

$$\mathbf{t} = (1,\; 0,\; 0)$$

With a simple comparison between the network output vector and the ground-truth vector, you can see that the network thinks the given training example belongs to class 2 (as 0.6 is the largest value and corresponds to class 2). However, the ground truth says that the training example actually belongs to class 1 (as only the value corresponding to the first class is 1 and all the others are 0).
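This comparison is just an argmax over each vector. A quick sketch (the values 0.3 and 0.1 are illustrative assumptions; only 0.6 as the class-2 output is fixed by the example):

```python
import numpy as np

y_hat = np.array([0.3, 0.6, 0.1])  # network output (assumed values)
t = np.array([1, 0, 0])            # one-hot ground truth: class 1

predicted_class = np.argmax(y_hat) + 1  # +1 to use 1-based class labels
true_class = np.argmax(t) + 1
print(predicted_class, true_class)  # predicted: class 2, truth: class 1
```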

Now, let’s see how we can compute the cross-entropy error:

$$E = -\sum_{i=1}^{3} t_i \ln(\hat{y}_i) = -\big(1 \cdot \ln(0.3) + 0 \cdot \ln(0.6) + 0 \cdot \ln(0.1)\big)$$

which is equal to:

$$E = -\ln(0.3)$$

which is:

$$E \approx 1.204$$

Notice that the zeros in the one-hot ground-truth vector eliminate every term except the one for the true class, so the error is simply the negative log of the probability the network assigned to the correct class.

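The computation above can be verified in a few lines of NumPy (the helper name `cross_entropy` and the output values other than 0.6 are illustrative assumptions):

```python
import numpy as np

def cross_entropy(y_hat, t):
    # E = -sum_i t_i * ln(y_hat_i); with a one-hot t, only the term
    # for the true class survives.
    return -np.sum(t * np.log(y_hat))

y_hat = np.array([0.3, 0.6, 0.1])  # network output (sums to 1)
t = np.array([1.0, 0.0, 0.0])      # one-hot ground truth: class 1

E = cross_entropy(y_hat, t)
print(E)  # equal to -ln(0.3)
```

Note that the error depends only on the probability assigned to the true class: a more confident correct prediction (e.g. 0.9 for class 1) would yield a smaller error.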
Now that we are comfortable with the whole setting, let’s see how we can derive the gradient of this error function with respect to the inputs of the softmax output function and apply back-propagation from scratch!
