What will you learn?
Ask any machine learning expert! They will all have to google the answer to this question:
“What was the derivative of the Softmax function w.r.t (with respect to) its input again?”
The reason behind this forgetfulness is that Softmax(z) is a tricky function, and people tend to forget the process of taking its derivative w.r.t its input, $z$. We need to know this derivative in order to train an Artificial Neural Network. By the end of this post you will have learned the mechanism and the steps required to compute this tricky derivative!
What is a Softmax Unit?
Let’s consider the simple neural network below. We have D-dimensional input data, fully connected weights, and a single output layer with only 3 neurons. These neurons transform their inputs, $z_1$, $z_2$, and $z_3$, using the Softmax function, Softmax(z), and spit out the results, that is $y_1$, $y_2$, and $y_3$.
Now, let’s remind ourselves what the Softmax function really is. In general, for an arbitrary input vector $z = (z_1, \dots, z_K)$, the Softmax function, $S$, returns a vector $y$ of the same length, and the $i$-th element of this output vector is computed as follows:

$$y_i = S(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$$
And let us remember that the sum of all the outputs $y_i$, over all $i$, is equal to 1:

$$\sum_{i=1}^{K} y_i = 1$$
That is the beauty of the Softmax function, as its outputs could be treated as probabilities in a neural network.
NOTE: Strictly speaking, they are NOT guaranteed to be the true probabilities of the classes! But they can be treated as a measure of the network's certainty.
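To make the definition concrete, here is a minimal sketch of the Softmax function in plain Python. The input values `[1.0, 2.0, 3.0]` are made up for illustration, and the function name `softmax` is my own choice, not something tied to a particular library.

```python
import math

def softmax(z):
    """Return the Softmax of a list of numbers z."""
    exps = [math.exp(v) for v in z]   # numerators e^{z_i}
    total = sum(exps)                 # the shared denominator
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]   # hypothetical inputs to the 3 output neurons
y = softmax(z)

print(y)        # roughly [0.090, 0.245, 0.665]
print(sum(y))   # 1, up to floating-point rounding
```

Every output lies between 0 and 1, and together they add up to 1, which is exactly why they are tempting to read as probabilities.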
All of this is beautifully shown below:
What Makes the Derivative of Softmax Tricky!
The main confusion with this function comes from the dependencies between the elements of its input vector $z$. So, for example, for computing the output $y_1$, you will need not only $z_1$ but $z_2$ and $z_3$ as well. This is the case because of the common denominator among all the outputs $y_i$, that is, $\sum_{k} e^{z_k} = e^{z_1} + e^{z_2} + e^{z_3}$. If you look below, you will see these dependencies beautifully shown with colorful arrows!
So, for example, if you needed to compute the derivative of Softmax(z) with respect to just $z_1$, since you have used $z_1$ for computing all of the outputs $y_1$, $y_2$, and $y_3$, you will need to compute the derivative of all of $y_1$, $y_2$, and $y_3$ w.r.t $z_1$ (NOT just the derivative of $y_1$ w.r.t $z_1$).
Below, I have elongated the neurons in our simple neural network, and demonstrated the mathematical operations in each and every one of them. You can see the dependencies by tracing the colored arrows down below:
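If you would like to see this dependency in numbers, the small sketch below nudges only the first input and prints how each of the three outputs moves. The inputs and the size of the nudge (0.1) are arbitrary values I picked for illustration.

```python
import math

def softmax(z):
    """Return the Softmax of a list of numbers z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]                    # hypothetical inputs z_1, z_2, z_3
z_nudged = [z[0] + 0.1, z[1], z[2]]    # change ONLY z_1

before = softmax(z)
after = softmax(z_nudged)

# How much each output moved because of the change in z_1 alone.
changes = [a - b for a, b in zip(after, before)]
print(changes)
```

The first output goes up while the other two go down, even though their own inputs were never touched. The three changes also cancel out, since the outputs must keep summing to 1.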
Computing Each of these Derivatives Separately
Before digging into the math, you need to get comfortable with a few basic mathematical principles of taking derivatives:
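Rule#1: The law of independence

$$\frac{\partial f(z_j)}{\partial z_i} = 0 \quad \text{for } j \neq i$$

Rule#2: The derivative of Exponentials

$$\frac{d}{dx} e^{x} = e^{x}$$

Rule#3: The derivative of fractions

$$\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}$$

Rule#4: The derivative of a sum is the sum of derivatives

$$\frac{d}{dx}\big(f(x) + g(x)\big) = f'(x) + g'(x)$$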
Knowing what we know now, we should be totally fine with the fact that if we wanted to find the derivative of the Softmax function w.r.t any input $z_j$, we would need to consider all of our outputs $y_i$, namely, in our small example, all of $y_1$, $y_2$, and $y_3$.
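Let’s start with $y_1$. When computing $\frac{\partial y_1}{\partial z_1}$, we are basically considering the first neuron in our neural network, and taking the derivative of its output, $y_1$, w.r.t its input, $z_1$. Take a look at the steps below, where $\Sigma = e^{z_1} + e^{z_2} + e^{z_3}$ is the shared denominator:

$$\frac{\partial y_1}{\partial z_1} = \frac{\partial}{\partial z_1}\left(\frac{e^{z_1}}{\Sigma}\right) = \frac{\frac{\partial e^{z_1}}{\partial z_1}\,\Sigma - e^{z_1}\,\frac{\partial \Sigma}{\partial z_1}}{\Sigma^2}$$

$$= \frac{e^{z_1}\,\Sigma - e^{z_1}\,e^{z_1}}{\Sigma^2} = \frac{e^{z_1}}{\Sigma}\cdot\frac{\Sigma - e^{z_1}}{\Sigma} = y_1\,(1 - y_1)$$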
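So the first line uses rule number 3, our rule for the derivative of fractions. Then, in the second line, we can see how $\frac{\partial e^{z_1}}{\partial z_1} = e^{z_1}$, as we are using rule number 2. We are also using rule number 4, as the derivative of a sum is the sum of derivatives, meaning:

$$\frac{\partial}{\partial z_1}\left(e^{z_1} + e^{z_2} + e^{z_3}\right) = \frac{\partial e^{z_1}}{\partial z_1} + \frac{\partial e^{z_2}}{\partial z_1} + \frac{\partial e^{z_3}}{\partial z_1}$$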
And we can immediately use rule number 1 of independence, and conclude that in the equation above, only the first term survives and the second and the third term will become 0!
In the end, you can see that when computing the derivative of the output of a Softmax neuron, $y_1$, w.r.t its direct input, $z_1$, all we need to do is to compute $y_1\,(1 - y_1)$, which is neat and great!
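A quick way to convince yourself of this result is to compare it against a finite-difference estimate. The sketch below is only a sanity check under my own assumptions: the inputs `[1.0, 2.0, 3.0]` and the step size `h` are arbitrary, and it uses $y_1$ for the first output and $z_1$ for the first input.

```python
import math

def softmax(z):
    """Return the Softmax of a list of numbers z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]   # hypothetical inputs z_1, z_2, z_3
h = 1e-6              # small step for the finite difference (an assumption)

# Nudge only z_1 up and down, and watch the first output y_1.
y1_up = softmax([z[0] + h, z[1], z[2]])[0]
y1_down = softmax([z[0] - h, z[1], z[2]])[0]
numeric = (y1_up - y1_down) / (2 * h)

# The closed form for the same-index case: y_1 * (1 - y_1).
y1 = softmax(z)[0]
analytic = y1 * (1.0 - y1)

print(numeric, analytic)   # the two numbers should agree to many decimals
```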
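Now, what about $y_2$? We need to compute its derivative w.r.t $z_1$ as well, right? See how beautifully this works out below, where again $\Sigma = e^{z_1} + e^{z_2} + e^{z_3}$:

$$\frac{\partial y_2}{\partial z_1} = \frac{\partial}{\partial z_1}\left(\frac{e^{z_2}}{\Sigma}\right) = \frac{0\cdot\Sigma - e^{z_2}\,e^{z_1}}{\Sigma^2} = -\frac{e^{z_2}}{\Sigma}\cdot\frac{e^{z_1}}{\Sigma} = -y_2\,y_1$$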
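You can see that unlike the case with $\frac{\partial y_1}{\partial z_1}$, now the final result is $-y_2\,y_1$. Can you guess what the result would look like for $\frac{\partial y_3}{\partial z_1}$? See below, with $\Sigma = e^{z_1} + e^{z_2} + e^{z_3}$:

$$\frac{\partial y_3}{\partial z_1} = \frac{\partial}{\partial z_1}\left(\frac{e^{z_3}}{\Sigma}\right) = \frac{0\cdot\Sigma - e^{z_3}\,e^{z_1}}{\Sigma^2} = -y_3\,y_1$$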
Wow! Just like $\frac{\partial y_2}{\partial z_1}$, again the final result for the partial derivative of $y_3$ w.r.t $z_1$ is $-y_3\,y_1$.
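The same finite-difference trick works for these cross terms. This is again just a sketch with made-up inputs and an arbitrary step size `h`. It estimates the derivatives of the second and third outputs w.r.t the first input, and compares them to the products $-y_2\,y_1$ and $-y_3\,y_1$.

```python
import math

def softmax(z):
    """Return the Softmax of a list of numbers z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]   # hypothetical inputs z_1, z_2, z_3
h = 1e-6              # small step for the finite difference (an assumption)

y = softmax(z)
y_up = softmax([z[0] + h, z[1], z[2]])     # only z_1 is nudged
y_down = softmax([z[0] - h, z[1], z[2]])

# Finite-difference estimates of dy_2/dz_1 and dy_3/dz_1.
numeric = [(y_up[i] - y_down[i]) / (2 * h) for i in (1, 2)]

# Closed forms for the different-index case: -y_2 * y_1 and -y_3 * y_1.
analytic = [-y[i] * y[0] for i in (1, 2)]

for n, a in zip(numeric, analytic):
    print(n, a)   # each pair should agree to many decimals
```

Both derivatives come out negative: pushing $z_1$ up can only take share away from the other two outputs.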
Can you see the emerging pattern yet? 🙂
So, is there a MAIN Derivative Rule?
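So, it all boils down to the indices of $y_i$ and $z_j$! Meaning, if we are computing $\frac{\partial y_i}{\partial z_i}$, where the output and the input belong to the same neuron, the rule is always:

$$\frac{\partial y_i}{\partial z_i} = y_i\,(1 - y_i)$$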
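However, if we are computing $\frac{\partial y_i}{\partial z_j}$ with $i \neq j$, we are taking the derivative of the output of neuron $i$, that is $y_i$, w.r.t the input of neuron $j$, that is $z_j$. In this case the rule changes to $-y_i\,y_j$. Below you can see all of this, beautifully and mathematically demonstrated:

$$\frac{\partial y_i}{\partial z_j} = \begin{cases} y_i\,(1 - y_i) & \text{if } i = j \\ -y_i\,y_j & \text{if } i \neq j \end{cases}$$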
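The two cases can be collected into a single table of all the partial derivatives, usually called the Jacobian matrix (a term this post has not used so far). Below is a minimal sketch in plain Python. The function names `softmax_jacobian` and `numerical_jacobian` are hypothetical helpers of mine, and the finite-difference version is there only as a sanity check.

```python
import math

def softmax(z):
    """Return the Softmax of a list of numbers z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = dy_i/dz_j: y_i * (1 - y_i) if i == j, else -y_i * y_j."""
    y = softmax(z)
    n = len(y)
    return [
        [y[i] * (1.0 - y[i]) if i == j else -y[i] * y[j] for j in range(n)]
        for i in range(n)
    ]

def numerical_jacobian(z, h=1e-6):
    """Central finite differences, used only to double-check the rule."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_up = list(z)
        z_up[j] += h
        z_down = list(z)
        z_down[j] -= h
        y_up, y_down = softmax(z_up), softmax(z_down)
        for i in range(n):
            J[i][j] = (y_up[i] - y_down[i]) / (2 * h)
    return J

z = [1.0, 2.0, 3.0]   # hypothetical inputs z_1, z_2, z_3
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)

for row in J:
    print(row)
```

The diagonal entries are positive and the off-diagonal ones are negative. Each column also sums to 0, because the outputs always add up to 1, so whatever one output gains the others must lose.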
Today, you have learned the basics of the famous Softmax function, which is commonly used in Artificial Neural Networks for the task of classification. You have learned about the dependencies between the elements inside the Softmax function and seen how these can make computing the gradient a little bit tricky. I do hope that this has been helpful.
On behalf of MLDawn,
Take care 😉