What will you learn?

Ask any machine learning expert! They will all have to google the answer to this question:

“What was the derivative of the Softmax function w.r.t (with respect to) its input again?”

The reason behind this forgetfulness is that Softmax(z) is a tricky function, and people tend to forget the process of taking its derivative w.r.t its input, z. We need to know this derivative in order to train an Artificial Neural Network. By the end of this post you will have learned the mechanism and the steps required to compute this tricky derivative!

What is a Softmax Unit?

Let’s consider the simple neural network below. We have D-dimensional input data, a set of fully connected weights, and a single output layer. This output layer has only 3 neurons. These neurons transform their inputs z_i using the Softmax function, S(z_i), and output the results: S(z_1), S(z_2), and S(z_3).

Now, let’s remind ourselves what the Softmax function really is. In general, for an arbitrary vector Z = [z_1, z_2, ... , z_N] of inputs, the Softmax function, S, returns a vector S(Z), and the i^{th} element of this output vector is computed as follows:

S(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}

And let us remember that the sum of S(z_i) over all i is equal to 1:

\sum_{i=1}^{N} S(z_i) = 1
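To make the definition concrete, here is a minimal sketch in plain Python. The function name `softmax` and the sample input are illustrative choices. Subtracting the largest input before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels in the numerator and the denominator.

```python
import math

def softmax(z):
    """Return the softmax of a list of real numbers z."""
    # Assumption: shift by max(z) to avoid overflow in exp; the output is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

s = softmax([1.0, 2.0, 3.0])
print(s)       # three positive numbers, the largest for the largest input
print(sum(s))  # 1.0 up to floating-point rounding
```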

That is the beauty of the Softmax function: its outputs are non-negative and sum to 1, so they can be treated as probabilities in a neural network.

NOTE: They are not guaranteed to be true (calibrated) probabilities! But they can be treated as a measure of the network's certainty.

All of this is beautifully shown down below:

What Makes the Derivative of Softmax Tricky!

The main confusion with this function is the dependencies between the elements of its input vector Z. So, for example, for computing S(z_1), you will need z_2 and z_3 as well. This is the case because of the common denominator shared by every S(z_i), that is, \sum_{j=1}^{N} e^{z_j}. If you look below, you will see these dependencies beautifully shown with colorful arrows!

So, for example, if you needed to compute the derivative of S(Z) with respect to just z_1, then since z_1 is used in computing all of S(z_1), S(z_2), and S(z_3), you will need the derivative of each of S(z_1), S(z_2), and S(z_3) w.r.t z_1 (NOT just the derivative of S(z_1) w.r.t z_1).
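A quick way to see this dependency is to nudge only z_1 and watch every output move. The sketch below is illustrative: the helper `softmax`, the input values, and the step of 0.1 are all assumptions made for the example.

```python
import math

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
z_nudged = [z[0] + 0.1, z[1], z[2]]  # change only z_1

before = softmax(z)
after = softmax(z_nudged)
changes = [a - b for a, b in zip(after, before)]
print(changes)  # S(z_1) goes up, while S(z_2) and S(z_3) both go down
```

Because the outputs always sum to 1, the three changes cancel each other out.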

Below, I have elongated the neurons in our simple neural network, and demonstrated the mathematical operations in each and every one of them. You can see the dependencies by tracing the colored arrows down below:

Computing Each of these Derivatives Separately

Before digging into the math, you need to get comfortable with a few basic mathematical principles of taking derivatives:

Rule#1: The law of independence

\frac{\partial e^{z_i}}{\partial z_j}=0 \quad \text{for } i \neq j

Rule#2: The derivative of Exponentials

\frac{\partial e^{z_i}}{\partial z_i}=e^{z_i}

Rule#3: The derivative of fractions h(x)=\frac{f(x)}{g(x)}

\frac{\partial h(x)}{\partial x} = \frac{f'(x)\times g(x) - g'(x) \times f(x)}{g(x)^{2}}

Rule#4: The derivative of a sum is the sum of derivatives

\frac{\partial (f(x) + h(x))}{\partial x} = \frac{\partial f(x)}{\partial x} + \frac{\partial h(x)}{\partial x}

Knowing what we know now, we should be totally fine with the fact that if we want the derivative of the Softmax function w.r.t any single input, we need to consider every output that depends on it, namely, in our small example, all of S(z_1), S(z_2), and S(z_3).

Let’s start with S(z_1). When computing \frac{\partial S(z_1)}{\partial z_1}, we are considering the first neuron in our neural network and taking the derivative of its output, S(z_1), w.r.t its input, z_1. Take a look at the steps down below:
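One way to write these steps out, as a sketch that uses only the four rules above, is the following. The symbol \Sigma is shorthand introduced here (it does not appear elsewhere in this post) for the common denominator e^{z_1}+e^{z_2}+e^{z_3}.

```latex
% \Sigma is shorthand for e^{z_1} + e^{z_2} + e^{z_3}
\begin{aligned}
\frac{\partial S(z_1)}{\partial z_1}
  &= \frac{\frac{\partial e^{z_1}}{\partial z_1} \times \Sigma - \frac{\partial \Sigma}{\partial z_1} \times e^{z_1}}{\Sigma^{2}} \\
  &= \frac{e^{z_1} \times \Sigma - e^{z_1} \times e^{z_1}}{\Sigma^{2}} \\
  &= \frac{e^{z_1}}{\Sigma} \times \frac{\Sigma - e^{z_1}}{\Sigma} \\
  &= S(z_1) \times (1 - S(z_1))
\end{aligned}
```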

So the first line uses rule number 3, the derivative of fractions. Then, in the second line, we can see that \frac{\partial e^{z_1}}{\partial z_1}=e^{z_1}, as we are using rule number 2. We are also using rule number 4, as the derivative of a sum is the sum of derivatives, meaning:

\frac{\partial (e^{z_1}+e^{z_2}+e^{z_3})}{\partial z_1}=\frac{\partial e^{z_1}}{\partial z_1} + \frac{\partial e^{z_2}}{\partial z_1} + \frac{\partial e^{z_3}}{\partial z_1}

And we can immediately use rule number 1, the law of independence, and conclude that in the equation above only the first term survives, while the second and third terms become 0, leaving just e^{z_1}!

In the end, the final result shows that when computing the derivative of the output of a Softmax neuron, S(z_1), w.r.t its direct input z_1, all we need to compute is S(z_1) \times (1 - S(z_1)), which is neat and great!

Now, what about S(z_2)? We need to compute its derivative w.r.t z_1 as well, right? See how beautifully this works out, down below:
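As a sketch, the same four rules give the following steps. Again, \Sigma is shorthand introduced here for e^{z_1}+e^{z_2}+e^{z_3}. The key difference is that the numerator e^{z_2} does not depend on z_1, so by rule number 1 its derivative is 0.

```latex
% \Sigma is shorthand for e^{z_1} + e^{z_2} + e^{z_3}
\begin{aligned}
\frac{\partial S(z_2)}{\partial z_1}
  &= \frac{\frac{\partial e^{z_2}}{\partial z_1} \times \Sigma - \frac{\partial \Sigma}{\partial z_1} \times e^{z_2}}{\Sigma^{2}} \\
  &= \frac{0 \times \Sigma - e^{z_1} \times e^{z_2}}{\Sigma^{2}} \\
  &= -\frac{e^{z_1}}{\Sigma} \times \frac{e^{z_2}}{\Sigma} \\
  &= -S(z_1) \times S(z_2)
\end{aligned}
```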

You can see that unlike the case with S(z_1), now the final result is: -S(z_1) \times S(z_2). Can you guess what the result would look like for S(z_3)? See down below:
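Sketched with the same rules and the same shorthand \Sigma = e^{z_1}+e^{z_2}+e^{z_3} (a symbol introduced here for brevity), the steps for S(z_3) mirror the ones for S(z_2), with only the index changed:

```latex
% \Sigma is shorthand for e^{z_1} + e^{z_2} + e^{z_3}
\begin{aligned}
\frac{\partial S(z_3)}{\partial z_1}
  &= \frac{\frac{\partial e^{z_3}}{\partial z_1} \times \Sigma - \frac{\partial \Sigma}{\partial z_1} \times e^{z_3}}{\Sigma^{2}} \\
  &= \frac{0 \times \Sigma - e^{z_1} \times e^{z_3}}{\Sigma^{2}} \\
  &= -\frac{e^{z_1}}{\Sigma} \times \frac{e^{z_3}}{\Sigma}
\end{aligned}
```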

Wow! Just like S(z_2), again the final result for the partial derivative of S(z_3) w.r.t z_1 is -S(z_1) \times S(z_3).

Can you see the emerging pattern yet? 🙂

So, is there a MAIN Derivative Rule?

So, it all boils down to the indices of S(z_i) and z_j! Meaning, if we are computing \frac{\partial S(z_i)}{\partial z_i}, the rule is always: S(z_i) \times (1 - S(z_i))

However, if we are computing \frac{\partial S(z_i)}{\partial z_j} with i \neq j, we are taking the derivative of the output of neuron i, that is S(z_i), w.r.t the input of a different neuron j, that is z_j. In this case the rule changes to: -S(z_i) \times S(z_j). Below you can see all of this, beautifully and mathematically demonstrated:
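The two rules can be collected into a single matrix of partial derivatives (the Jacobian), whose entry in row i and column j is the derivative of S(z_i) w.r.t z_j. The following is a minimal sketch in plain Python, not a reference implementation. The names `softmax_jacobian` and `numerical_jacobian`, the test input, and the step size `eps` are all illustrative assumptions. The second function estimates the same derivatives by central finite differences, which gives an independent check on the rules.

```python
import math

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = dS(z_i)/dz_j, built directly from the two rules."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * (1.0 - s[i]) if i == j else -s[i] * s[j]
             for j in range(n)]
            for i in range(n)]

def numerical_jacobian(z, eps=1e-6):
    """Central finite-difference estimate of the same matrix."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        s_plus, s_minus = softmax(z_plus), softmax(z_minus)
        for i in range(n):
            J[i][j] = (s_plus[i] - s_minus[i]) / (2 * eps)
    return J

z = [1.0, 2.0, 3.0]
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)

# Largest disagreement between the rule-based and numerical derivatives.
max_gap = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

Two properties are worth noticing in `J`: it is symmetric, because -S(z_i) \times S(z_j) does not care about the order of i and j, and every row sums to 0, because the outputs always sum to 1.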


Today, you have learned the basics of the famous Softmax function, which is commonly used in Artificial Neural Networks for classification. You have learned about the dependencies between the elements inside the Softmax function and seen how they make computing the gradient a little bit tricky. I do hope that this has been helpful.

Until next time,

On behalf of MLDawn,

Take care 😉

5 thoughts on “The Derivative of Softmax(z) Function w.r.t z”

  1. Pingback: Back-propagation with Cross-Entropy and Softmax – ML-DAWN

  2. Hi. Thank you for this. I now truly understand the softmax derivation.
    I have a question. Say the error of output S(Z1) w.r.t z1 is A. That of S(Z1) wrt to Z2 is B, and that of S(Z1) wrt to Z3 is C.

    So, what is the total error of the output S(Z1)? Do you add A, B, C or multiply them?
