What will you learn?
Ask any machine learning expert! They will all have to google the answer to this question:
“What was the derivative of the Softmax function w.r.t (with respect to) its input again?”
The reason behind this forgetfulness is that Softmax(z) is a tricky function, and people tend to forget the process of taking its derivative w.r.t (with respect to) its input, $\mathbf{z}$. We need to know this derivative in order to train an Artificial Neural Network. By the end of this post you will have learned the mechanism and the steps required to compute this tricky derivative!
What is a Softmax Unit?
Let’s consider the simple neural network shown down below. We have $D$-dimensional input data, a set of fully connected weights, and our one and only output layer. This output layer has only 3 neurons. These neurons manipulate their inputs using the Softmax function, $S(\cdot)$, and spit out the results, that is, $S(z_1)$, $S(z_2)$, and $S(z_3)$.

Now, let’s remind ourselves as to what the Softmax function really is. In general, for an arbitrary vector of inputs $\mathbf{z} = [z_1, z_2, \dots, z_K]$, the Softmax function, $S$, returns a vector $S(\mathbf{z})$, and the $j$-th element of this output vector is computed as follows:

$$S(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

And let us remember that the sum of all $S(z_j)$’s, over all $j$’s, is equal to 1:

$$\sum_{j=1}^{K} S(z_j) = 1$$
That is the beauty of the Softmax function: its outputs can be treated as probabilities in a neural network.
NOTE: They are NOT probabilities! But they can be treated as a measure of certainty in a Neural Network.
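To see this numerically, here is a minimal NumPy sketch (the function name softmax and the example inputs are my own, chosen just for illustration):

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D array; subtracting max(z) keeps the
    exponentials numerically stable without changing the result."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 0.5])   # the inputs z1, z2, z3 of our tiny output layer
s = softmax(z)
print(s)          # roughly [0.231 0.629 0.140] -- all between 0 and 1
print(s.sum())    # 1.0 (up to floating-point precision)
```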
All of this is beautifully shown down below:

What Makes the Derivative of Softmax Tricky!
The main confusion with this function comes from the dependencies between the elements of its input vector $\mathbf{z}$. So, for example, for computing $S(z_1)$, you will need $z_2$ and $z_3$ as well. This is the case because of the common denominator shared by all the $S(z_j)$’s, that is, $e^{z_1}+e^{z_2}+e^{z_3}$. If you look below, you will see these dependencies beautifully shown with colorful arrows!
So, for example, if you needed to compute the derivative of the Softmax outputs with respect to just $z_1$, then since you have used $z_1$ for computing all of $S(z_1)$, $S(z_2)$, and $S(z_3)$, you will need to compute the derivative of all of $S(z_1)$, $S(z_2)$, and $S(z_3)$ w.r.t $z_1$ (NOT just the derivative of $S(z_1)$ w.r.t $z_1$).
Below, I have elongated the neurons in our simple neural network, and demonstrated the mathematical operations in each and every one of them. You can see the dependencies by tracing the colored arrows down below:

Computing Each of these Derivatives Separately
Before digging into the math, you need to get comfortable with a few basic mathematical principles of taking derivatives:
Rule#1: The law of independence: the inputs are independent of one another, so $\frac{\partial z_i}{\partial z_j} = 0$ whenever $i \neq j$
Rule#2: The derivative of exponentials: $\frac{d}{dx}e^{x} = e^{x}$
Rule#3: The derivative of fractions (the quotient rule): $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^{2}}$
Rule#4: The derivative of a sum is the sum of derivatives: $(f + g)' = f' + g'$
Knowing what we know now, we should be totally fine with the fact that if we wanted to find the derivative of the Softmax function w.r.t any $z_j$, we would need to consider all of our $S(z_i)$’s, namely, in our small example, all of $S(z_1)$, $S(z_2)$, and $S(z_3)$.
Let’s start with $\frac{\partial S(z_1)}{\partial z_1}$. Down below, when computing $\frac{\partial S(z_1)}{\partial z_1}$, we are basically considering the first neuron in our neural network and taking the derivative of its output, $S(z_1)$, w.r.t its input, $z_1$. Take a look at the steps down below:
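Written out explicitly, the derivation can be sketched as follows (the grouping of the steps may differ slightly from the original figure, but it uses exactly the rules above):

$$\frac{\partial S(z_1)}{\partial z_1} = \frac{\partial}{\partial z_1}\left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) = \frac{\frac{\partial e^{z_1}}{\partial z_1}\left(e^{z_1}+e^{z_2}+e^{z_3}\right) - e^{z_1}\,\frac{\partial \left(e^{z_1}+e^{z_2}+e^{z_3}\right)}{\partial z_1}}{\left(e^{z_1}+e^{z_2}+e^{z_3}\right)^{2}}$$

$$= \frac{e^{z_1}\left(e^{z_1}+e^{z_2}+e^{z_3}\right) - e^{z_1}\,e^{z_1}}{\left(e^{z_1}+e^{z_2}+e^{z_3}\right)^{2}} = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\cdot\frac{\left(e^{z_1}+e^{z_2}+e^{z_3}\right)-e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} = S(z_1)\big(1-S(z_1)\big)$$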

So the first line uses rule number 3, the derivative of fractions. Then, in the second line, we are using rule number 2, since $\frac{\partial e^{z_1}}{\partial z_1} = e^{z_1}$. We are also using rule number 4, as the derivative of a sum is the sum of derivatives, meaning:

$$\frac{\partial \left(e^{z_1}+e^{z_2}+e^{z_3}\right)}{\partial z_1} = \frac{\partial e^{z_1}}{\partial z_1} + \frac{\partial e^{z_2}}{\partial z_1} + \frac{\partial e^{z_3}}{\partial z_1}$$

And we can immediately use rule number 1 of independence and conclude that, in the equation above, only the first term survives; the second and the third terms become 0!
In the end, in yellow, you can see that when computing the derivative of the output of a Softmax neuron, $S(z_1)$, w.r.t its direct input, $z_1$, all we get is $S(z_1)\big(1 - S(z_1)\big)$, which is neat and great!
Now, what about $S(z_2)$? We need to compute its derivative w.r.t $z_1$ as well, right? See how beautifully this works out, down below:
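Again, sketching the steps (the same quotient rule, but now the numerator $e^{z_2}$ does not depend on $z_1$, so rule number 1 sets its derivative to 0):

$$\frac{\partial S(z_2)}{\partial z_1} = \frac{\partial}{\partial z_1}\left(\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) = \frac{0\cdot\left(e^{z_1}+e^{z_2}+e^{z_3}\right) - e^{z_2}\,e^{z_1}}{\left(e^{z_1}+e^{z_2}+e^{z_3}\right)^{2}} = -\,\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}\cdot\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} = -S(z_2)\,S(z_1)$$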

You can see that, unlike the case with $\frac{\partial S(z_1)}{\partial z_1}$, now the final result is $-S(z_2)\,S(z_1)$. Can you guess what the result would look like for $\frac{\partial S(z_3)}{\partial z_1}$? See down below:
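By exactly the same argument (only the index in the numerator changes), a sketch of the result is:

$$\frac{\partial S(z_3)}{\partial z_1} = \frac{0\cdot\left(e^{z_1}+e^{z_2}+e^{z_3}\right) - e^{z_3}\,e^{z_1}}{\left(e^{z_1}+e^{z_2}+e^{z_3}\right)^{2}} = -S(z_3)\,S(z_1)$$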

Wow! Just like $\frac{\partial S(z_2)}{\partial z_1}$, the final result for the partial derivative of $S(z_3)$ w.r.t $z_1$ is again $-S(z_3)\,S(z_1)$.
Can you see the emerging pattern yet? 🙂
So, is there a MAIN Derivative Rule?
So, it all boils down to the indices of $S(z_i)$ and $z_j$! Meaning, if we are computing $\frac{\partial S(z_i)}{\partial z_j}$ with $i = j$, the rule is always:

$$\frac{\partial S(z_i)}{\partial z_i} = S(z_i)\big(1 - S(z_i)\big)$$

However, if we are computing $\frac{\partial S(z_i)}{\partial z_j}$ with $i \neq j$, where we are taking the derivative of the output of neuron $i$, that is $S(z_i)$, w.r.t the input of neuron $j$, that is $z_j$, the rule changes to:

$$\frac{\partial S(z_i)}{\partial z_j} = -S(z_i)\,S(z_j)$$

Below you can see all of this, beautifully and mathematically demonstrated:
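As a quick sanity check, here is a minimal NumPy sketch (the helper names softmax_jacobian and numerical_jacobian are my own, not from the post) that builds the full Jacobian from the two rules above, compactly $\frac{\partial S(z_i)}{\partial z_j} = S(z_i)\,(\delta_{ij} - S(z_j))$, and compares it against a finite-difference approximation:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max(z) for numerical stability
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    """Jacobian built from the two rules:
       dS(z_i)/dz_j =  S(z_i) * (1 - S(z_i))  if i == j
       dS(z_i)/dz_j = -S(z_i) * S(z_j)        if i != j
       i.e. S(z_i) * (delta_ij - S(z_j))."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences, perturbing one input z_j at a time."""
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[j] += eps
        z_minus[j] -= eps
        J[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)
    return J

z = np.array([1.0, 2.0, 0.5])
print(np.allclose(softmax_jacobian(z), numerical_jacobian(z), atol=1e-8))  # True
```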

Conclusions
Today, you have learned the basics of the famous Softmax function, which is commonly used in Artificial Neural Networks for the task of classification. You have learned about the dependencies between the elements inside the Softmax function and seen how they can make computing the gradient a little bit tricky. I do hope that this has been helpful.
Until then,
On behalf of MLDawn,
Take care 😉

Author: Mehran
Dr. Mehran H. Bazargani is a researcher and educator specialising in machine learning and computational neuroscience. He earned his Ph.D. from University College Dublin, where his research centered on semi-supervised anomaly detection through the application of One-Class Radial Basis Function (RBF) Networks. His academic foundation was laid with a Bachelor of Science degree in Information Technology, followed by a Master of Science in Computer Engineering from Eastern Mediterranean University, where he focused on molecular communication facilitated by relay nodes in nano wireless sensor networks. Dr. Bazargani’s research interests are situated at the intersection of artificial intelligence and neuroscience, with an emphasis on developing brain-inspired artificial neural networks grounded in the Free Energy Principle. His work aims to model human cognition, including perception, decision-making, and planning, by integrating advanced concepts such as predictive coding and active inference. As a NeuroInsight Marie Skłodowska-Curie Fellow, Dr. Bazargani is currently investigating the mechanisms underlying hallucinations, conceptualising them as instances of false inference about the environment. His research seeks to address this phenomenon in neuropsychiatric disorders by employing brain-inspired AI models, notably predictive coding (PC) networks, to simulate hallucinatory experiences in human perception.
Responses
[…] our previous post, we talked about the derivative of the softmax function with respect to its input. We indeed […]
Hi. Thank you for this. I now truly understand the softmax derivation.
I have a question. Say the error of output S(Z1) w.r.t z1 is A. That of S(Z1) wrt to Z2 is B, and that of S(Z1) wrt to Z3 is C.
So, what is the total error of the output S(Z1)? Do you add A, B, C or multiply them?
Thanks a lot. I am not sure what ‘ the error of output S(Z1) w.r.t z1 is A’ really means! Did you mean the derivative instead of error, perhaps?
Yes, That’s what he meant, and I’m still curious for the answer, is there a summation of derivatives?
Unfortunately I am not sure if I follow. Just work it out manually! The answer should emerge pretty quickly.
I was curious about this too! It seems like a summation as described in this video https://youtu.be/znqbtL0fRA0?si=yOrHkZWy6WUJnWd8&t=2530