The main confusion with this function comes from the dependencies between the elements of its input vector $\vec{z}$. So, for example, to compute $a_1$, you will need $z_2$ and $z_3$ as well. This is because of the common denominator shared by all of the $a_i$'s, namely $\sum_{k} e^{z_k}$. If you look below, you will see these dependencies beautifully shown with colorful arrows!
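To make this coupling concrete, here is a minimal NumPy sketch (my own addition, not code from the original post): because every output shares the same denominator, nudging a single input changes *all* of the outputs, not just the matching one.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; this does not change the result
    e = np.exp(z - np.max(z))
    # The shared denominator e.sum() couples every a_i to every z_j
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)

# Perturb only z_1 (index 0)...
z2 = z.copy()
z2[0] += 0.5
a2 = softmax(z2)

# ...and ALL three outputs change, because the denominator changed too
print(a)
print(a2)
```

Both output vectors still sum to 1, which is exactly why increasing one input must pull probability mass away from all the other outputs.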

So, for example, if you needed to compute the derivative of the error with respect to just $z_1$: since $z_1$ has been used in computing all of $a_1$, $a_2$, and $a_3$, you will need to compute the derivatives of all of $a_1$, $a_2$, and $a_3$ w.r.t. $z_1$ (**NOT just the derivative of $a_1$ w.r.t. $z_1$**).
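As a sanity check on that claim, here is a short sketch (again my addition, assuming the standard softmax Jacobian $\partial a_i / \partial z_j = a_i(\delta_{ij} - a_j)$) that builds the full Jacobian and verifies it against finite differences — every entry is nonzero, so each $z_j$ really does influence every $a_i$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # d a_i / d z_j = a_i * (delta_ij - a_j); no entry is ever zero,
    # which is why backprop through softmax must sum over all outputs
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

z = np.array([0.5, 1.0, -0.3])
J = softmax_jacobian(z)

# Numerical check: central finite differences, one input at a time
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    zp = z.copy(); zp[j] += eps
    zm = z.copy(); zm[j] -= eps
    J_num[:, j] = (softmax(zp) - softmax(zm)) / (2 * eps)

assert np.allclose(J, J_num, atol=1e-6)
```

The first column of $J$ contains exactly the three derivatives $\partial a_1/\partial z_1$, $\partial a_2/\partial z_1$, $\partial a_3/\partial z_1$ that the chain rule needs.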

Below, I have elongated the neurons in our simple neural network and shown the mathematical operations inside each one of them. You can trace the dependencies by following the colored arrows:
