Each neuron in an ANN has an activation function, which we denote with f(). This mathematical function first receives the neuron's input x (i.e., the pre-activation). Next, it applies some mathematical manipulation to it (i.e., f()), and finally spits out the result (i.e., f(x)).

**The golden question**: How should we choose this f()?

You might be tempted to choose a simple linear function (discussed in the previous post) for the neurons in your ANN, where for each neuron: f(x) = x.

**Congratulations!** Now you have multiple layers of cascaded linear units across your entire ANN! The composition of many, many nested linear functions is still a linear function! Thus, your entire ANN can still only produce linear functions. **Underwhelming indeed!**
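We can verify this collapse numerically. Below is a minimal sketch with made-up weight matrices W1 and W2 standing in for two linear layers: passing an input through both layers one after the other gives exactly the same result as a single linear layer whose weight matrix is their product.

```python
import numpy as np

# Two "layers" of linear units (f(x) = x), with arbitrary example weights.
W1 = np.array([[2.0, -1.0],
               [0.5,  3.0]])
W2 = np.array([[1.0,  4.0],
               [-2.0, 0.5]])

x = np.array([1.0, 2.0])

# Feed x through layer 1, then layer 2...
two_layers = W2 @ (W1 @ x)

# ...which, by associativity of matrix multiplication, is identical
# to a single linear layer with weight matrix W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, they always collapse into one equivalent linear layer in this way.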

We are interested in ANNs that can represent highly sophisticated non-linear functions, since those are the kinds of problems we face in the real world! So, non-linearity is a desirable feature of an activation function!

Can we use a perceptron unit as our activation function across our ANN? The problem is that the perceptron's step function is discontinuous at 0, and hence non-differentiable there. Worse, its derivative is zero everywhere else, so it provides no gradient signal at all. As a result, it is unsuitable for gradient descent.
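To see the gradient problem concretely, here is a small sketch that approximates the step function's derivative with finite differences at a few sample points (the points chosen are arbitrary illustrations):

```python
import numpy as np

# Perceptron step function: outputs 1 for x >= 0, else 0.
def step(x):
    return np.where(x >= 0, 1.0, 0.0)

# Central finite-difference approximation of the derivative.
eps = 1e-6
for x in [-2.0, -0.5, 0.5, 2.0]:
    grad = (step(x + eps) - step(x - eps)) / (2 * eps)
    print(x, grad)  # gradient is 0.0 at every point away from 0
```

A gradient of zero everywhere means gradient descent has no idea which direction to nudge the weights: the loss surface looks perfectly flat.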

In conclusion, our expectations from f() are two-fold:

1. **We want f() to be non-linear.** This makes the entire ANN a collection of nested non-linear functions, capable of representing some scary non-linear function.
2. **We want f() to be continuous and differentiable** with respect to its input. This makes the entire ANN trainable using gradient descent. Great stuff!
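As a concrete example of a function meeting both criteria, here is a sketch of the classic sigmoid activation: it is non-linear, and it is smooth everywhere with a convenient closed-form derivative.

```python
import numpy as np

# Sigmoid: squashes any real input into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Its derivative has a neat closed form: sigmoid(x) * (1 - sigmoid(x)),
# which is well-defined and non-zero at every point, including 0.
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Non-linearity check: sigmoid(a + b) != sigmoid(a) + sigmoid(b) in general.
print(sigmoid(1.0) + sigmoid(2.0), sigmoid(3.0))

# Differentiable everywhere, with a useful (non-zero) slope:
print(sigmoid_grad(0.0))  # 0.25
```

Both boxes ticked: gradient descent gets a meaningful slope at every input, and stacking sigmoid layers no longer collapses into a single linear map.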