These slides focus on activation functions of neural networks.
1. Hello and welcome to this lecture on activation functions. In this lecture you will learn which activation functions are available, how they work, and when you should use them.
2. There are several activation functions, and I will show you the most prominent ones. Let's start with the Rectified Linear Unit, or ReLU for short. It simply returns zero for all negative input values; if the input value is positive, it is returned unchanged. You can see the function definition phi(x) = max(0, x). This is the most common activation function. It is non-linear, and its derivative is zero for negative inputs and one for positive inputs.
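To make this concrete, here is a minimal NumPy sketch of ReLU and its derivative. It is only an illustration; the function names relu and relu_grad are my own choices and are not part of the slides.

```python
import numpy as np

def relu(x):
    # phi(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative of ReLU: 0 for negative inputs, 1 for positive inputs
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```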
3. The Leaky Rectified Linear Unit is a variant of the ReLU activation function. You can see the formula and the graph: for input values below zero, the output is alpha times the input value, and for x greater than or equal to zero you get x back. The difference lies in the alpha value. If you choose alpha = 0, you recover the classical Rectified Linear Unit; typically alpha is on the order of 0.01. The difference from classical ReLU is highlighted in red in the graph: there is a small slope, which gives a small non-zero gradient even for negative input values. There is an ongoing debate about whether this has advantages over classical ReLU. The results are not conclusive at this point, so there is no general advice to use one or the other.
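As a rough sketch of the difference, the same idea in NumPy, with leaky_relu as an illustrative name and alpha defaulting to the 0.01 mentioned above:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha * x for x < 0, x for x >= 0; alpha = 0 recovers classical ReLU
    return np.where(x < 0, alpha * x, x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))             # [-0.03 -0.01  0.    2.  ]
print(leaky_relu(x, alpha=0.0))  # classical ReLU: [-0. -0.  0.  2.]
```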
4. The hyperbolic tangent can also capture non-linear behavior, but it has a steep gradient only within a narrow input range; everywhere else its derivative is very close to zero. That makes it difficult to improve the weights through gradient descent, and you may run into the vanishing gradient problem. The activation values range between -1 and +1.
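A small NumPy sketch illustrating both points, the bounded outputs and the derivative that shrinks towards zero for large inputs (1 - tanh(x)^2 is the standard derivative of tanh):

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
y = np.tanh(x)        # outputs stay between -1 and +1
grad = 1.0 - y ** 2   # derivative of tanh: largest near 0, almost 0 for large |x|
print(y)              # approx. [-1.    -0.762  0.     0.762  1.   ]
print(grad)           # approx. [ 0.000  0.420  1.     0.420  0.000]
```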
5. The sigmoid has a very similar shape and the same behavior as the hyperbolic tangent. The main difference is that its activation values lie in the range of zero to one. For this reason it is a good choice if you want to predict probabilities.
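For completeness, a minimal sketch of the sigmoid, assuming NumPy; the function name sigmoid is just an illustrative choice:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)); outputs lie between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))  # approx. [0.007 0.5   0.993]
```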
6. Softmax is another activation function. It is often used for multi-class classification tasks, because it produces a number of outputs equal to the number of classes in your classification task. The nodes that feed into the softmax activation function carry different values, but what you want to get out are probabilities for the individual classes, so the outputs should sum up to 1. This is exactly what the softmax activation function provides, based on the shown formula. Let's see an example:
a. The nodes carry the values one, four, two, and three. These values are passed to the softmax activation function. The calculation is straightforward: for each value xi you compute e to the power of xi, and you also compute the sum of all these e-to-the-power-of-xi terms. Finally you divide each individual term by the sum of all terms and obtain the probabilities of the classes, as shown in the sketch after this example.
b. These probabilities always add up to one. Typically you choose the class with the highest probability as the predicted class.
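Here is a minimal NumPy sketch of this worked example with the values one, four, two, and three; the function name softmax is an illustrative choice, not something defined on the slides:

```python
import numpy as np

def softmax(x):
    # e^(x_i) divided by the sum of all e^(x_j) terms
    exps = np.exp(x)
    return exps / exps.sum()

values = np.array([1.0, 4.0, 2.0, 3.0])
probs = softmax(values)
print(probs)             # approx. [0.032 0.644 0.087 0.237]
print(probs.sum())       # 1.0 (the probabilities add up to one)
print(np.argmax(probs))  # 1 -> the class with the highest probability
```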
That’s it for this lecture. Thank you very much for watching and see you in the next one.