These slides focus on loss functions of neural networks.
1. Hello and welcome to this lecture on loss functions. In this lecture I will present different loss functions and explain how they work.
2. The loss function is an important part of the learning process: it evaluates the model's performance during training by measuring how much the predictions deviate from the actual values. Together with an optimizer, which we will discuss afterwards, the loss function is what gradually improves the model; the loss is minimized during training. It is also possible to have multiple loss functions for one model – in that case one loss function for each output variable (a minimal code sketch of this follows below).
a. In general there are loss functions for regression and classification tasks.
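The lecture does not prescribe a framework, but as a minimal sketch of how a loss function and an optimizer are combined, and how a multi-output model gets one loss per output, here is a hypothetical Keras example; the layer sizes and output names are illustrative only.

```python
from tensorflow import keras

inputs = keras.Input(shape=(10,))
hidden = keras.layers.Dense(16, activation="relu")(inputs)

# Two outputs: one regression value and one binary class probability.
reg_out = keras.layers.Dense(1, activation="linear", name="reg_out")(hidden)
cls_out = keras.layers.Dense(1, activation="sigmoid", name="cls_out")(hidden)

model = keras.Model(inputs=inputs, outputs=[reg_out, cls_out])

# One loss function per output variable; the optimizer minimizes their sum.
model.compile(
    optimizer="adam",
    loss={"reg_out": "mse", "cls_out": "binary_crossentropy"},
)
```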
3. Let’s start with regression. The first one we will discuss is Mean Squared Error (MSE). You know this from the regression lectures: it takes the differences between actual and predicted values, squares them, and averages the result. The squaring penalizes outliers strongly. Another one is Mean Absolute Error (MAE), which averages the absolute differences between actual and predicted values. This loss function is more robust to outliers, but computing its gradients is less convenient (the absolute value is not differentiable at zero). A less well-known error is the Mean Bias Error (MBE), which averages the raw differences. Since the differences are not made positive, as they are by squaring or taking the absolute value, positive and negative differences can cancel each other out, so be cautious with this error. For regression, the output layer has one node and a typical activation function is linear.
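To make the three regression losses concrete, here is a small NumPy sketch; the function and variable names are my own, the formulas are the standard definitions just described.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Squared differences punish outliers strongly.
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # Absolute differences are more robust to outliers.
    return np.mean(np.abs(y_true - y_pred))

def mean_bias_error(y_true, y_pred):
    # Raw differences: positive and negative errors can cancel out.
    return np.mean(y_true - y_pred)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mean_squared_error(y_true, y_pred))   # 0.375
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_bias_error(y_true, y_pred))      # -0.25
```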
4. For classification tasks – binary cross entropy is the most common loss function and is applicable to binary classification. The typical activation function is sigmoid, because it provides values in the range of zero to one. The loss penalizes the model when the predicted probability differs from the actual label, and a nice feature is that predictions made with high confidence that turn out to be wrong are penalized especially heavily. For the output layer, the same holds as before: it should have one node, and here a typical activation function is sigmoid.
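A small sketch of binary cross entropy, assuming labels encoded as 0/1 and sigmoid outputs; the example values are illustrative and show how a confident but wrong prediction is penalized much more heavily than a confident correct one.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log(0) cannot occur.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0])
print(binary_cross_entropy(y_true, np.array([0.9, 0.1])))  # ~0.105: confident and correct
print(binary_cross_entropy(y_true, np.array([0.1, 0.9])))  # ~2.303: confident but wrong
```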
5. Another loss function for classification is hinge loss, which is also called SVM loss because of its use in support vector machines. It is used for a maximum-margin classifier in a binary classification problem and adds a penalty when the predicted score does not match the actual class with a sufficient margin. The output layer has one node; since the class labels are encoded as -1 and 1, a typical activation function here is tanh.
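A minimal sketch of hinge loss, assuming labels encoded as -1/1 and a model score in roughly the same range (e.g. a tanh output); names and numbers are illustrative.

```python
import numpy as np

def hinge_loss(y_true, y_pred):
    # y_true is -1 or 1; y_pred is the model's score.
    # No penalty once the prediction is on the correct side with margin >= 1.
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

y_true = np.array([1.0, -1.0, 1.0])
y_pred = np.array([0.8, -0.9, -0.3])   # third example is on the wrong side
print(hinge_loss(y_true, y_pred))      # (0.2 + 0.1 + 1.3) / 3 ~ 0.533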
6. A different type of problem is multi-class classification, where each example belongs to exactly one of n classes. Here, the most common loss function is categorical cross entropy. For this kind of problem, the output layer has n nodes, where n is the number of classes, and the typical activation function is softmax.
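A sketch of the softmax activation together with categorical cross entropy, assuming one-hot encoded labels; the functions and numbers are my own illustration of the standard definitions.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; each row sums to 1.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, probs, eps=1e-12):
    # y_true is one-hot encoded; only the probability of the true class matters.
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(probs), axis=-1))

logits = np.array([[2.0, 0.5, 0.1]])   # raw outputs of the n output nodes
y_true = np.array([[1.0, 0.0, 0.0]])   # one-hot label for class 0
probs = softmax(logits)
print(probs)                                     # ~[[0.73, 0.16, 0.11]]
print(categorical_cross_entropy(y_true, probs))  # ~0.32
```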
That’s it for this lecture. Thank you very much for watching and see you in the next one.