These slides focus on optimizers for neural networks.
1. Hello and welcome to this lecture on Optimizers.
2. This is the last missing piece of our model. We spoke about the loss function, which should be minimized during training. The weights of the nodes are updated to minimize the loss function, but the burning question is how. One thing you might do is brute force: just check all possible combinations of weights and pick the best one. But here you have the problem that the solution space is so overwhelmingly huge that it would take extremely long to find the optimum. So that is not the way to go forward. What other options do we have? This is where optimizers enter the scene. They implement an educated trial-and-error process, so that the loss is eventually decreased. In the next slides we will discuss different optimizers and how they minimize the loss.
3. The most common optimizer is gradient descent. I will explain the idea.
a. In the graph you can see the loss for one specific weight. You calculate the loss for the initial weight and get a certain value. What you want to know is which way to go: should you increase or decrease the weight to reduce the loss? For this you can calculate the gradient for that weight, which is the partial derivative of the loss with respect to the weight. Its sign tells you in which direction to continue the search.
b. Here it tells us to increase the weight. We do so, set a slightly higher weight, and get this loss value.
c. We repeat this step and see that we still need to increase the weight.
d. Now if we do this again, we see that the loss is increasing.
e. Just to be sure, we increase once more and see that the loss increases even further. So now we know where the minimum lies. This is the idea of gradient descent, and it is widely used across many fields of mathematics.
f. Here you can see the process. You start by initializing the weights. Then you calculate the gradient of the loss with respect to the weights and use this information to adjust them. If no minimum is found, you start the loop again: calculate the gradient for the updated weights and adjust them once more. This procedure is repeated until the minimum of the loss is reached. A minimal sketch of this loop follows.
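To make the loop concrete, here is a minimal sketch in Python. It assumes a toy quadratic loss with a single weight; the loss function, the starting weight, and the stopping threshold are illustrative choices, not something from the slides.

```python
# Minimal gradient descent on a toy quadratic loss L(w) = (w - 3)^2,
# whose minimum is at w = 3. All concrete values here are illustrative.

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Analytic derivative dL/dw of the loss above.
    return 2.0 * (w - 3.0)

w = 0.0              # initialize the weight
learning_rate = 0.1  # step size; discussed on the next slides

for step in range(100):
    grad = gradient(w)
    if abs(grad) < 1e-6:       # gradient (almost) zero: a minimum is reached
        break
    w -= learning_rate * grad  # step against the gradient

print(f"found w = {w:.4f} with loss = {loss(w):.6f}")  # w approaches 3.0
```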
4. You might run into a problem and get stuck in a local minimum.
a. You can see an example on the right side. If you change the weight slightly in either direction, the loss increases, yet there is still another minimum with a much lower loss.
b. There are two points that help to overcome this problem. First, the shape of the loss function itself. In the graph you can see the loss function, which might be the mean squared error. For simple models it has a convex shape, and this helps to avoid local minima.
c. The other solution relies on the learning rate, which we will discuss next. The sketch below first demonstrates how plain gradient descent can get stuck in a local minimum.
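Here is a small sketch that runs plain gradient descent on a hypothetical non-convex loss with two minima; the function and all constants are made up purely for this demonstration.

```python
# Hypothetical non-convex loss with a local minimum near w ≈ 1.93 and a
# lower, global minimum near w ≈ -2.06.
def loss(w):
    return w**4 - 8 * w**2 + 2 * w

def gradient(w):
    # Analytic derivative dL/dw.
    return 4 * w**3 - 16 * w + 2

w = 1.5  # a starting point on the "wrong" side of the hill
for _ in range(1000):
    w -= 0.01 * gradient(w)

print(f"converged to w = {w:.3f} with loss = {loss(w):.3f}")
# -> w ≈ 1.93 (local minimum), even though loss(-2.06) ≈ -20.1 is far lower
```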
5. The learning rate defines the size of the weight changes. The optimal learning rate would find the minimum as reliably and as quickly as possible. These two targets, reliability and speed, cannot both be fulfilled at the same time.
a. So there are two things that might happen. First, the learning rate is too high. What happens is that the steps from one weight to the next are very large, and you risk overshooting and missing the minimum. You can see this in the graph: the distance between two weight values is large and the minimum is not found.
b. The other problem that might occur is that the learning rate is too low. Then many small steps are performed and you find the minimum for sure, but it takes much too long to get there, so the process is very time-consuming. The sketch below illustrates both failure modes.
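The following short sketch contrasts the two failure modes on the toy quadratic loss from before; the three learning rates are illustrative values chosen to make each effect visible.

```python
# Effect of the learning rate on the toy loss L(w) = (w - 3)^2 from above.

def gradient(w):
    return 2.0 * (w - 3.0)

def run(learning_rate, steps=20):
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

print(run(1.1))    # too high: every step overshoots and w diverges from 3
print(run(0.001))  # too low: after 20 steps w has barely moved away from 0
print(run(0.3))    # a reasonable rate: w ends up very close to 3
```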
6. There are many more optimizers. I will present only a few of them.
a. One is Adagrad, which adapts the learning rate per feature, so that different weights effectively get different learning rates. You can expect this to find the minimum faster. This technique works well on sparse datasets, that is, datasets that have empty values for many features. A problem here is that the learning rate decreases over time and sometimes becomes too small, so that learning is slow, which was the very problem we wanted to overcome. Other optimizers build on Adagrad and are meant to solve this problem, among them Adadelta and RMSprop.
b. A very famous optimizer is Adam. Adam is an acronym for Adaptive Moment Estimation. It applies momentum, which means that previous gradients are stored and used in the current weight update. This optimizer is very powerful and widespread. A sketch of the Adagrad and Adam update rules appears after this list.
c. There are many more, for example stochastic gradient descent and batch gradient descent, to name just a few.
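As a rough sketch of how these adaptive methods differ from plain gradient descent, here are the Adagrad and Adam update rules in NumPy. The hyperparameters carry the commonly used default values, and the toy usage loop at the end is purely illustrative.

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    """One Adagrad update for weights w with gradient g."""
    cache += g**2                          # accumulate squared gradients
    w -= lr * g / (np.sqrt(cache) + eps)   # per-weight rate shrinks over time
    return w, cache

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # running mean of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize ||w - target||^2 with Adam.
target = np.array([1.0, -2.0])
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    g = 2 * (w - target)                   # gradient of the toy quadratic loss
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)  # larger rate so the toy run converges
print(w)                                   # close to [1, -2]
```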
That’s it for this lecture on optimizers. Thank you very much for watching and see you in the next one.