# MLSS Africa Resources

## Ferenc's lecture notes:

- Why Deep Learning Generalises
- Stochastic Gradient Descent
- Approximation with Deep Networks
- Google Colab on Deep Linear Networks

## Related blog posts:

- Some Intuition on Neural Tangent Kernels
- Implicit Regularisation of SGD
- Information Theoretic Bounds for SGD

## References and Reading List

- Bad Global Minima Exist and SGD Can Reach Them
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization
- Understanding Deep Learning Requires Rethinking Generalisation
- A closer look at memorisation in Deep Networks
- Reconciling modern machine learning practice and the bias-variance trade-off
- Deep Double Descent: Where Bigger Models and More Data Hurt
- Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
- Git Re-Basin: Merging Models modulo Permutation Symmetries
- Deep learning generalizes because the parameter-function map is biased towards simple functions
- Benign overfitting in linear regression
- Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime
- Triple descent and the Two Kinds of Overfitting: Where & Why do they Appear?
- On the Origin of Implicit Regularization in Stochastic Gradient Descent
- Stochastic Training is Not Necessary for Generalization
- Implicit Regularization in Deep Matrix Factorization
- Implicit Bias of Gradient Descent on Linear Convolutional Networks
- A nice presentation by Suriya Gunasekhar
- Finite Versus Infinite Neural Networks: an Empirical Study
- Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
- Implicit bias of gradient descent for mean squared error regression with wide neural networks
- Deep Equals Shallow for ReLU Networks in Kernel Regimes
- Sharpness-Aware Minimization for Efficiently Improving Generalization & Towards Understanding Sharpness-Aware Minimization
- Asymmetric Valleys: Beyond Sharp and Flat Local Minima
- The lottery ticket hypothesis: Finding sparse, trainable neural networks
- Proving the lottery ticket hypothesis: Pruning is all you need

## Homework

Let's consider two models of data:

$$
f_1(x) = w_1 x + w_2
$$

with initial values $w_1=1$, $w_2=2$, and

$$
f_2(x) = 10\cdot w_3 x + w_4
$$

with initial values $w_3=0.1$, $w_4=2$.

It is easy to verify that, at these initial values, the two functions are in fact mathematically identical, and that the two models describe the same set of linear 1D functions.

Now let's consider observing a new datapoint $x = 7, y = 10$. We will update each model by taking a single gradient step trying to reduce the mean-squared error on this single datapoint, with learning rate $0.1$.

Calculate how $f_1$ and $f_2$ will change as a result of a single update step. Relatively speaking, do the slope and bias parameters change similarly in the two models?
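If you want to check your hand calculation, the update can be sketched in a few lines of plain Python. This is one possible reading of the exercise: the loss on the single datapoint is taken to be the squared error $(f(x) - y)^2$ (the mean over one point), and the learning rate and initial values follow the problem statement above.

```python
# One gradient step of squared-error loss on the datapoint (x, y) = (7, 10),
# comparing the two parameterisations of the same linear function:
#   f1(x) = w1*x + w2        with w1 = 1,   w2 = 2
#   f2(x) = 10*w3*x + w4     with w3 = 0.1, w4 = 2
x, y = 7.0, 10.0
lr = 0.1

# Model 1: gradient of (w1*x + w2 - y)^2 w.r.t. (w1, w2)
w1, w2 = 1.0, 2.0
err1 = (w1 * x + w2) - y        # prediction error: 9 - 10 = -1
w1 -= lr * 2 * err1 * x         # d/dw1 = 2 * err * x
w2 -= lr * 2 * err1             # d/dw2 = 2 * err

# Model 2: gradient of (10*w3*x + w4 - y)^2 w.r.t. (w3, w4)
w3, w4 = 0.1, 2.0
err2 = (10 * w3 * x + w4) - y   # same prediction, same error: -1
w3 -= lr * 2 * err2 * 10 * x    # d/dw3 = 2 * err * 10 * x  (extra factor 10)
w4 -= lr * 2 * err2             # d/dw4 = 2 * err

print("f1 slope, bias:", w1, w2)          # slope ≈ 2.4,  bias ≈ 2.2
print("f2 slope, bias:", 10 * w3, w4)     # slope ≈ 141,  bias ≈ 2.2
```

Note how the effective slope of $f_2$, i.e. $10 w_3$, moves much further than the slope of $f_1$, even though the two models started out as the same function: gradient descent is not invariant to reparameterisation.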

If you have a solution, perhaps with some nice plots, feel free to send it to me by email: ferenc.huszar@gmail.com