MLSS Africa Resources
Ferenc's lecture notes:
- Why Deep Learning Generalises
- Stochastic Gradient Descent
- Approximation with Deep Networks
- Google Colab on Deep Linear Networks
Related blog posts:
- Some Intuition on Neural Tangent Kernels
- Implicit Regularisation of SGD
- Information Theoretic Bounds for SGD
References and Reading List
- Bad Global Minima Exist and SGD Can Reach Them
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization
- Understanding Deep Learning Requires Rethinking Generalisation
- A closer look at memorisation in Deep Networks
- Reconciling modern machine learning practice and the bias-variance trade-off
- Deep Double Descent: Where Bigger Models and More Data Hurt
- Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
- Git Re-Basin: Merging Models modulo Permutation Symmetries
- Deep learning generalizes because the parameter-function map is biased towards simple functions
- Benign overfitting in linear regression
- Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime
- Triple descent and the Two Kinds of Overfitting: Where & Why do they Appear?
- On the Origin of Implicit Regularization in Stochastic Gradient Descent
- Stochastic Training is Not Necessary for Generalization
- Implicit Regularization in Deep Matrix Factorization
- Implicit Bias of Gradient Descent on Linear Convolutional Networks
- A nice presentation by Suriya Gunasekar
- Finite Versus Infinite Neural Networks: an Empirical Study
- Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
- Implicit bias of gradient descent for mean squared error regression with wide neural networks
- Deep Equals Shallow for ReLU Networks in Kernel Regimes
- Sharpness-Aware Minimization for Efficiently Improving Generalization & Towards Understanding Sharpness-Aware Minimization
- Asymmetric Valleys: Beyond Sharp and Flat Local Minima
- The lottery ticket hypothesis: Finding sparse, trainable neural networks
- Proving the lottery ticket hypothesis: Pruning is all you need
Homework
Let's consider two models of data:
$$
f_1(x) = w_1 x + w_2
$$
with initial values $w_1=1$, $w_2=2$, and
$$
f_2(x) = 10 \cdot w_3 x + w_4
$$
with initial values $w_3=0.1$, $w_4=2$.
It is easy to verify that, at these initial values, the two functions are mathematically the same, and that the two models describe the same set of linear 1D functions.
Now let's consider observing a new datapoint $x = 7, y = 10$. We will update each model by taking a single gradient step trying to reduce the mean-squared error on this single datapoint, with learning rate $0.1$.
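To make the calculation unambiguous, this is the objective and update rule assumed in the sketch below (the plain squared error without a factor of $\tfrac{1}{2}$ is an assumption, not something the homework specifies):
$$
L(w) = \big(f(x) - y\big)^2, \qquad w_i \leftarrow w_i - 0.1\,\frac{\partial L}{\partial w_i}.
$$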
Calculate how $f_1$ and $f_2$ are going to change as a result of a single update step. Relatively speaking, do the slope and bias parameters change similarly in the two models?
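Here is a minimal plain-Python sketch you can use to check your hand calculation. It uses the squared-error convention above; the variable names are mine, not part of the homework.

```python
# Single gradient step on the squared error (f(x) - y)^2 for both parameterisations.
x, y = 7.0, 10.0
lr = 0.1

# Model 1: f1(x) = w1 * x + w2, initialised at w1 = 1, w2 = 2
w1, w2 = 1.0, 2.0
err = (w1 * x + w2) - y        # prediction error
w1 -= lr * 2 * err * x         # dL/dw1 = 2 * err * x
w2 -= lr * 2 * err             # dL/dw2 = 2 * err

# Model 2: f2(x) = 10 * w3 * x + w4, initialised at w3 = 0.1, w4 = 2
w3, w4 = 0.1, 2.0
err = (10 * w3 * x + w4) - y   # identical prediction, hence identical error
w3 -= lr * 2 * err * 10 * x    # dL/dw3 = 2 * err * 10 * x
w4 -= lr * 2 * err             # dL/dw4 = 2 * err

print("f1 after one step: slope =", w1, " bias =", w2)
print("f2 after one step: slope =", 10 * w3, " bias =", w4)
```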
If you have a solution, and perhaps some nice plots, feel free to send them to me by email: ferenc.huszar@gmail.com