Very interestingly, the double-descent phenomenon can explain why models with millions of parameters, such as monstrous artificial neural networks, can still achieve low test risk. This is remarkable, because the previously known U-shaped curve could not justify why such big models still generalize so well. It explains why very rich deep neural networks tend to perform and generalize so beautifully. Such models (in the modern interpolating regime) are over-parametrized beyond the interpolation threshold, as opposed to models in the classical regime (i.e., any model with capacity below the interpolation threshold).
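To make the double-descent curve concrete, here is a minimal simulation sketch (my own illustration, not taken from the paper): a fixed training set of 20 points, and a linear model that is fit with a growing number of features using minimum-norm least squares. All sizes, the noise level, and the seed are arbitrary choices for illustration. The average test error first falls, spikes near the interpolation threshold (number of features equal to number of training points), and then descends again in the over-parametrized regime.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_test_error(p, n_train=20, d=60, n_test=500, noise=0.5, trials=50):
    """Average test MSE of the minimum-norm least-squares fit
    that uses only the first p of d underlying features."""
    errs = []
    for _ in range(trials):
        w_true = rng.normal(size=d) / np.sqrt(d)
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
        y_te = X_te @ w_true + noise * rng.normal(size=n_test)
        # pinv gives the minimum-norm solution; once p >= n_train
        # the fit interpolates the training data exactly.
        w_hat = np.linalg.pinv(X_tr[:, :p]) @ y_tr
        errs.append(np.mean((X_te[:, :p] @ w_hat - y_te) ** 2))
    return np.mean(errs)

# Test error peaks near the interpolation threshold (p == n_train == 20)
# and falls again beyond it: the second descent.
errors = {p: avg_test_error(p) for p in (5, 20, 60)}
```

Plotting `errors` over a finer grid of `p` values reproduces the characteristic double-descent shape: a classical U on the left of the threshold, a spike at it, and a second descent to the right.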

The authors highlight a very interesting insight:

Models with high capacity can achieve low test risk (i.e., high generalizability) because the capacity of the model **does not** __necessarily__ reflect how well the model matches the inductive bias appropriate for the problem at hand!

For the datasets used in the paper, the authors argue that smoothness or regularity of a function is the appropriate inductive bias, which can be measured by a suitable function-space norm. In such problems, modern monstrous deep learning architectures find low-norm solutions, which correspond to smooth functions without wild oscillations. It is very interesting to note that choosing the smoothest function that fits the training data perfectly is reminiscent of Occam’s razor, which states that:

The simplest explanation compatible with the observations should be preferred!

By considering models with high capacity, we are in effect searching over a larger set of candidate functions, and among the many functions that fit the training data perfectly we can find ones with smaller norm, which are therefore “simpler”. This is how the authors believe high-capacity models can improve performance (i.e., generalizability) so dramatically. To understand function-space norms, you can take a look at the theory of Reproducing Kernel Hilbert Spaces (RKHS) or the related theory of Sobolev spaces. These norms are similar to vector norms in their properties, but they additionally control the smoothness of the function.
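The idea of preferring the smallest-norm solution among all perfect fits can be seen directly in over-parametrized linear regression. The sketch below (sizes and seed are arbitrary choices for illustration) uses the fact that NumPy's pseudoinverse returns the minimum-norm least-squares solution: every other interpolating weight vector differs by a null-space direction and has a strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Over-parametrized linear model: more features (p) than samples (n),
# so infinitely many weight vectors fit the training data exactly.
n, p = 10, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The pseudoinverse picks the minimum-norm interpolating solution.
w_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_min, y)  # fits the training data perfectly

# Build another interpolating solution by adding a null-space direction:
# project a random vector onto null(X) and add it to w_min.
v = rng.normal(size=p)
v_null = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min + v_null
assert np.allclose(X @ w_other, y)  # still interpolates...
assert np.linalg.norm(w_other) > np.linalg.norm(w_min)  # ...but is "less simple"
```

Because `w_min` lies in the row space of `X` and `v_null` in its null space, the two are orthogonal, so any such alternative solution is guaranteed to have a larger norm. In this sense the minimum-norm interpolant is the "simplest" function compatible with the observations.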