Link to: Overfitting to theories of overfitting

Summary

This bias-variance decomposition is always true for the squared loss. It just defines things cleverly so that, when you expand the squares, the cross terms cancel because the relevant expressions have zero mean. ESL (The Elements of Statistical Learning) interprets the decomposition as saying that more complex models have lower “bias” because they can fit more complex patterns, but more “variance” because they are more sensitive to changes in the data.
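For reference, here is a sketch of the identity being described. The notation is an assumption and is not taken from the post: data are generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat f_D$ is the model fit on a random training set $D$ drawn independently of $\varepsilon$, and $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$ is the average prediction.

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat f_D(x))^2\big]
  &= \mathbb{E}\big[\big(\varepsilon + (f(x) - \bar f(x)) + (\bar f(x) - \hat f_D(x))\big)^2\big] \\
  &= \underbrace{\sigma^2}_{\text{noise}}
   + \underbrace{(f(x) - \bar f(x))^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_D\big[(\hat f_D(x) - \bar f(x))^2\big]}_{\text{variance}} .
\end{aligned}
% The three cross terms vanish:
%   2 (f(x) - \bar f(x)) \, \mathbb{E}[\varepsilon] = 0                        (zero-mean noise)
%   2 \, \mathbb{E}[\varepsilon (\bar f(x) - \hat f_D(x))] = 0                 (noise independent of D, zero mean)
%   2 (f(x) - \bar f(x)) \, \mathbb{E}_D[\bar f(x) - \hat f_D(x)] = 0          (definition of \bar f)
```

Each term is non-negative, and the identity holds for any fitting procedure. It says nothing on its own about how the terms move relative to one another.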

However, this decomposition is not a tradeoff because there is nothing that suggests these terms need to trade off. No fundamental law of functional analysis says that if one term is small, the other is large. In fact, there’s nothing that prevents both terms from being zero. I can certainly build models where some have low bias and high variance, some have high bias and low variance, and some are just right. It all depends on how you define the models and their complexity.
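One way to see that both terms can vanish together is to estimate them by simulation. The sketch below is illustrative only and rests on assumptions that are not in the post: a hypothetical linear ground truth `f_true`, noiseless labels, and a degree-1 polynomial fit via `numpy.polyfit`. The helper names (`bias_variance`, `sample_train`, `fit_predict`) are made up for this example.

```python
import numpy as np

def bias_variance(fit_predict, sample_train, f_true, x_eval, n_trials=200, seed=0):
    """Monte Carlo estimate of squared bias and variance over random training sets."""
    rng = np.random.default_rng(seed)
    # One row of predictions per independently drawn training set.
    preds = np.stack([fit_predict(*sample_train(rng), x_eval) for _ in range(n_trials)])
    target = f_true(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - target) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    # Error against the noiseless target, so there is no separate noise term here.
    mse = float(np.mean((preds - target) ** 2))
    return bias_sq, variance, mse

def f_true(x):
    # Hypothetical ground truth (an assumption for illustration).
    return 2.0 * x - 1.0

def sample_train(rng, n=20):
    # Noiseless labels: the simplest setting in which both terms can vanish.
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, f_true(x)

def fit_predict(x, y, x_eval):
    # Correctly specified model class: a straight line fit by least squares.
    coef = np.polyfit(x, y, deg=1)
    return np.polyval(coef, x_eval)

x_eval = np.linspace(-1.0, 1.0, 50)
bias_sq, variance, mse = bias_variance(fit_predict, sample_train, f_true, x_eval)
print(f"bias^2={bias_sq:.3e}  variance={variance:.3e}  mse={mse:.3e}")
```

In this setup every training set recovers the same line, so both estimates should be zero up to floating-point error. Swapping in a noisy `sample_train` or a different `fit_predict` lets you measure the two terms for other model families without assuming in advance how they relate.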

…

The advice people draw from the bias-variance boogeyman is downright harmful. Models with lots of parameters can be good, even for tabular data. Boosting works, folks! Big neural nets generalize well. Don’t tell people that you need fewer parameters than data points. Don’t tell people that there is some spooky model complexity lurking around every corner.

Use a test set to select among the models that fit your training data well. It’s not that complicated.

The author gives practical advice in another blog post:

The main goal of applied math is to guide practice. We want theories that, while not perfect, give reasonable guidelines. But the advice from generalization theory just seems bad. I swear that all of the following bullets were lessons from my first ML class 20 years ago and are still common in popular textbooks:

- If you perfectly interpolate your training data, you won’t generalize.
- High-capacity models don’t generalize.
- You have to regularize to get good test error.
- Some in-sample errors can reduce out-of-sample error.
- Good prediction balances bias and variance.
- You shouldn’t evaluate a holdout set too many times or you’ll overfit.

This is all terrible advice!

…

What does “work” in practice? It’s hard to argue against this four-step procedure:

1. Collect as large a data set as you can
2. Split this data set into a training set and a test set
3. Find as many models as you can that interpolate the training set
4. Of all of these models, choose the model that minimizes the error on the test set

This method has been tried and true since 1962. You can say that step 4 is justified by the law of large numbers. Maybe that’s right. But there’s still a lot of magic happening in step 3.
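The four-step procedure can be sketched in a few lines. This is a toy illustration under loud assumptions, not the author's code: the data set is synthetic, and the interpolating models are a 1-nearest-neighbor predictor plus minimum-norm least-squares fits on random cosine features. Those model families are chosen here only because they are easy to make interpolate, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a synthetic stand-in for "as large a data set as you can" (assumption).
n, d = 400, 5
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)

# Step 2: split into a training set and a test set.
perm = rng.permutation(n)
train_idx, test_idx = perm[:200], perm[200:]
X_tr, y_tr, X_te, y_te = X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def mse(model, Z, t):
    return float(np.mean((model(Z) - t) ** 2))

def one_nearest_neighbor(Z):
    # Predict the label of the closest training point; zero training error
    # whenever the training inputs are distinct.
    dist_sq = ((Z[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    return y_tr[dist_sq.argmin(axis=1)]

def random_feature_model(width, scale, seed):
    # Random cosine features with more features than training points;
    # np.linalg.lstsq returns the minimum-norm least-squares solution.
    r = np.random.default_rng(seed)
    W = r.normal(scale=scale, size=(d, width))
    b = r.uniform(0.0, 2.0 * np.pi, size=width)

    def featurize(Z):
        return np.cos(Z @ W + b)

    coef, *_ = np.linalg.lstsq(featurize(X_tr), y_tr, rcond=None)
    return lambda Z: featurize(Z) @ coef

# Step 3: gather models and keep only those that interpolate the training set.
proposals = {"1-nearest-neighbor": one_nearest_neighbor}
for width in (400, 1000, 4000):
    for scale in (0.3, 1.0, 3.0):
        name = f"random-features(width={width}, scale={scale})"
        proposals[name] = random_feature_model(width, scale, seed=width)

candidates = {name: m for name, m in proposals.items() if mse(m, X_tr, y_tr) < 1e-8}

# Step 4: of the interpolating models, choose the one with the lowest test error.
test_errors = {name: mse(m, X_te, y_te) for name, m in candidates.items()}
best = min(test_errors, key=test_errors.get)
print(f"{len(candidates)} interpolating models; best on test: {best}")
```

The filter in step 3 is deliberate: some proposed fits may fail to interpolate for numerical reasons, and the procedure as stated only compares models that do. Where the useful candidates come from is left open here, which is the "magic" the passage attributes to step 3.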