to machine learning—Part II
Model Tuning and the Bias-Variance Tradeoff
The goal of modeling is to approximate real-life situations by identifying and encoding patterns in data. Models make mistakes if those patterns are overly simple or overly complex.
In Part I, we created a model that distinguishes homes in San Francisco from those in New York. Now we'll talk about tuning and the bias-variance tradeoff.
Models can be adjusted to change the way they fit the data. These 'settings' are called parameters (often called hyperparameters, to distinguish them from the values a model learns from the data). An example of a decision-tree parameter is the minimum node size, which regulates the creation of new splits. A node will not split if the number of data points it contains is below the minimum node size.
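As a minimal sketch of that rule in Python: the function name `should_split` and the extra purity check are assumptions made for illustration, not something taken from Part I.

```python
def should_split(labels, min_node_size):
    """Return True if a node holding `labels` is allowed to split further."""
    # Rule from the text: no split when the node holds fewer data points
    # than the minimum node size.
    if len(labels) < min_node_size:
        return False
    # Assumption for this sketch: a node whose points all share one label
    # has nothing left to separate, so it does not split either.
    return len(set(labels)) > 1


print(should_split(["SF", "NY"], min_node_size=1))
print(should_split(["SF", "NY"], min_node_size=3))
```

With a minimum node size of one, the only thing that stops a node from splitting in this sketch is that it is already homogeneous.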
The tree from Part I had a minimum node size of one. It was very complex, had lots of splits, and overfit the data. To see why, let’s revisit how the decision tree was trained.
The simplest version of a decision tree is called a stump. A stump consists of a single split and therefore a single rule, such as “Every house whose elevation is above 34 feet is in San Francisco, and all others are in New York.”
Stumps take a binary view of the world and ignore complexity and nuance in the training data. This black-and-white interpretation of the world is prone to errors due to bias.
A model with too much bias systematically ignores relevant details and is wrong in consistent ways. The stump above, for example, incorrectly classifies every lower-elevation home in San Francisco.
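The stump's rule is small enough to write out directly. In the sketch below, the 34-foot threshold comes from the example rule in the text; the six homes are made-up values chosen only to show the pattern of errors.

```python
THRESHOLD_FEET = 34  # split point quoted in the example rule


def stump_predict(elevation_feet):
    """A stump: one split, one rule."""
    return "San Francisco" if elevation_feet > THRESHOLD_FEET else "New York"


# Hypothetical homes (elevation in feet, true city); not real data.
homes = [
    (5, "New York"),
    (12, "San Francisco"),
    (20, "San Francisco"),
    (30, "New York"),
    (60, "San Francisco"),
    (150, "San Francisco"),
]

# Collect every home the stump gets wrong.
errors = [(elev, city) for elev, city in homes if stump_predict(elev) != city]
print(errors)
```

Every mistake in this toy set is a low-elevation San Francisco home. That is what "wrong in consistent ways" looks like: the errors all fall on the same side of the rule.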
To decrease the error due to bias, you can add more splits to the tree.
Additional splits allow the tree to take into account more complexity. You can add splits until a tree's leaf nodes contain only homes in either San Francisco or New York.
The question is, how does it perform on the test data?
As we observed in Part I, this fully-grown tree is not as accurate on test data. For example, with this tree the error rate is 12%.
This overly-complex tree suffers from errors due to variance.
High-variance models make mistakes by overfitting to the idiosyncrasies of the training data. They tend to be wrong in inconsistent ways.
To see exactly what's going on, let’s switch back to the training set and follow the creation of a single leaf node.
A tangible example of variance
This leaf node is the result of eight separate forks. Each fork divides the data set into smaller subsets, until the leaf node contains a single San Francisco home.
To see how repeated splitting leads to variance errors, we will show the series of forks as distribution dot plots.
At each fork, the split point is set such that the resulting branches are as homogeneous as possible.
Each split point is selected greedily, rather than taking into account what might be better later on. Splits earlier in the tree can have a cascading effect deeper in the tree.
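A greedy split search can be sketched as follows. The text only says the branches should be "as homogeneous as possible"; using Gini impurity as the measure of homogeneity is an assumption here, as are the function names and the sample elevations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(points):
    """Pick the threshold that minimizes size-weighted impurity.

    points: list of (value, label) pairs.
    Returns (threshold, weighted_impurity), or None if no split exists.
    Only this one fork is considered; nothing looks ahead to later forks.
    """
    values = sorted({value for value, _ in points})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate midway between neighbors
        left = [label for value, label in points if value <= threshold]
        right = [label for value, label in points if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical elevations in feet; not real data.
sample = [(4, "NY"), (10, "NY"), (28, "NY"), (40, "SF"), (73, "SF"), (120, "SF")]
print(best_split(sample))
```

Because each fork is scored on its own, a split that looks best now can leave later forks with awkward, tiny subsets to work with.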
The deeper you go into the branch, the less data is available to create splits.
…In an extreme case (like this one), the final split is based on only two rather arbitrary data points.
The arbitrariness of the split is reflected in model accuracy. If you put the test data through the tree, five homes satisfy the rules along this branch. The model thinks these homes should be in San Francisco.
…but all five are in New York.
The final forks were made using very little data, so it's no surprise that the generalizations they make are incorrect. Patterns drawn from two homes are more likely to be flukes than anything real. The fact that the node is 100% wrong is surprising but useful.
Flukes are normal
This is not an isolated incident. For example, we could grow additional trees from random subsets of the training data.
From the training set of 250 homes…
… let's draw four random samples of 200 homes each and grow trees based on them.
The resulting trees all look reasonably different and also have single-home leaf nodes.
Homes that end up alone in single-data-point leaf nodes may seem esoteric, but they are a normal part of any data set. The leaf nodes themselves are an outcome of the method for fitting the model.
When the minimum node size parameter is one, the tree grows until every branch has a homogeneous leaf node.
For a given data set, growing the tree on a different set of homes changes what the branches overfit to, but overfitting still occurs. The data itself can also be a source of error: non-response in polling, for example, produces data that leads to a biased model. Models that overfit are unstable and sensitive to small changes in the training data, and thus have high variance.
The Bias-Variance Tradeoff
One way to address errors from overfitting is to impose limits on how a tree grows by changing the minimum-node-size threshold.
As the minimum-node-size threshold increases, there are fewer splits. The trees get less bushy.
The accuracy of each tree improves as errors due to variance decrease.
As the minimum-node-size threshold continues to increase, the accuracy begins to deteriorate from error due to bias.
Until you get back to a stump.
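The sweep over minimum node sizes can be sketched with a small one-feature tree. Everything in this block is an assumption made for illustration: the synthetic elevations, the Gini-based split search, and the helper names. It is not the tree or the data from Part I.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow(points, min_node_size):
    """Grow a tree from (elevation, city) pairs.

    A leaf is a city name; an internal node is (threshold, left, right).
    """
    labels = [city for _, city in points]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is too small to split or already homogeneous.
    if len(points) < min_node_size or len(set(labels)) == 1:
        return majority
    best = None
    values = sorted({x for x, _ in points})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        if not left or not right:
            continue
        score = (len(left) * gini([c for _, c in left])
                 + len(right) * gini([c for _, c in right]))
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    if best is None:  # no usable split point exists
        return majority
    _, threshold, left, right = best
    return (threshold, grow(left, min_node_size), grow(right, min_node_size))


def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])


def error_rate(tree, points):
    return sum(predict(tree, x) != city for x, city in points) / len(points)


def make_homes(n, rng):
    """Synthetic homes: SF tends to sit higher, with plenty of overlap."""
    homes = []
    for _ in range(n):
        if rng.random() < 0.5:
            homes.append((rng.gauss(60, 40), "SF"))
        else:
            homes.append((rng.gauss(15, 15), "NY"))
    return homes


rng = random.Random(0)
train = make_homes(250, rng)
test = make_homes(250, rng)

results = []
for size in [1, 5, 20, 60, 250]:
    tree = grow(train, size)
    results.append((size, count_leaves(tree),
                    error_rate(tree, train), error_rate(tree, test)))

for size, leaves, train_err, test_err in results:
    print(f"min node size {size:>3}: {leaves:>3} leaves, "
          f"train error {train_err:.3f}, test error {test_err:.3f}")
```

In this sketch, a minimum node size of one fits the training homes perfectly, and a minimum node size equal to the size of the training set allows only the root to split, which gives a stump. The test errors in between depend on the synthetic data, so treat the printed numbers as an illustration of the sweep rather than a result.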
A model that is overly-simplistic is just as problematic as one that is overly-scrupulous. Errors due to bias and those due to variance are distinct. Understanding the tradeoff between bias and variance (and how different model types let you balance the two) is foundational to modeling well.
The Tradeoff in Abstract Terms
A common way to visualize the relationship between model complexity and accuracy is on a chart.
The relationship between a parameter like minimum node size and model error illustrates the tradeoff between bias and variance more explicitly.
When a model is less complex, it ignores relevant information, and error due to bias is high. As the model becomes more complex, error due to bias decreases.
On the other hand, when a model is less complex, error due to variance is low. Error due to variance increases as complexity increases.
Overall model error is a function of error due to bias plus error due to variance. The ideal model minimizes their sum, balancing the two rather than driving either one to zero.
You can actually show mathematically that error due to bias and error due to variance are distinct. Necessary caveats: This excludes irreducible error (the variance of error terms). Also, "Bias" is actually "Bias²."
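For reference, one standard way to write that decomposition is shown below. The notation is an assumption (it does not appear in the original), and the identity is stated for squared-error loss; classification error, as in the housing example, does not split this cleanly.

```latex
% Assumed setup (notation introduced here, not in the original article):
%   y = f(x) + \varepsilon, with E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2
%   \hat{f}(x) is the model fit on a randomly drawn training set;
%   expectations are taken over training sets and noise.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first term is the squared bias mentioned in the caveat, the second is the variance of the fitted model across training sets, and the third is the irreducible error that no choice of model removes.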
Even at their optimal depth, single decision trees aren’t the best performing models. While trees are very easy to understand, the world is more complex than a bunch of if-then statements.
Nevertheless, decision trees can be used in aggregate to yield very strong results. We'll discuss these ensemble methods in Part III.
- Models approximate real-life situations using limited data.
- In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).
- Building models is about making sure there's a balance between the two.
A Visual Introduction to Machine learning — Part II is finally here! Hope you enjoy learning about Bias and Variance. https://t.co/GY8JnylS17— r2d3.us (@r2d3us) June 18, 2018