Why Do Interviewers Ask About It?
Questions about the bias-variance tradeoff come up very frequently in interviews for data scientist positions. They often serve to distinguish a seasoned data scientist who knows their stuff from a junior one… and more specifically, from one who is unfamiliar with their options for mitigating prediction error within a model.
So what is it again?
So, the bias-variance tradeoff… ever heard of it? If not, you’ll want to tune in.
The bias-variance tradeoff is a simple idea, but one that should inform much of the statistical analysis and modeling that you do, primarily when it comes to reducing error in your predictions.
Where error comes into play
When you create a model, your model will have some error. Makes sense! Nothing new here; what is new is the idea that said error is actually made up of two things (plus some irreducible noise that no model can remove)… You guessed it, bias and variance! Sorry to drill this in so hard, but the reason this matters is that once you understand the component pieces of your error, you can determine a plan to minimize it.
There are different methods and approaches you can take to manage and minimize bias or variance, but doing so comes with tradeoffs: reducing one often increases the other. That is why it is so pivotal for you as a data scientist to understand the effects of each.
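For readers who like to see it written down, the usual textbook form of this breakdown for squared error is sketched below. The notation is an assumption of this sketch and is not used elsewhere in this post: the data follow $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over different training sets and the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first two terms are the ones your modeling choices can move; the last one is a floor set by the data itself.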
Let’s break down bias
Bias represents the difference between our model’s average prediction and the actual values.
High bias vs low bias
A model with high bias is one that gleans little from the data when generating predictions. A common phrase you might hear is that a high bias model is ‘over generalized’, or underfit. It depends very little on the training data to determine its predictions, so it tends to perform poorly on your training data and on your test data alike.
There may be assumptions implicit within our approach that lead to a lack of attention given to those features that would allow a model to generate better predictions.
Conversely, a low bias model is one whose predictions are, on average, close to the actual values. Bias, then, is something we’d clearly want to minimize.
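To make high bias concrete, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: the data are simulated (y is roughly 3x plus noise), the "high bias" model is a constant that ignores x entirely, and the comparison model is a simple least-squares line.

```python
import random
from statistics import mean

random.seed(0)

def make_data(n):
    """Simulated (hypothetical) data: y depends linearly on x, plus noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def mse(preds, ys):
    """Mean squared error between predictions and actuals."""
    return mean((p - y) ** 2 for p, y in zip(preds, ys))

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

# High bias model: ignores x and always predicts the training mean of y.
constant = mean(train_y)
const_train_mse = mse([constant] * len(train_y), train_y)
const_test_mse = mse([constant] * len(test_y), test_y)

# Lower bias model: a least-squares line, which is able to use x.
x_bar, y_bar = mean(train_x), mean(train_y)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(train_x, train_y))
         / sum((x - x_bar) ** 2 for x in train_x))
intercept = y_bar - slope * x_bar
line_train_mse = mse([intercept + slope * x for x in train_x], train_y)
line_test_mse = mse([intercept + slope * x for x in test_x], test_y)

print(f"constant model: train MSE={const_train_mse:.2f}, test MSE={const_test_mse:.2f}")
print(f"linear model:   train MSE={line_train_mse:.2f}, test MSE={line_test_mse:.2f}")
```

The thing to notice is that the constant model is bad on both the training and the test data: it never learned the relationship in the first place, which is the signature of high bias.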
What does variance mean for your model?
Variance is pretty much what it sounds like: it has to do with the spread of our predictions and how ‘variable’ they are when the model is trained on different data. If you’ve ever heard the term ‘overfitting’, this is effectively an explanation of the outcomes of a high variance model.
In sharp contrast to a high bias model, a high variance model is one that ‘over depends’, you could say, on your training data. In fact, that model may perform very well on its training data. It may fit the training data so well that it looks like an excellent model at first glance, but the moment you attempt to generalize your model to your test data… it does so very poorly. The model is fit far too closely to your training data.
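A quick way to see this is with a model that simply memorizes its training data. The sketch below is an illustration under assumptions, not a recommended model: the data are simulated (a sine curve plus noise), and the memorizer is a 1-nearest-neighbor predictor written by hand with the standard library.

```python
import math
import random
from statistics import mean

random.seed(1)

def make_data(n):
    """Simulated (hypothetical) data: a sine curve plus noise."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.5) for x in xs]
    return xs, ys

def nearest_neighbor_predict(x, train_x, train_y):
    """Return the y of the single closest training point (pure memorization)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(xs, ys, train_x, train_y):
    """Mean squared error of the memorizer on the points (xs, ys)."""
    return mean((nearest_neighbor_predict(x, train_x, train_y) - y) ** 2
                for x, y in zip(xs, ys))

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

train_mse = mse(train_x, train_y, train_x, train_y)
test_mse = mse(test_x, test_y, train_x, train_y)

print(f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

On the training data each point is its own nearest neighbor, so the training error is zero. On new data the model reproduces the noise it memorized, and the error jumps.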
Understanding the overlap between bias and variance
The image below is an excellent representation of how high or low bias can combine with high or low variance in a model.
Let’s talk about situations in which bias is high: no matter how much the predictions vary, the model is missing whatever signals it needs to interpret or leverage, and as a result finds itself far from the bullseye.
In situations where bias is low, we can see that predictions are at least centered on actuals. Whether variable or not, we’re directionally better off.
With high variance, we see that the outcomes are all over the place, clearly overfitting to the data the model has seen before. While these outcomes appear directionally correct, they lack generalizability to new data… which should typically be the purpose behind building any model.
In instances of low variance, we can see that the predictions themselves vary significantly less.
Obviously each form of error occurs along a spectrum, but this visualization serves to cement the challenges of this tradeoff.
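The bullseye picture can also be simulated. The sketch below is a rough illustration under assumptions: the "true" relationship is a sine curve, a new training set is drawn for each throw at the board, and two hypothetical models predict at one fixed point. The rigid model always predicts the training mean; the flexible model copies the nearest training point.

```python
import math
import random
from statistics import mean, pvariance

random.seed(1)

NOISE_SD = 0.5
X0 = math.pi / 2          # the point at which we make predictions
TRUE_Y0 = math.sin(X0)    # the bullseye: the true value at X0

def sample_training_set(n=50):
    """Draw a fresh simulated training set (sine curve plus noise)."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, NOISE_SD) for x in xs]
    return xs, ys

def rigid_model(xs, ys, x):
    """Ignores x and predicts the mean of the training targets."""
    return mean(ys)

def flexible_model(xs, ys, x):
    """Copies the target of the single nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[nearest]

def bias_and_variance(model, trials=300):
    """Estimate squared bias and variance of a model's prediction at X0."""
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set()
        preds.append(model(xs, ys, X0))
    bias_sq = (mean(preds) - TRUE_Y0) ** 2   # how far the average throw lands
    variance = pvariance(preds)              # how scattered the throws are
    return bias_sq, variance

rigid_bias_sq, rigid_var = bias_and_variance(rigid_model)
flex_bias_sq, flex_var = bias_and_variance(flexible_model)

print(f"rigid:    bias^2={rigid_bias_sq:.3f}, variance={rigid_var:.3f}")
print(f"flexible: bias^2={flex_bias_sq:.3f}, variance={flex_var:.3f}")
```

The rigid model's throws cluster tightly but far from the bullseye (high bias, low variance), while the flexible model's throws are centered on it but scattered (low bias, high variance).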
Why is it difficult to have both?
When it comes to the design of your model, you will be forced to make certain decisions, and implicit in those decisions lies the act of leaning in one direction or the other.
Let’s say you are working with a random forest algorithm, and in an effort to improve performance, you begin tuning hyperparameters… for example, letting the trees grow deeper and deeper and sampling more variables at each split. (Adding more trees on its own tends to stabilize a random forest rather than overfit it.) While this would give you certain performance gains up to a point, what will happen over time is that your model becomes far too familiar with the data it’s seen, and any subsequent call to generate predictions will likely treat new data too similarly to that which it has seen.
You can also think about this from the perspective of the number of variables that are included, especially those that are categorical. The more inputs, the more a model may understand about your training data, but potentially the less capable it will be of generalizing to data it has never seen. Again we see the consideration one might need to make in favor of mitigating either bias or variance.
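The push and pull described above can be sketched by sweeping a single complexity knob and watching training and test error move apart. To keep the example dependency-free, a hand-written k-nearest-neighbors regressor stands in for the random forest here; that swap, the simulated data, and the particular values of k are all assumptions of this sketch. A small k means a very flexible model, and a large k means a very rigid one.

```python
import math
import random
from statistics import mean

random.seed(2)

def make_data(n):
    """Simulated (hypothetical) data: a sine curve plus noise."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.5) for x in xs]
    return xs, ys

def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return mean(train_y[i] for i in nearest)

def knn_mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    return mean((knn_predict(x, train_x, train_y, k) - y) ** 2
                for x, y in zip(xs, ys))

train_x, train_y = make_data(200)
test_x, test_y = make_data(500)

results = {}  # k -> (train MSE, test MSE)
for k in (1, 5, 25, 200):
    results[k] = (knn_mse(train_x, train_y, train_x, train_y, k),
                  knn_mse(test_x, test_y, train_x, train_y, k))
    print(f"k={k:>3}: train MSE={results[k][0]:.3f}, test MSE={results[k][1]:.3f}")

# The k with the lowest test error is the best compromise in this sweep.
best_k = min(results, key=lambda k: results[k][1])
print(f"lowest test error at k={best_k}")
```

Training error keeps improving as the model gets more flexible, all the way down to zero at k=1, but test error is lowest somewhere in the middle. That middle ground is the tradeoff in action.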
So, we’ve thrown a variety of definitions around, talked about how they play together… but what’s the point of talking about this? I’d boil it all down to consideration. Without an awareness of the effects of model design on outcomes and the ability to define our error, we have no recourse to improve.
You now have greater insight into how your model design might affect its utility in the end. Use that insight, be methodical in your considerations, and build some awesome models!
I hope you enjoyed this! For more posts about machine learning, data science, and the like, visit me at datasciencelessons.com or follow me on Medium!
Happy data science-ing!