We build statistical models to understand or approximate some aspect of our world.
The general modeling framework lends itself well to breaking down the purposes and approaches we might take to generate that understanding.
What is the General Modeling Framework?
Take a look at the general modeling framework as depicted by the below formula:
y = f(x) + e
- y: the outcome variable, i.e. whatever we’re trying to better understand
- x: the independent variable(s), i.e. whatever we’re using to explain y
- f(): the function that, when applied to x, approximates the value of y
- e: the error, the distance between our approximation of y through f(x) and the actual y — i.e., everything we can’t explain through f(x)
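To make these pieces concrete, here is a minimal sketch that simulates data from this framework. Every number here is made up for illustration — the true function is a simple line, and modeling amounts to trying to recover it from the observed (x, y) pairs:

```r
# Simulate the general modeling framework: y = f(x) + e
# (all values invented for illustration)
set.seed(42)

n <- 100
x <- runif(n, min = 500, max = 4000)   # e.g. square footage
f_of_x <- 50000 + 150 * x              # the "signal": the true (unknown) function
e <- rnorm(n, mean = 0, sd = 25000)    # the "noise": everything f(x) can't explain

y <- f_of_x + e                        # the observed outcome, e.g. price

# Modeling is the attempt to recover f() from (x, y) alone
fit <- lm(y ~ x)
coef(fit)  # estimates land near the true intercept (50000) and slope (150)
```

In real modeling we never get to see f_of_x or e directly — only y — which is exactly the predicament the rest of this post describes.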
A couple of key takeaways here. What makes up y is our understanding of y as a function of x, with error sprinkled on top. Our effort to identify that function is what modeling is all about. If you’ve heard of signal & noise: the function is the signal, the indicator, and the error is the noise, the unexplained variation.
Applications of this framework
Let’s run through two purposes, or approaches, you can leverage in conjunction with this framework to guide your modeling process.
The modeling mechanics remain much the same for both; the core difference is the philosophy that guides your process.
Explanation or Prediction
Modeling for Explanation
When it comes to modeling for explanation, the driving force is that we’re trying to understand which variables potentially cause an outcome or relate to it.
With that as the preface, let’s jump into some exploratory data analysis that would kick off our process of modeling for explanation.
Exploratory Data Analysis
Let’s kick our process off with exploratory data analysis (or EDA, as we call it in the biz 😉 ). The intention of EDA is to precede the modeling process with a series of activities that help us better understand each of the independent and dependent variables we seek to model and, at a high level, how they may relate to one another.
I have pulled down a house prices dataset from Kaggle. You can find it here: https://www.kaggle.com/shree1992/housedata/data
Let’s take a look at the housing data.
We will typically kick this process off with a series of functions that give us a quick perspective into the data.
glimpse() or str()
Both glimpse() and str() will give you a view of the fields, their datatypes, the dimensions, and sample values.
We can then use head() to look at the first several rows of our dataset.
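As a sketch of what those first-look calls produce, here is a tiny stand-in data frame (the rows and values are invented, not the actual Kaggle data — in practice you’d load the downloaded CSV, e.g. with read.csv()):

```r
# Tiny stand-in for the Kaggle housing data (values invented for illustration);
# in practice: housing <- read.csv("path/to/downloaded/file.csv")
housing <- data.frame(
  price       = c(313000, 2384000, 342000, 420000),
  bedrooms    = c(3, 5, 3, 4),
  bathrooms   = c(1.5, 2.5, 2.0, 2.5),
  sqft_living = c(1340, 3650, 1930, 2100)
)

str(housing)   # dimensions, field names, datatypes, sample values
head(housing)  # first rows of the data

# dplyr::glimpse(housing) prints a similar transposed view, one field per line
```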
Let’s visualize the distributions
We’ll load up ggplot2 and make a histogram to look at a few of these variables:

```r
library(ggplot2)

ggplot(housing, aes(x = bathrooms)) +
  geom_histogram(binwidth = 1)
```
With geom_histogram we can see the count of homes that fall into each numeric bin. The greatest number of homes have 2.5 bathrooms.
Let’s do the same thing for a few more variables, for example’s sake.
Below you’ll see the distribution of square footage, which is slightly right-skewed.
Let’s visualize home price.
As you can see, it’s a bit right-skewed due to homes at some of those extreme prices.
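The plots themselves aren’t reproduced here, but the code to generate them follows the same pattern as the bathrooms histogram. As a self-contained sketch, the example below uses simulated lognormal draws as a stand-in for the real columns (lognormal data has the same right-skewed shape described above — these are not the Kaggle values):

```r
library(ggplot2)

# Simulated stand-in for the housing columns; lognormal draws reproduce
# the right skew described in the text (NOT the actual Kaggle data)
set.seed(2)
housing <- data.frame(
  sqft_living = rlnorm(1000, meanlog = 7.6, sdlog = 0.4),
  price       = rlnorm(1000, meanlog = 13.0, sdlog = 0.5)
)

# Right-skewed distribution of living area
ggplot(housing, aes(x = sqft_living)) + geom_histogram(bins = 30)

# Right-skewed distribution of price
ggplot(housing, aes(x = price)) + geom_histogram(bins = 30)
```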
Visualizing skewed data
One thing to keep in mind when a distribution is skewed: try visualizing the log10 of the variable instead.
The beauty of taking the log is that it preserves the order of values.
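A quick base-R check of that order-preserving property (the square-footage values here are arbitrary):

```r
# log10 is monotonic: larger inputs always map to larger outputs,
# so ranking homes by size is unchanged on the log scale
sqft <- c(800, 1500, 2200, 4100, 9000)
sqft_log <- log10(sqft)

identical(order(sqft), order(sqft_log))  # TRUE: same ordering either way
```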
Let’s take a quick look! We saw the square-foot-living distribution had some right skew, so let’s visualize that.
```r
library(dplyr)
library(ggplot2)

housing %>%
  mutate(sqft_living_log = log10(sqft_living)) %>%
  ggplot(aes(x = sqft_living_log)) +
  geom_histogram()
```
As we can see above, the log of square footage is approximately normally distributed. This makes it easier to compare the lower values with those along the extremes of the right tail.
Let’s visualize variable combinations
Let’s walk through a similar process now with multiple variables involved. This will give us an idea of how these variables relate to one another.
We’ll kick it off first with a scatter of sqft & price.
```r
housing %>%
  ggplot(aes(x = sqft_living, y = price)) +
  geom_point()
```
These two variables appear to have a roughly linear relationship, with a correlation of 0.43.
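The 0.43 figure comes from the real dataset; the statistic itself comes from cor(). A sketch on simulated stand-in data (the simulated correlation will differ from 0.43):

```r
# Simulated stand-in for the housing columns (NOT the Kaggle values)
set.seed(7)
sqft_living <- rlnorm(1000, meanlog = 7.6, sdlog = 0.4)
price <- 100 * sqft_living + rnorm(1000, sd = 60000)

# Pearson correlation between size and price
cor(sqft_living, price)
```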
Now that we know both of these variables are right-skewed, let’s visualize them again after converting them to their log10 selves.
```r
housing %>%
  mutate(sqft_living_log = log10(sqft_living),
         price_log = log10(price)) %>%
  ggplot(aes(x = sqft_living_log, y = price_log)) +
  geom_point()
```
Revisiting modeling for explanation
Now that we’ve gone through some EDA, let’s revisit the idea of modeling for explanation. The purpose is to understand which factors explain y — or, as it relates to the housing data, how variables (square footage, number of bathrooms, etc.) might explain price.
When modeling, something to keep in mind is that we aren’t going to understand how the error is generated; the purpose here is to derive the function by assessing the relationship between x & y. That function, as we called out earlier, is the signal, so modeling is the process we take to separate signal from noise.
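Fitting a linear model with lm() is one common way to make that assessment. A minimal sketch on simulated data — the coefficients and column mix are invented for illustration, not taken from the Kaggle set:

```r
# Simulated data where we know the true signal, for illustration only
set.seed(11)
sqft_living <- runif(500, 600, 4000)
bathrooms <- sample(seq(1, 4, by = 0.5), 500, replace = TRUE)
price <- 20000 + 120 * sqft_living + 15000 * bathrooms + rnorm(500, sd = 40000)

housing_sim <- data.frame(price, sqft_living, bathrooms)

# lm() estimates f(): the fitted coefficients approximate the signal
fit <- lm(price ~ sqft_living + bathrooms, data = housing_sim)
summary(fit)$coefficients

# residuals(fit) holds everything the model couldn't explain:
# our estimate of the noise term e
```

When modeling for explanation, those coefficient estimates (and their uncertainty) are the whole point: they describe how each variable relates to the outcome.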
Modeling for prediction
Once you have teased out an interpretation of the function of x, you can apply it to other data (another x) to generate predictions.
While the EDA we engage in will be largely the same, the intent will likely be different. Again, this difference is highlighted by the intent to generate predictions using the historic relationship between x & y.
When modeling for prediction, we still find ourselves in the predicament of not knowing the function or the error, which we still need to separate and understand.
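In R, applying a fitted function to new x values is what predict() does. A self-contained sketch on simulated data (all values invented for illustration):

```r
# A minimal prediction sketch on simulated data (all values illustrative)
set.seed(3)
sqft_living <- runif(300, 600, 4000)
price <- 20000 + 120 * sqft_living + rnorm(300, sd = 40000)
train <- data.frame(price, sqft_living)

fit <- lm(price ~ sqft_living, data = train)

# Apply the fitted function to new x to generate predictions for unseen homes
new_homes <- data.frame(sqft_living = c(1500, 2800))
predict(fit, newdata = new_homes)
```

In a prediction workflow you would then score these predictions against held-out actuals, since out-of-sample accuracy is the yardstick that matters.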
The bigger difference is that when modeling for explanation we care a lot about the form of our function, whereas when modeling for prediction we don’t; what we care about in that case is whether or not our predictions are accurate.
Hopefully this introduction to the general modeling framework sheds some light on how to think about modeling.
Happy data science-ing!