Regression Assumptions

May 12, 2019 11 min read R

Overview:

In the real world, you would normally start with a data set and attempt to find a model that best fits that problem; however, there is a lot of value in starting with a data set that you know will exactly fit a model and then adding complexity to see the effects of the assumptions made by that model. This will be our approach as we look at linear regression models, the assumptions they make about data, and the effects of those assumptions. Helpful models can absolutely be built outside of these assumptions, and almost always are, but it is important to see the isolated effects of different scenarios.

Problem:

Imagine that at the end of the year, you are reviewing your bank statement and looking at how much you spent on lunch every day. Assume that you also track your food intake and write down what you eat for each meal. From your bank statement you can see the total cost of each lunch, and from your food log you can see which items you ate each day, but you don’t know how much each item cost. Our goal is to build a linear regression model that will calculate how much each item cost. We will then add complexities to the data and see their effects. In this setting, a linear model makes these assumptions about the dataset:

The cost of each item doesn’t change day-to-day
Purchasing more of an item does not change the cost of the item
Purchasing an item doesn’t affect the cost of the other items
The total cost depends solely on the items purchased

We will show the effects of running a model on a data set that breaks each of these. This is not an exhaustive list of scenarios, but it will be our scope.

Dataset

Our dataset will represent what we ate for lunch each day. Let us assume that each day we purchased:

1 or 2 sandwiches
Up to 2 apples
The possibility of a bag of chips
Up to 3 cookies (because we can all dream).
$1 tip (a constant)

#Load knitr for kable() and magrittr for the %>% pipe used below
library(knitr)
library(magrittr)

#Set our seed to make sure our code is repeatable with consistent randomness
set.seed(101)

#Create our dataset
dataset<-as.data.frame(cbind(
  sandwich_count = sample(1:2, 100, replace=TRUE)
  , apple_count = sample(0:2, 100, replace=TRUE)
  , chip_count = sample(0:1, 100, replace=TRUE)
  , cookie_count = sample(0:3, 100, replace=TRUE)
))

#Peek at our dataset to make sure everything looks good
kable(head(dataset))

sandwich_count	apple_count	chip_count	cookie_count
1	1	1	1
1	0	0	3
2	1	1	2
1	2	0	2
1	0	0	0
1	2	1	2

Now that we have our dataset, let us calculate the total cost for each day. Here are the menu prices:

Sandwich: $4
Apple: $1.50
Chips: $2.50
Cookie: $.50

Before you start wondering how we know the costs of each item (which is what we are trying to find out), we are just calculating the total cost for each lunch which would have come from our bank statement. We will only show the model the count of each item and the total cost.

#create a total column and calculate total based on cost
dataset$total<-(
  dataset$sandwich_count*4 
 + dataset$apple_count * 1.5
 + dataset$chip_count * 2.5
 + dataset$cookie_count * .5
 + 1
)

#Peek at our dataset to make sure everything looks good
kable(head(dataset))

sandwich_count	apple_count	chip_count	cookie_count	total
1	1	1	1	9.5
1	0	0	3	6.5
2	1	1	2	14.0
1	2	0	2	9.0
1	0	0	0	5.0
1	2	1	2	11.5

Build the model

In our context, the linear model will look like:

Total = (tip) + (cost of item A)x(number of item A) + (cost of item B)x(number of item B)…

And the cost of each item will be the coefficients that the model is going to find.

#Build the model
lunch_model<-lm(total~ sandwich_count + apple_count + chip_count + cookie_count, data = dataset)

#Look at the coefficients (Intercept will be the amount of our tip)
kable(lunch_model$coefficients, col.names = "Cost")

	Cost
(Intercept)	1.0
sandwich_count	4.0
apple_count	1.5
chip_count	2.5
cookie_count	0.5

The coefficients all match the cost of each item exactly. This is not at all a surprise since we built our dataset within the assumptions of linear models. As long as we operate precisely within all the assumptions, the model will be exact.

Let us run a quick prediction to make sure that everything looks good. We will have 1 sandwich and 2 apples, which with the tip should total $8:
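One way to confirm that the fit is exact, rather than merely close after rounding, is to check the residuals directly. The following is a minimal sketch that rebuilds the same kind of dataset; the names `lunches`, `fit`, `max_abs_residual`, and `coef_matches` are illustrative and not part of the original code.

```r
# Rebuild a lunch dataset that satisfies every linear-model assumption
set.seed(101)
lunches <- data.frame(
  sandwich_count = sample(1:2, 100, replace = TRUE),
  apple_count    = sample(0:2, 100, replace = TRUE),
  chip_count     = sample(0:1, 100, replace = TRUE),
  cookie_count   = sample(0:3, 100, replace = TRUE)
)

# Total = $1 tip plus fixed menu prices times the item counts
lunches$total <- with(
  lunches,
  1 + 4 * sandwich_count + 1.5 * apple_count +
    2.5 * chip_count + 0.5 * cookie_count
)

# Fit the same model and measure how far the fit is from the data
fit <- lm(
  total ~ sandwich_count + apple_count + chip_count + cookie_count,
  data = lunches
)
max_abs_residual <- max(abs(residuals(fit)))

# Compare the coefficients with the known tip and menu prices
coef_matches <- isTRUE(all.equal(unname(coef(fit)), c(1, 4, 1.5, 2.5, 0.5)))

print(max_abs_residual)
print(coef_matches)
```

The largest residual should be zero up to floating-point noise, which is what "exact" means here.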

dataset_test<-as.data.frame(cbind(
  sandwich_count = 1
  , apple_count = 2
  , chip_count = 0
  , cookie_count = 0
))

predict(lunch_model, dataset_test)%>%
  as.numeric()

## [1] 8

Now, let us change the dataset to incorporate more realistic scenarios that operate outside of these assumptions to see how it affects the model.

1. Item Cost Increase

The first assumption we will break is that the cost of each item doesn’t change day-to-day. Let us assume that halfway through the year, our lunch shop increased the price of their cookie from $.50 to $1. Beyond the tragedy of costing more, this change pushes us outside of the realm of the assumptions made by a linear model that the price of each item is constant. Again, this doesn’t mean that we can’t use the model, only that it won’t fit perfectly as before. So let us examine the isolated effects of this change by looking at the coefficients if we alternate between not buying anything and only buying 1 cookie (and no tip).

#dataset alternating between 0 and 1 cookies
dataset$cookie_count <-rep(0:1, 50)

#Update total to show change to price
dataset$total_pricechange<-(dataset$cookie_count * c(rep(.5, 50), rep(1, 50)))

#Build the model
lunch_model_pricechange<-lm(total_pricechange ~ cookie_count, data = dataset)

#Look at the coefficients (there is no tip here, so the Intercept should be 0)
kable(round(lunch_model_pricechange$coefficients, 2), col.names = "Cost")

	Cost
(Intercept)	0.00
cookie_count	0.75

We can see that the model estimated the cost of a cookie as halfway between $.5 and $1. It is not a coincidence that the cost of the cookie changed halfway through the year and the coefficient is halfway between the old cost and the new cost. If it had changed later in the year, the cost would be closer to the original cost as we would have paid that price more often.

We can do a quick check where it changed 3/4ths of the way through the year:

	Cost
(Intercept)	0.000
cookie_count	0.630
Exact weighted average:	0.625
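The code for this check is not shown above, so here is a minimal sketch of how it could be reproduced, assuming the same alternating 0/1 cookie pattern with the price change moved to day 76. The names `price`, `fit_late`, and `exact_weighted` are illustrative.

```r
# Alternate between 0 and 1 cookies over 100 lunches
cookie_count <- rep(0:1, 50)

# Assumption: the price rises from $.50 to $1 after the first 75 lunches
price <- c(rep(0.5, 75), rep(1, 25))
total <- cookie_count * price

# Fit the cookie-only model
fit_late <- lm(total ~ cookie_count)
print(round(coef(fit_late), 3))

# Weighted average if exactly 3/4 of the cookies were bought at the old price
exact_weighted <- 0.75 * 0.5 + 0.25 * 1
print(exact_weighted)
```

The small gap between 0.630 and 0.625 comes from the alternating pattern: 37 of the 50 cookies fall before the change and 13 after, so the coefficient is (37 * 0.5 + 13 * 1) / 50 = 0.63.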

Let us see how the price change halfway through the year affects our full model, knowing that this situation is not perfectly modeled by linear regression:

dataset$total_pricechange<-(
  dataset$sandwich_count*4 
 + dataset$apple_count * 1.5
 + dataset$chip_count * 2.5
 + dataset$cookie_count * c(rep(.5, 50), rep(1, 50))
 +  1
)

lunch_model_pricechange<-lm(total_pricechange~ sandwich_count + apple_count + chip_count + cookie_count, data = dataset)

kable(round(lunch_model_pricechange$coefficients, 2), col.names = "Cost")

	Cost
(Intercept)	0.96
sandwich_count	4.03
apple_count	1.48
chip_count	2.54
cookie_count	0.74

The cost estimates for each item are very close, but they are slightly off. Now that the cookie no longer has a single constant price, no set of linear coefficients fits every day exactly, so the model shifts some of the discrepancy onto the other items and the intercept as it minimizes the squared error. The cookie coefficient is about $.75, in line with what we saw before.

In order to check our model, let us look at our previous lunch prediction. It didn’t include a cookie, so it shouldn’t be directly affected by the price increase (remember that it cost $8).

predict(lunch_model_pricechange, dataset_test)%>%
  as.numeric()%>%
  round(2)

## [1] 7.95

It comes out to $7.95, close to the true $8, so the model still works reasonably well in that situation. Now let us see what the model would show if we had a cookie. At the original price that lunch would have been $8.50, and at the new price it would be $9.

dataset_test_wcookie<-dataset_test
dataset_test_wcookie$cookie_count<-1
predict(lunch_model_pricechange, dataset_test_wcookie)%>%
  as.numeric()%>%
  round(2)

## [1] 8.69

As we would expect, the prediction lands near the midpoint of the two prices ($8.75), a few cents low because the other coefficients shifted slightly as well. Let us take a minute and consider the consequences of this alteration. Assuming this is a predictive model, for each cookie we purchase in the future at the new $1 price, our predictions will be under by roughly $.25. In this simple example, that might not seem like a big deal, but as the complexity and cost of errors increases, this could have a major effect.

There are a few major take-aways from this change. The first is that the number of cookies we bought at each price will directly affect the coefficient, and ultimately the prediction. The second is that if this is a permanent price change (as opposed to a periodic promotional price), our predictions will be continually off and the model will need a retrain to be more precise. This is where it helps to know the shop well enough to remember that they changed their prices (domain expertise) or to notice that the model’s error shifted at a specific point in time (data exploration).

As we can see from the chart below, our predictions for cookie lunches before the change are mostly over and the predictions after it are mostly under, with a clear distinction between the two prices.
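The split can also be inspected numerically by comparing residuals (actual total minus fitted total) on cookie days before and after the change. This is a rough sketch under the same setup as the full model above; `lunches`, `after_change`, `cookie_days`, and `mean_resid` are illustrative names, not part of the original code.

```r
# Rebuild the lunches with alternating 0/1 cookies and a mid-year price change
set.seed(101)
lunches <- data.frame(
  sandwich_count = sample(1:2, 100, replace = TRUE),
  apple_count    = sample(0:2, 100, replace = TRUE),
  chip_count     = sample(0:1, 100, replace = TRUE),
  cookie_count   = rep(0:1, 50)
)
lunches$after_change <- rep(c(FALSE, TRUE), each = 50)
cookie_price <- ifelse(lunches$after_change, 1, 0.5)

lunches$total <- with(
  lunches,
  1 + 4 * sandwich_count + 1.5 * apple_count +
    2.5 * chip_count + cookie_price * cookie_count
)

# The model is not told about the price change
fit <- lm(
  total ~ sandwich_count + apple_count + chip_count + cookie_count,
  data = lunches
)
lunches$residual <- residuals(fit)

# Average residual on cookie days, before (FALSE) and after (TRUE) the change
cookie_days <- lunches[lunches$cookie_count == 1, ]
mean_resid <- tapply(cookie_days$residual, cookie_days$after_change, mean)
print(round(mean_resid, 3))
```

A negative average residual means the model predicted more than was actually paid, and a positive one means it predicted less. A sign flip at a specific date is the kind of pattern that points to a price change rather than random noise.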

2. Buy 2 And Save…

Let us assume that instead of changing the price of cookies part way through the year, cookies had a variable price, depending on how many you purchased. 1 cookie costs $1 but 2 cookies cost $.75 each. In our imaginary life where we eat cookies on a regular basis, let us assume that we alternate between 1 and 2 cookies repeatedly.

#Alternate between 1 and 2 cookies (50 repeats to fill all 100 rows)
dataset$cookie_count <-rep(1:2, 50)

#1 cookie costs $1 but 2 or more cookies cost $.75 each
dataset$total_buy2<-(
dataset$cookie_count * ifelse(dataset$cookie_count== 1, 1, .75)
)

#Build the cookie model
lunch_model_buy2<-lm(total_buy2 ~ cookie_count, data = dataset)

#Look at the model
kable(lunch_model_buy2$coefficients, col.names = "Cost")

	Cost
(Intercept)	0.5
cookie_count	0.5

Interestingly, the model concluded that each cookie cost $.50 and that we gave a $.50 tip. If we chart this out, it makes sense why. The difference in total cost between 1 cookie ($1) and 2 cookies ($1.50) is $.50, so the model took this to be the cost of a single cookie and counted the remaining $.50 as tip.

Before we go back to the full lunch, let us look at increasing the number of cookies we buy each day. We will cycle from 1 to 4 cookies across consecutive lunches (just trying to make it through the week).
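A short sketch of the same idea in code: with only two distinct cookie counts, the fitted line passes exactly through both observed totals, and the mismatch only shows up when we predict outside that range. The names `new_counts`, `predicted`, and `actual` are illustrative.

```r
# Alternate between 1 and 2 cookies; 1 costs $1, 2 or more cost $.75 each
cookie_count <- rep(1:2, 50)
total <- cookie_count * ifelse(cookie_count == 1, 1, 0.75)
fit <- lm(total ~ cookie_count)

# Compare predictions with the true bill for 0 to 3 cookies
new_counts <- data.frame(cookie_count = 0:3)
predicted <- unname(predict(fit, new_counts))
actual <- new_counts$cookie_count * ifelse(new_counts$cookie_count == 1, 1, 0.75)

print(data.frame(new_counts, actual, predicted = round(predicted, 2)))
```

The predictions for 1 and 2 cookies match the true totals ($1 and $1.50), while 0 cookies is predicted at $.50 instead of $0 and 3 cookies at $2 instead of $2.25.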

#Cycle from 1 to 4 cookies
dataset$cookie_count <-rep(1:4, 25)

#1 cookie costs $1 but 2 or more cookies cost $.75 each
dataset$total_buy2_2<-(
dataset$cookie_count * ifelse(dataset$cookie_count== 1, 1, .75)
)

#Build the cookie model
lunch_model_buy2_2<-lm(total_buy2_2 ~ cookie_count, data = dataset)

#Look at the model
kable(lunch_model_buy2_2$coefficients, col.names = "Cost")

	Cost
(Intercept)	0.250
cookie_count	0.675

The model starts to spread out the error across all of the scenarios in order to split the difference based on the number of cookies at each price. One important thing to note is that the cookie coefficient is actually lower than the cost of the cookie, even when buying more than 2. The remaining cost of the cookie was bundled proportionally into the constant (the tip) based on how many cookies we bought at each price.

Let us take what we just saw from the previous coefficients and try to imagine what we should expect from the full lunch model. We know that the tip will probably be about $.25 more than the $1 we actually put in, and the cookie will be less than even the $.75 we pay for 2 or more cookies.
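For a single predictor, the least-squares slope is the covariance of the predictor and the response divided by the variance of the predictor, and the intercept follows from the means. The sketch below recomputes the two coefficients that way, as a cross-check on the `lm` output; `slope` and `intercept` are illustrative names.

```r
# Cycle from 1 to 4 cookies; 1 costs $1, 2 or more cost $.75 each
cookie_count <- rep(1:4, 25)
total <- cookie_count * ifelse(cookie_count == 1, 1, 0.75)

# Closed-form simple linear regression
slope <- cov(cookie_count, total) / var(cookie_count)
intercept <- mean(total) - slope * mean(cookie_count)

print(c(intercept = intercept, slope = slope))
```

This gives an intercept of 0.25 and a slope of 0.675, matching the table above. The slope is pulled below $.75 because the single full-price cookie sits above the discounted line at the low end, which flattens the fit and pushes the difference into the intercept.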

dataset$total_buy2_full_lunch<-(
  dataset$sandwich_count*4 
 + dataset$apple_count * 1.5
 + dataset$chip_count * 2.5
 + dataset$cookie_count * ifelse(dataset$cookie_count== 1, 1, .75)
 + 1
)

#Build the full lunch model
lunch_model_buy2_full_lunch<-lm(total_buy2_full_lunch ~ sandwich_count + apple_count + chip_count + cookie_count, data = dataset)

#Look at the model
kable(round(lunch_model_buy2_full_lunch$coefficients, 2), col.names = "Cost")

	Cost
(Intercept)	1.28
sandwich_count	3.99
apple_count	1.50
chip_count	2.49
cookie_count	0.68

And that is what we see: the intercept comes out at $1.28 and the cookie coefficient at $.68.

So what are the implications of this and how is this different from the previous scenario? When there was a discrete price change, as in our previous section, the direction of the error depended on whether the price went up or down. Also, the size of the error was proportional to how much the cookie changed in price and how many times we purchased the cookie at each price. In fact, each of those things still holds true. The real difference in this type of situation is that it didn’t just affect the cookie coefficient, as in the first situation, but it also affected the constant. This means that our predictions are affected regardless of whether we actually purchase a cookie. It also means that a prediction could be either under or over the true cost, depending on the number of cookies we purchase.
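To make the over-or-under point concrete, here is a small sketch using the cookie-only model from this section (cycling from 1 to 4 cookies). It compares the predicted and actual totals for each cookie count; the `check` data frame and its columns are illustrative.

```r
# Cookie-only model: cycle from 1 to 4 cookies with the buy-2 discount
cookie_count <- rep(1:4, 25)
total <- cookie_count * ifelse(cookie_count == 1, 1, 0.75)
fit <- lm(total ~ cookie_count)

# Predicted minus actual for each possible cookie count
check <- data.frame(cookie_count = 1:4)
check$actual <- check$cookie_count * ifelse(check$cookie_count == 1, 1, 0.75)
check$predicted <- unname(predict(fit, check))
check$error <- check$predicted - check$actual

print(round(check, 3))
```

The model is under by $.075 for 1 cookie, over by $.10 for 2, over by $.025 for 3, and under by $.05 for 4, so the sign of the error depends on how many cookies are in the order.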

Note: it should be pointed out that if we had presented the model some lunches where we didn’t buy any cookies, it wouldn’t have had such a drastic impact, but for demonstrating the effects of these scenarios, these were omitted.
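As a rough illustration of that note, the sketch below refits the cookie-only model with and without cookie-free lunches. The one-in-five share of cookie-free lunches is an assumption made for this example, and `bill`, `without_zero`, and `with_zero` are illustrative names.

```r
# True bill: 1 cookie costs $1, 2 or more cost $.75 each, 0 cookies cost $0
bill <- function(n) n * ifelse(n == 1, 1, 0.75)

# Original setup: every lunch has 1 to 4 cookies
without_zero <- data.frame(cookie_count = rep(1:4, 25))
without_zero$total <- bill(without_zero$cookie_count)

# Assumption: one lunch in five has no cookie at all
with_zero <- data.frame(cookie_count = rep(0:4, 20))
with_zero$total <- bill(with_zero$cookie_count)

coef_without <- coef(lm(total ~ cookie_count, data = without_zero))
coef_with <- coef(lm(total ~ cookie_count, data = with_zero))

print(round(rbind(without_zero = coef_without, with_zero = coef_with), 3))
```

With the cookie-free lunches included, the intercept drops from 0.25 to 0.10 and the cookie coefficient rises from 0.675 to 0.725, so less of the cookie cost leaks into the constant.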

Additional Sections to be completed.
