The Linear Model 3: Multiple predictors

In the previous tutorial, we practised fitting a linear model with a single predictor to see if we can say anything about the length of a penguin’s flipper if we know something about its body mass. This time, we will build more sophisticated models with more than one predictor. To do that, we’ll make use of the understanding of multiple-predictor linear models that you gained in lecture 9.

Warm-up quiz

Before we start, let’s warm up by answering a few quiz questions!

As you know, the general equation for the linear model with n predictors is

\[y_i = b_0 + b_1\times x_{1_i} + b_2\times x_{2_i} + b_3\times x_{3_i} + \dots + b_{n-1}\times x_{n-1_i} + b_n\times x_{n_i} + e_i\]

QUESTION 1

How many slopes will there be in a model with 4 predictors?

Correct!That’s not right…

QUESTION 2

How many intercepts will there be in a model with 3 predictors?

Correct!That’s not right…

QUESTION 3

In general terms, how many b coefficients will there be in a model with n predictors? (do not include white spaces)

n + 1

Correct!That’s not right…

QUESTION 4

What is the geometrical representation of the linear model with 2 predictors?

2D plane in a 3D space

Correct!That’s not right…

QUESTION 5

What does each slope represent?

Relationship between a given predictor and the outcome after accounting for all other predictors

Correct!That’s not right…

Setting Up

As usual, all you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Task 1

Create a new week_10 project and open it in RStudio. Then, create two new folders in the new week_10 folder: r_docs and data. Finally, open a new R Markdown file and save it in the r_docs folder. Get into the habit of creating new code chunk in the file for each of the tasks below.

Remember, you can add new code chunks by:

Using the RStudio toolbar: Click Code > Insert Chunk
Using a keyboard shortcut: the default is Ctrl + Alt + I (Windows) or ⌘ Command + Alt + I (MacOS), but you can change this under Tools > Modify Keyboard Shortcuts…
Typing it out: ```{r}, press Enter, then ``` again.
Copy and pasting a code chunk you already have (but be careful of duplicated chunk names!)

Task 2

In this tutorial, just like in the previous one, we’ll continue using the penguins data set from the palmerpenguins package. You will also need the packages tidyverse, broom, and parameters (install it if you don’t have it yet; watch out for spelling and upper/lower case letters!). Put the code that loads all of these in the setup chunk of your Rmd file and run it.

library(palmerpenguins)
library(tidyverse)
library(broom)
library(parameters)

Data

Task 3

Just like last time, save the penguins data set in your environment as peng_data to avoid having to type out palmerpenguins::penguins every time you want to use it.

peng_data <- palmerpenguins::penguins

Task 4

Remind yourself of what’s in the data by looking at a rough summary.

summary(peng_data)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

Single-predictor model

Last time we looked at how we can use information about a penguin’s body mass to make a more accurate prediction about the length of its flippers than simply guessing the mean flipper length for any given specimen.

Task 5

Let’s take a look at this model again.

Task 5.1

Fit the model to the data using the lm(), saving it’s output as m_mass

m_mass <- lm(flipper_length_mm ~ body_mass_g, peng_data)

# alternatively
m_mass <- peng_data %>%
  lm(formula = flipper_length_mm ~ body_mass_g, data = .)

Task 5.2

Use broom::tidy() to get a neat tibble of model summary statistics for this model. Save it as m_mass_summary

m_mass_summary <- m_mass %>%
  broom::tidy()

Task 5.3

Inspect the results and answer the following questions.

QUESTION 6

What kind of relationship between body mass and flipper length does the model describe

Positive

Correct!That’s not right…

QUESTION 7

By how many millimetres is flipper length predicted to increase for every 10 gramme increase in a penguin’s body mass? (give answer to 2 decimal places)

0.15

Correct!That’s not right…

QUESTION 8

Is this a statistically significant relationship (assuming an α = .05)?

Yes

Correct!That’s not right…

Task 5.4

Use broom::glance() to get the model fit statistics for m_mass and answer the quiz question.

QUESTION 9

What proportion of total variance in flipper length is accounted for by body mass? (give answer as % to 1 decimal place, such as 77.2)

75.9

Correct!That’s not right…

First, let’s have a look at the output of broom::glance().

m_mass %>% broom::glance()

# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl>
1     0.759         0.758  6.91     1071. 4.37e-107     1 -1146. 2297.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>

The value we are looking for is in the first column, r.squared. All we need to do is to convert it to % and round to one decimal place:

m_mass %>%
  broom::glance() %>%
  dplyr::mutate(rsq_perc = r.squared * 100) %>%
  dplyr::pull(rsq_perc) %>%
  round(1)

[1] 75.9

Now that you’ve refreshed your memory of the single-predictor model we built last time, let’s extend it by adding another predictor.

Multiple-predictor model

Imagine we’re also interested in whether or not bill length is related to flipper length. We can extend out m_mass model to explore this research question.

Before we do that, let’s look at what a simple model of flipper length as predicted by bill length tells us.

Task 6

Fit this model, saving it as m_bill, inspect the results and answer the following questions.

QUESTION 10

What kind of relationship between bill length and flipper length does the model describe

Positive

Correct!That’s not right…

QUESTION 11

By how many millimetres is flipper length predicted to increase for every 1 mm increase in a penguin’s bill length? (give answer to 2 decimal places)

1.69

Correct!That’s not right…

QUESTION 12

Is this a statistically significant relationship (assuming an α = .05)?

Yes

Correct!That’s not right…

QUESTION 13

What proportion of total variance in flipper length is accounted for by bill length? (give answer as % to 1 decimal place, such as 77.2)

43.1

Correct!That’s not right…

First, let’s fit the model and save the summary using broom::tidy()

m_bill <- lm(flipper_length_mm ~ bill_length_mm, peng_data)
m_bill_smry <- m_bill %>% broom::tidy()

Let’s have a look:

m_bill_smry

# A tibble: 2 x 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      127.       4.67       27.2 3.65e-87
2 bill_length_mm     1.69     0.105      16.0 1.74e-43

The coefficient for bill_length_mm, in other words, the slope for the variable (\(b_1\)), is positive. Given that both our predictor and outcome represent length in millimetres, and so a larger number represents more of the quantity that’s being measured, this is a positive relationship.

Both variables in the model are measured in mm and so the coefficient is interpreted as an increase in flipper length of 1.69 mm per every 1 mm increase in bill length.

The p-value is tiny (less than 2 × 10⁻¹⁶), and so \(b_1\) is statistically significantly different from zero, given the \(\alpha\) = .05 significance level.

Finally, the proportion of variance explained is:

m_bill %>%
  broom::glance() %>%
  dplyr::mutate(rsq_perc = r.squared * 100) %>%
  dplyr::pull(rsq_perc) %>%
  round(2)

[1] 43.06

OK, so we know that bill length is a significant predictor or flipper length. However, it could well be the case, that both of these variables are just reflections of body mass and, if we take body mass into account, the relationship between bill and flipper length disappears. In other words, it’s possible that larger penguins have both longer bills and longer flippers but once you only consider birds of a certain body mass, the ones with longer bills do not have longer flippers. Putting both of the predictors in the model allows us to see whether or not this is actually the case.

A careful inspection of the R² values of our two models reveals that the relationship is not going to be completely straightforward. The first model accounts for 75.9% of the variance in the outcome, while the second model accounts for 43.1%. That means, that the model that combines the two predictors cannot simply be the same as the two models added together. The reason for this is that the total amount of variance in flipper length is, well, 100% and not 75.9% + 43.1% = 119%.

In other words, some of the variance accounted for by the two predictors will be shared between them.

Let’s do it then!

Task 7

Let’s fit and interpret a two-predictor model of flipper length one step at a time.

Task 7.1

First of all, let’s fit the model using lm() and save it as m_mass_bill.

Check out lecture 9 if you can’t remember how to add multiple predictors into the formula.

m_mass_bill <- lm(flipper_length_mm ~ body_mass_g + bill_length_mm, peng_data)

Task 7.2

Get the summary of the model and have a look at it. Are the model coefficients the same as the ones from m_mass and m_bill? If not, how are they different?

m_mass_bill_smry <- m_mass_bill %>% broom::tidy()
m_mass_bill_smry

# A tibble: 3 x 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    122.      2.86         42.7  1.70e-138
2 body_mass_g      0.0131  0.000545     23.9  7.56e- 75
3 bill_length_mm   0.549   0.0801        6.86 3.31e- 11

As you can see, both “effects”¹ have reduced in size. They are still positive and statistically significant but the effect of body mass has shrunk from 0.015 mm per each gramme to 0.013 mm/g, while the effect of bill length has gone down from 1.69 mm increase in flipper length per each additional mm of bill to just 0.55 mm per mm. This has to do with the above-mentioned fact that some of the explained variance in the outcome is shared by the two predictors.

Remember that the interpretation of the slopes in a multi-predictor model is different from that in a single-predictor model.

In our particular model, the \(b\) coefficient for body mass represents the increase in flipper length given a certain fixed bill length. Conversely, the \(b\) coefficient for body bill length represents the increase in flipper length given a certain fixed body mass.

Task 8

Given the coefficients of our model – 121.956, 0.013, and 0.549 – give the predicted value of flipper length for the following values of body mass and bill length. (2 decimal places)

Do you dare use R to do the numbers for you? 😛

QUESTION 14

Mass: 4,200 g
Bill length: 45 mm

201.48 | 201.26

Correct!That’s not right…

QUESTION 15

Mass: 4,200 g
Bill length: 30 mm

193.25 | 193.03

Correct!That’s not right…

QUESTION 16

Mass: 3,800 g
Bill length: 45 mm

196.26 | 196.06

Correct!That’s not right…

QUESTION 17

Mass: 3,800 g
Bill length: 30 mm

188.03 | 187.83

Correct!That’s not right…

All you need to do to answer these questions is to plug the right numbers into the formula and do the maths. For instance, the answer to the first question is:

\[\begin{align}\hat{y}&=b_0 + b_1\times x_1 + b_2\times x_2\\&=b_0 + b_1\times\text{body_mass} + b_2\times\text{bill_length}\\&\approx121.956 + 0.013\times\text{body_mass} + 0.549\times\text{bill_length}\\&\approx121.956 + 0.013\times4,200 + 0.549\times45\\&\approx201.26\end{align}\]

A good thing to notice is the relationships between the answers to each of the questions and the \(b\) coefficients.

The values of body mass in questions 14 and 15 are the same and the values of bill length differ by 15 mm. That means that the difference between the results is going to be 15 × 0.549 = 8.24. That’s going to be the same as the difference between answers to questions 16 and 17 because the difference in the respective values of bill length in these two questions are also 15 mm, while the values of body mass are, again, the same.

Questions 14 and 16 on the one hand and 15 and 17 on the other have the same values of bill length but, in both cases, they differ in body mass by 400 g. That means that the differences between both pairs of results is going to be 400 × 0.013 = 5.2.

Finally, the differences between questions 14 and 17 and 15 and 16, respectively, is going to be the sum of the previous two differences: 400 × 0.013 + 15 × 0.549 = 13.44.

Calculating predicted values in `R`

Of course, we can tell the computer to calculate the predicted values for us!

There are multiple ways of doing this and there are even functions written exactly for this purpose, such as predict() or broom::augment(). But, for now, let’s use what we already know.

If you look at the output of our model using broom::tidy() (which we saved as m_mass_bill_smry), you’ll find the coefficients there, in the estimate column:

m_mass_bill_smry

# A tibble: 3 x 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    122.      2.86         42.7  1.70e-138
2 body_mass_g      0.0131  0.000545     23.9  7.56e- 75
3 bill_length_mm   0.549   0.0801        6.86 3.31e- 11

We can use this column to calculate the predicted value for any set of predictor values. All we need is a column of values to multiply the estimate column by and then add the products together:

	b	x	b * x
	121.956	1	121.956
	0.013	4200	54.813
	0.549	45	24.715
sum			201.484

To do this in R you only need to come up with an algorithm. An OK one could look something like this:

Add a column, let’s call it x, with the values of predictors. Since we’re multiplying two columns, the predictor column also needs a 1 to multiply the intercept by.
Add another column that is the product of the multiplication of the estimate and x columns. Let’s call it bx.
Pull the bx column from the tibble.
Add up all its elements.
Round the result to two decimal places.

All that’s left is to translate these steps into R using basic data wrangling functions:

m_mass_bill_smry %>%
  dplyr::mutate(x = c(1, 4200, 45), # step 1
                bx = estimate * x) %>% # step 2
  dplyr::pull(bx) %>% # step 3
  sum() %>% # step 4
  round(2) # step 5

[1] 201.48

And there’s your answer… It differs from the result of the calculation in the Solution box (201.26), because here we are not rounding our coefficients to 3 decimal places. Not doing so makes the answer more precise.

With a little thinking, you could even write code that calculates the answer to all four questions in one go!

Why don’t you give it a try as an extra task? Using the predict() function will make it very easy.

Task 9

Look at the model fit statistic, R² in particular.

QUESTION 18

What proportion of variance in the outcome does the model account for? (Give answer as % to 1 decimal place)

78.8

Correct!That’s not right…

m_mass_bill %>% broom::glance()

# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl>
1     0.788         0.787  6.49      631. 4.88e-115     2 -1123. 2255.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>

Notice that R² of this model is far smaller than the R² of the two one-predictor models added together. In fact, it’s not that much larger than that of the body mass only model: While that model accounted for 75.9% of the total variance in flipper length, this one accounts for 78.8%. Once again, this is because there is a substantial overlap (correlation) between body mass and bill length.

Transforming variables in the model

When we were reporting our body mass model in the last tutorial, we changed the scale of the predictor from grammes to kilogrammes. This is by no means a problematic thing to do. After all, given the mean and SD of the body_mass_g variable (4201.75 and 801.95, respectively), it is not really reasonable to expect every 1 g increase in body mass to have a substantial effect on flipper length and so working with body mass in kg is fine.

Task 10

Rescale the body mass variable, refit the model and inspect the results to see how the coefficients change.

Creating a mass_kg variable in the peng_data shouldn’t be difficult as it’s just a conversion of grammes to kilogrammes…

# first, create the re-scaled variable
peng_data <- peng_data %>%
  dplyr::mutate(mass_kg = body_mass_g / 1000)
# re-fti the model
m_massKg_bill <- lm(flipper_length_mm ~ mass_kg + bill_length_mm, peng_data)
# see resutls
m_massKg_bill_smry <- m_massKg_bill %>% broom::tidy()
m_massKg_bill_smry

# A tibble: 3 x 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     122.       2.86       42.7  1.70e-138
2 mass_kg          13.1      0.545      23.9  7.56e- 75
3 bill_length_mm    0.549    0.0801      6.86 3.31e- 11

You should be able to see that both the intercept and the slope for bill length stayed the same as in the original model but the slope for mass (kg) scaled up by 1,000, compared to the one in the original model. Other than that, it’s the same number.

So as you can see, scaling the predictors scales the coefficients accordingly. If we wanted, we could scale flipper or bill length to some other units (such as centimetres) but, in this case, millimetres work well.

What doesn’t work as well, however, is the intercept. Not that there’s something wrong with it but, in the current model, it’s not very meaningful. After all, the flipper length of a hypothetical massless bill-less penguin is not that interesting. If may be more interesting to know what the expected flipper length is for a penguin of average size and average bill length.

Task 11

Re-fit the model with the predictors centred around their respective means.

To mean-centre a variable, you can either subtract its mean from it, or use the scale() function with the right combination of its arguments. Check out the documentation for the function (?scale) to find out what arguments the function accepts.

peng_data <- peng_data %>%
  mutate(
    mass_kg_cntr = mass_kg - mean(mass_kg, na.rm = T), # using mean
    bill_length_cntr = scale(bill_length_mm, scale = FALSE)) # using scale

m_massKg_bill_cntr <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, peng_data)
# see resutls
m_massKg_bill_cntr_smry <- m_massKg_bill_cntr %>% broom::tidy()
m_massKg_bill_cntr_smry

# A tibble: 3 x 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       201.       0.351     573.   0       
2 mass_kg_cntr       13.1      0.545      23.9  7.56e-75
3 bill_length_cntr    0.549    0.0801      6.86 3.31e-11

QUESTION 19

So… What is the expected flipper length (in mm) for a penguin of average size and average bill length? (2 decimal places)

200.92

Correct!That’s not right…

Standardised B

The centred model we just made is pretty informative. However, it’s not immediately obvious which of our two predictors has a stronger effect on the outcome. That’s because the two predictors are measured on incomparable scales (kg vs mm). This is where standardised coefficients (Bs) come in handy: because they are both on the scale of standard deviations, they allow us to compare effects of very different variables.

Task 12

Use parameters::model_parameters() with the standardize = "refit" argument to get B coefficients for the predictors in our model.

m_massKg_bill_cntr %>% parameters::model_parameters(standardize = "refit")

Parameter        | Coefficient |   SE |        95% CI |    t(339) |      p
--------------------------------------------------------------------------
(Intercept)      |   -8.96e-16 | 0.02 | [-0.05, 0.05] | -3.59e-14 | > .999
mass kg cntr     |        0.74 | 0.03 | [ 0.68, 0.81] |     23.94 | < .001
bill length cntr |        0.21 | 0.03 | [ 0.15, 0.27] |      6.86 | < .001

QUESTION 20

This of the predictors has the strongest effect on the outcome?

Body mass

Correct!That’s not right…

Since one SD increase in body mass is associated with a 0.74 SD increase in flipper length, while one SD increase in bill length is associated with only a 0.21 SD increase in flipper length, body mass has the larger effect on the outcome.

Reporting

Task 13

Report the findings of the model. At the very least, you should report R² and the individual model coefficients along with their test of significance.

Check out tutorial 9 for inspiration.

The model accounted for a significant proportion of variance in flipper length, \(R^2 = .79\), \(F(2, 339) = 631.39\), \(p < .001\). Body mass was positively related to flipper length, with flipper length increasing by 13.05 millimetres with each one kilogramme increase in body mass, 95% CI \([11.98\), \(14.12]\), \(t(339) = 23.94\), \(p < .001\). Bill length was also significantly related to the outcome, \(b = 0.55\), 95% CI \([0.39\), \(0.71]\), \(t(339) = 6.86\), \(p < .001\).

That’s all for today. Good job!

Here, effect is a technical term and does not imply a causal link between the predictor and the outcome!↩︎

Contents

Warm-up quiz

QUESTION 1

QUESTION 2

QUESTION 3

QUESTION 4

QUESTION 5

Setting Up

Task 1

Task 2

Data

Task 3

Task 4

Single-predictor model

Task 5

Task 5.1

Task 5.2

Task 5.3

QUESTION 6

QUESTION 7

QUESTION 8

Task 5.4

QUESTION 9

Multiple-predictor model

Task 6

QUESTION 10

QUESTION 11

QUESTION 12

QUESTION 13

Task 7

Task 7.1

Task 7.2

Task 8

QUESTION 14

QUESTION 15

QUESTION 16

QUESTION 17

Calculating predicted values in R

Task 9

QUESTION 18

Transforming variables in the model

Task 10

Task 11

QUESTION 19

Standardised B

Task 12

QUESTION 20

Reporting

Task 13

Calculating predicted values in `R`