Practical 11 worksheet

Instructions

This is a marked worksheet that contains 7 questions. The questions cover topics from last week's lectures and skills lab, and the tutorial you just completed. Before you begin the worksheet, you should first read these instructions and complete the analyses described in "Analysis", below.

You will have 7 randomly selected questions to answer; do not be surprised if you have different questions from others working on the same worksheet!

To access the worksheet, you must attend your practical session. In the session, a passcode will be announced to unlock the worksheet; you must begin the worksheet within 5 minutes of the passcode being released. You will have 30 minutes to complete the worksheet, unless you have reasonable adjustments for extra time (38 minutes for 25% extra time, and 45 minutes for 50% extra time).


Academic Honesty

You are welcome to use module resources - e.g. lecture slides, tutorials, skills labs scripts - to answer these questions. You are also welcome to use RStudio to solve problems, do calculations, or refer to output. However, you should not work with other students while you are completing the worksheet, and tutors will only be able to answer questions about technical problems (e.g. computer crash).

Setting Up

Task 1

Open your project for this week in RStudio. Then, open a new Markdown file with HTML output and save it in the r_docs folder. (Give it a sensible name, like worksheet_11 or similar!)

For each of the tasks in the Analysis section, write your code to complete the task in a new code chunk.

Remember, you can add new code chunks by:

  1. Using the RStudio toolbar: Click Code > Insert Chunk
  2. Using a keyboard shortcut: the default is Ctrl + Alt + I (Windows) or ⌘ Command + Option + I (MacOS), but you can change this under Tools > Modify Keyboard Shortcuts…
  3. Typing it out: ```{r}, press ↵ Enter, then ``` again.
  4. Copy and pasting a code chunk you already have (but be careful of duplicated chunk names!)

For this week’s stats practical, we’re continuing to work with the dataset of Pokémon characteristics to practice the linear model. You don’t need to know anything about Pokémon to do this practical.

Task 2

Load the tidyverse package and read in the pokemon dataset straight from https://and.netlify.app/datasets/pokemon.csv (we took the data from this fun website).

### load tidyverse if working on your own computer
library(tidyverse)

### on university computers, load
library(dplyr)
library(readr)

### read in data
pokemon <- readr::read_csv("https://and.netlify.app/datasets/pokemon.csv")

Making Predictions

First we need to choose a research question to investigate.

As future Pokémon Masters, we want to work out which Pokémon is the strongest. So, we’ll use attack as our outcome variable, which quantifies a Pokémon’s offensive capabilities.

For the first predictor, let’s choose the hp variable. HP or hit points is a geek gaming term name for health of your character/creature: The more HP a thing has, the more damage it can withstand.

Task 3

In your RMarkdown file, write down your prediction about the relationship between the predictor – hp – and the outcome, attack.

For the second predictor, choose a variable of your own choosing from the dataset. We’d recommend you choose from defense (defense stat rating), speed (how quick the Pokemon is), weight_kg (the Pokemon’s weight in kilograms), or height_m (the Pokemon’s height in meters). For the purposes of the solutions, I will choose speed, but you can pick whichever you like.

Task 4

In your RMarkdown file, write a second prediction about the relationship between your chosen second predictor and the outcome, attack.

Visualising the Relationship

Next up, we should have a look at our data. We already visualised the relationship between hp and attack last week, so this week, you should focus on your new, chosen predictor.

Task 5

Create a scatterplot of your new predictor and attack as the outcome.

pokemon %>% 
  ggplot(aes(x = speed, y = attack)) +
  geom_point()

Task 5.1

Label the axes better and clean up the plot with a theme.

pokemon %>% 
  ggplot(aes(x = speed, y = attack)) +
  geom_point() +
  scale_x_continuous(name = "Speed Rating", 
                     # seq() creates a vector of breaks at increments of 25
                     breaks = seq(0, max(pokemon$speed), by = 25)) +
  scale_y_continuous(name = "Attack Stat",
                     breaks = seq(0, max(pokemon$attack), by = 25)) +
  cowplot::theme_cowplot()

Creating the Model

Now that we have some idea of what to expect, we can create our linear model!

Task 6

Use the lm() function to run the analysis with both predictors and save the model in a new object called poke_lm_2.

# remember to use whichever predictor you've chosen!
poke_lm_2 <- lm(attack ~ hp + speed, data = pokemon)

Task 7

Call this poke_lm_2 object in the Console to view it, then write out the linear model from the output.

First, to see what’s in our model, we can just call it by typing its name in the console and pressing Enter.

poke_lm_2

Call:
lm(formula = attack ~ hp + speed, data = pokemon)

Coefficients:
(Intercept)           hp        speed  
    25.8240       0.4396       0.3274  

This simple version of the output gives us the bs for this model.

  • (Intercept): the intercept, here 25.82
  • hp: the slope associated with the HP predictor, here 0.44
  • speed: the slope associated with the speed predictor, here 0.33

Keep in mind that the slope will always be named with the predictor that it’s associated with!

So, we can write our linear model: \(\text{Attack}_i = 25.82 + 0.44 \times \text{HP}_i + 0.33 \times \text{speed}_i\).

Task 8

How can you interpret the value of b2 for this model? Write down your thoughts in your RMarkdown.

For the speed predictor, the value of b2 is 0.33. This is the predicted change in attack for each unit change in speed. In other words, for each point increase in the speed rating a Pokémon has, it’s predicted to a stronger attack ability by 0.33 - a positive relationship.

Task 9

Using your equation, what attack value would you predict a Pokémon with 86 HP and 100 speed to have?

Start with our equation: \(\text{Attack}_i = 25.82 + 0.44 \times \text{HP}_i + 0.33 \times \text{speed}_i\).

All we need to do is replace the value of HP with the specific value we’re interested in:

\[\begin{aligned}\text{Attack} &= 25.82 + 0.44 \times 86 + 0.33 \times 100\\ &= 25.82 + 37.84 + 33\\ &= 96.66\end{aligned}\]

Evaluating the Model

Now we have the model parameters, but we don’t want to just describe the line - we want to be able to say something about the population, not just our sample. For this, we need some more info!

Significance Testing

Task 10

Use broom::tidy() to get p-values for your bs. Are your predictors significant?

Remember, that in the so-called scientific notation, the number xe-n, where x and n are numbers means x × 10−n. For example 2.3e-4 means 2.3 × 10−4 which is the same as 2.3/10000 or 0.00023.

poke_lm_2 %>% broom::tidy()
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   25.8      3.31        7.79 2.08e-14
2 hp             0.440    0.0375     11.7  2.21e-29
3 speed          0.327    0.0345      9.49 2.51e-20

Yikes, look at those p-values! That’s a lot of 0s. I think we can safely say that all of these values are smaller than .05, although the one we really care about here is for the predictors.

Task 11

Add the conf.int = T argument to broom::tidy() to get confidence intervals. How can you interpret these? Do they “agree” with the p-value?

poke_lm_2 %>% broom::tidy(conf.int = T)
# A tibble: 3 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   25.8      3.31        7.79 2.08e-14   19.3      32.3  
2 hp             0.440    0.0375     11.7  2.21e-29    0.366     0.513
3 speed          0.327    0.0345      9.49 2.51e-20    0.260     0.395

Again, we’re mainly interested in b2, which is speed (we already talked last week about how to interpret the value associated with hp). First we want to know if these confidence intervals cross or include 0. Looking at these two values, neither of them are exactly 0, and both values are positive. So, they don’t cross or include 0.

Remember that the confidence intervals give us a range of likely values of b2 from other samples. If the confidence intervals don’t cross or include 0, as we can see here, it’s unlikely that we could have gathered a sample where b2 was 0. This “agrees” with our p-value, which indicated that the probability of obtaining the results we have, assuming the null hypothesis is true, is very very small. Altogether the evidence suggests that it’s quite unlikely that the relationship between speed and attack is actually 0 (no relationship at all).

Goodness of Fit

Task 12

Use broom::glance() to get R2 and adjusted R2 for this model. How much of the variance of the outcome is explained by the model for this sample? What would we expect this to be in the population?

poke_lm_2 %>% broom::glance()
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
1     0.253         0.251  27.8      135. 2.88e-51     2 -3799. 7607.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>

For now we can ignore everything in this output except for the values labeled r.squared and adj.r.squared. To answer the questions, we can simply convert them to percentages by multiplying by 100.

Variance in the outcome explained by the model in the sample: 25.3%

Estimated variance in the outcome explained by the model in the population: 25.11%

Comparing Predictors

We can compare the relative impact of the predictors if we standardise the bs - convert them into standard deviation units.

Task 13

Produce standardised bs using the parameters::model_parameters() with the `standardize = “refit” argument. Which predictor has a stronger relationship with the outcome?

parameters::model_parameters(poke_lm_2, standardize = "refit")
Parameter   | Coefficient |   SE |        95% CI |   t(798) |      p
--------------------------------------------------------------------
(Intercept) |    3.71e-16 | 0.03 | [-0.06, 0.06] | 1.21e-14 | > .999
hp          |        0.36 | 0.03 | [ 0.30, 0.42] |    11.72 | < .001
speed       |        0.29 | 0.03 | [ 0.23, 0.36] |     9.49 | < .001

These values are standardised, so we can compare the size of the coefficients directly. This means that the coefficient with the bigger value is the one that has the stronger effect on the outcome.

Reading the Summary

Finally, we can get all this same information - except for CIs and standardised Bs - from the summary() function. I (Jennifer) like summary() because you can get a good overview of a lot of information quickly, but it’s very inconvenient for reporting, so it’s good to know how to use the broom functions as well.

Task 14

Get a summary of your model. What do the asterisks (stars) mean?

poke_lm_2 %>% summary()

Call:
lm(formula = attack ~ hp + speed, data = pokemon)

Residuals:
     Min       1Q   Median       3Q      Max 
-147.100  -18.759   -2.561   16.271   99.451 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.82404    3.31490   7.790 2.08e-14 ***
hp           0.43962    0.03751  11.720  < 2e-16 ***
speed        0.32740    0.03449   9.494  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.83 on 798 degrees of freedom
Multiple R-squared:  0.253, Adjusted R-squared:  0.2511 
F-statistic: 135.1 on 2 and 798 DF,  p-value: < 2.2e-16

We can see all of the same numbers here that we saw before, even labeled the same way. summary() also gives us the stars at the end of each line, which give us an idea of what alpha level each element of the model is significant at. As you can see, both predictors have three stars, which corresponds to p < .001.

Reporting

Task 15

Report the results of your model. Specifically, you should clearly report both coefficients and the fit of the model (R2), including all the important details (see Tutorial 8). You should also report and interpret the standardised Bs.

You can read all the important information from the summary() and of your model type them up manually but you can also use the apa_print() function from the papaja package. To do that, let’s first see what the function returns:

papaja::apa_print(poke_lm_2)
$estimate
$estimate$Intercept
[1] "$b = 25.82$, 95\\% CI $[19.32$, $32.33]$"

$estimate$hp
[1] "$b = 0.44$, 95\\% CI $[0.37$, $0.51]$"

$estimate$speed
[1] "$b = 0.33$, 95\\% CI $[0.26$, $0.40]$"

$estimate$modelfit
$estimate$modelfit$r2
[1] "$R^2 = .25$"

$estimate$modelfit$r2_adj
[1] "$R^2_{adj} = .25$"

$estimate$modelfit$aic
[1] "$\\mathrm{AIC} = 7,606.56$"

$estimate$modelfit$bic
[1] "$\\mathrm{BIC} = 7,625.30$"



$statistic
$statistic$Intercept
[1] "$t(798) = 7.79$, $p < .001$"

$statistic$hp
[1] "$t(798) = 11.72$, $p < .001$"

$statistic$speed
[1] "$t(798) = 9.49$, $p < .001$"

$statistic$modelfit
$statistic$modelfit$r2
[1] "$F(2, 798) = 135.12$, $p < .001$"



$full_result
$full_result$Intercept
[1] "$b = 25.82$, 95\\% CI $[19.32$, $32.33]$, $t(798) = 7.79$, $p < .001$"

$full_result$hp
[1] "$b = 0.44$, 95\\% CI $[0.37$, $0.51]$, $t(798) = 11.72$, $p < .001$"

$full_result$speed
[1] "$b = 0.33$, 95\\% CI $[0.26$, $0.40]$, $t(798) = 9.49$, $p < .001$"

$full_result$modelfit
$full_result$modelfit$r2
[1] "$R^2 = .25$, $F(2, 798) = 135.12$, $p < .001$"



$table
A data.frame with 5 labelled columns:

  predictor estimate                 ci statistic p.value
1 Intercept    25.82 $[19.32$, $32.33]$      7.79  < .001
2        Hp     0.44   $[0.37$, $0.51]$     11.72  < .001
3     Speed     0.33   $[0.26$, $0.40]$      9.49  < .001

predictor: Predictor 
estimate : $b$ 
ci       : 95\% CI 
statistic: $t(798)$ 
p.value  : $p$ 
attr(,"class")
[1] "apa_results" "list"       

As you can see, the returned list contains quite a lot of stuff:

  • The estimate element contains, among other things your b and R2 coefficients along with their respective 95% and 90% CIs, respectively. The reason why R2 is reported with a 90% rather than a 95% CI is not really important right now.
  • The statistic element contains the results of the statistical tests of the individual coefficients (t and p-values) and the overall model fit (F-test, which we haven’t covered yet).
  • The full_results element contains the two combined.
  • Finally, the table element contains the data we can use to easily create a basic table of model results.

To ask for the individual elements, we should first save the output of the function in the environment by assigning it to a name:

res <- papaja::apa_print(poke_lm_2)

We can now use the dollar sign notation to ask for individual elements. Let’s see what the full_result one gives us:

res$full_result
$Intercept
[1] "$b = 25.82$, 95\\% CI $[19.32$, $32.33]$, $t(798) = 7.79$, $p < .001$"

$hp
[1] "$b = 0.44$, 95\\% CI $[0.37$, $0.51]$, $t(798) = 11.72$, $p < .001$"

$speed
[1] "$b = 0.33$, 95\\% CI $[0.26$, $0.40]$, $t(798) = 9.49$, $p < .001$"

$modelfit
$modelfit$r2
[1] "$R^2 = .25$, $F(2, 798) = 135.12$, $p < .001$"

The outcome is again a list we can further query using $:

# report full results for slope of HP
res$full_result$hp
[1] "$b = 0.44$, 95\\% CI $[0.37$, $0.51]$, $t(798) = 11.72$, $p < .001$"
# report full results for slope of speed
res$full_result$speed
[1] "$b = 0.33$, 95\\% CI $[0.26$, $0.40]$, $t(798) = 9.49$, $p < .001$"
# report model fit
res$full_result$modelfit$r2
[1] "$R^2 = .25$, $F(2, 798) = 135.12$, $p < .001$"

To report the results in the body of our report/paper/dissertation, we would say something like:

The model accounted for a substantial proportion of variance in the ouctome, \(R^2 = .25\), \(F(2, 798) = 135.12\), \(p < .001\). The number of hit points of a Pokémon was a significant predictor of its attack strength, \(b = 0.44\), 95% CI \([0.37\), \(0.51]\), \(t(798) = 11.72\), \(p < .001\). A Pokémon’s speed rating was also a significant predictor of its attack strength, \(b = 0.33\), 95% CI \([0.26\), \(0.40]\), \(t(798) = 9.49\), \(p < .001\). These results support our hypothesis. Comparing the effects of the two predictors using standardised Bs, HP had a greater effect on attack (B = 0.36) than speed (B = 0.29).

You can also easily report the results in a nice table:

# <br> in caption is a line break
res$table %>%
  kableExtra::kbl(
    col.names = c("Predictor", "*b*", "95% CI", "*t*(799)", "*p*"),
    caption = "Table 1<br>*Results of the linear model predicting attack strength by the number of hit points (HP) and speed rating (Speed).*"
  ) %>%
  kableExtra::kable_classic(full_width=FALSE)
Table 1: Table 1
Results of the linear model predicting attack strength by the number of hit points (HP) and speed rating (Speed).
Predictor b 95% CI t(799) p
Intercept 25.82 \([19.32\), \(32.33]\) 7.79 < .001
Hp 0.44 \([0.37\), \(0.51]\) 11.72 < .001
Speed 0.33 \([0.26\), \(0.40]\) 9.49 < .001

I Know It’s My Destiny

That’s as far as we’ll go today, well done!

The Linear Model is all of our destinies for the next year or so, so it’s important to get comfortable working with it.