# Linear Model 1: A New Equation
### Dr Jennifer Mankin
### 7 March 2022

---

## Overview

- Reminder: the TAP!

- The Linear Model

- What is modeling?
  
  - Model with continuous predictor
  
  - Model with categorical predictor

## Reminder: The TAP

The **take-away paper** is currently live!

- See [Take-Away Paper Information](https://canvas.sussex.ac.uk/courses/17281/pages/take-away-paper-information):

- Download the Rmd document to complete
  
  - All information on preparing and submitting the assessment
  
  - All necessary background information, tips, and FAQs
  
- If you encounter a technical problem, [book onto the help desk](https://canvas.sussex.ac.uk/courses/17281/external_tools/7463)

- Two additional hours from 2:30-4:30 on Wednesday the 7th

---

## Objectives

After this lecture you will understand:

- What a statistical model is and why models are useful

- The equation for a linear model with one predictor

- b0 (the intercept)

- b1 (the slope)

- Using the equation to predict an outcome

- How to read scatterplots and lines of best fit

---

## The Linear Model

- Extremely common and fundamental testing paradigm

- Predict the outcome *y* from one or more predictors (*x*s)

- Our first (explicit) contact with statistical modeling
  
--

- A **statistical model** is a mathematical expression that captures the relationship between variables

- All of our test statistics are actually models!
  
---

## Maps as Models

- A map is a simplified depiction of the world

- Captures the important elements (roads, cities, oceans, mountains)

- *Doesn't* capture individual detail (where your gran lives)
  
--

- Depicts **relationships** between locations and geographical features

- Helps you **predict** what you will encounter in the world

- E.g. if you keep walking south eventually you'll fall in the sea!

---

## Statistical Models

- A model is a simplified depiction of some relationship

- We want to **predict** what will happen in the world

- But the world is complex and full of noise (randomness)

- We can build a model to try to capture the important elements

- Gather a sample that (we assume) is representative of the population
  
  - Investigate and quantify the relationships in that sample (i.e. construct a model)

- Change/adjust the model to see what might happen with different parameters
  
---

## Statistical Models

- **Why** might it be useful to create a model like this?

- Can you think of any recent examples of such models?

- [One example of modelling you might all be familiar with!](https://covid19.healthdata.org/global?view=cumulative-deaths&tab=trend)

---

## Predictors and Outcomes

- Now we start assigning our variables roles to play

- The **outcome** is the variable we want to explain
  
  - Also called the dependent variable, or DV

- The **predictors** are variables that may have a relationship with the outcome

- Also called the independent variable(s), or IV(s)

- We measure or manipulate the predictors, then quantify the systematic change in the outcome

- NB: **YOU** (the researcher) assign these roles!
  
---

## General Model Equation

`$$outcome = model + error$$`

- We can use models to **predict** the outcome for a particular case

- This is always subject to some degree of **error**

---

## Linear Model Equation

`$$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$$`

- `$y_{i}$`: the value of the outcome for case *i*

- `$b_{0}$`: the intercept

- `$b_{1}$`: the slope

- `$x_{1i}$`: the predictor

- `$e_{i}$`: the error in prediction
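To make the pieces of the equation concrete, here is a minimal sketch in R. The values of `b0`, `b1`, `x`, and `y` are made up for illustration and do not come from any dataset in this lecture.

```r
# Hypothetical intercept and slope (illustrative values only)
b0 <- 2
b1 <- 0.5

# Made-up predictor and observed outcome values for four cases
x <- c(1, 2, 3, 4)
y <- c(2.7, 2.9, 3.4, 4.2)

# The "model" part of the equation: b0 + b1 * x
predicted <- b0 + b1 * x

# The "error" part: whatever is left over for each case
error <- y - predicted

# Putting the pieces back together recovers the observed outcome
all.equal(predicted + error, y)
```

Each case gets its own prediction and its own error, which is what the subscript *i* keeps track of.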

---

## Linear Model Equation

`$$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$$`

- We will next see:

- How we can create a line that captures the relationship between those two variables

- How we can adapt this general LM equation to describe that line

---

## Visualising the Line

- Where would you draw a line through these dots that best captures where they tend to fall?

```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>% 
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-1-1.png)

---

## Visualising the Line

```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>% 
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-2-1.png)

---

## Visualising the Line

- The data points tend to be higher up on the left and lower down on the right

- So as the variable on *x* (here, ratings of femininity) increases...
  
  - The variable on *y* (here, ratings of masculinity) tends to decrease
  
  - This represents a **negative relationship** between *x* and *y*: as one goes up, the other goes down

- Our line captures this by going downwards from left to right

---

## Visualising the Line

- Two key *parameters*: where the line starts, and its slope

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-3-1.png)

---

## Modeling Gender Ratings

We can make some estimates:

- The line would cross the *y*-axis somewhere between 8 and 9 (close to 9)

- `$b_{0} \approx 8.5$`

--
  
- Every time we go up one point on the femininity scale, masculinity goes down by a little less than one point
  
  - `$b_{1} \approx -0.8$`

---
  
## Modeling Gender Ratings

`$$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$$`

- `$y_{i}$` (outcome): Masculinity

- `$x_{1i}$` (predictor): Femininity

- `$b_{0}$` (intercept): the predicted value of masculinity when femininity is 0

- `$b_{1}$` (slope): **change** in masculinity associated with a **unit change** in femininity

`$$Masculinity_{i} = b_{0} + b_{1}Femininity_{1i} + e_{i}$$`

---

## Modeling Gender Ratings

How do we get the real numbers?

```
## 
## Call:
## lm(formula = gender_masc ~ gender_fem, data = gensex)
## 
## Coefficients:
## (Intercept)   gender_fem  
##      8.8246      -0.7976
```

Adapt our equation to include the real *b* values:

`$$Masculinity_{i} = 8.82 -0.8\times Femininity_{1i} + e_{i}$$`
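The `Call:` line in the output shows the form of the command. Because the `gensex` data are not included in these slides, the sketch below uses simulated data (a hypothetical `toy` data frame, generated with an intercept and slope chosen to resemble the estimates above). It is only meant to show how `lm()` is called and how the *b* values can be extracted with `coef()`.

```r
# Simulated stand-in for the real data (NOT the gensex dataset)
set.seed(1)
toy <- data.frame(gender_fem = sample(0:9, 200, replace = TRUE))

# Outcome built from an assumed intercept (8.8), slope (-0.8) and random error
toy$gender_masc <- 8.8 - 0.8 * toy$gender_fem + rnorm(200, sd = 1)

# Fit the linear model: outcome ~ predictor
toy_lm <- lm(gender_masc ~ gender_fem, data = toy)

# Extract b0 ("(Intercept)") and b1 ("gender_fem")
coef(toy_lm)
```

The estimates should be close to, but not exactly, the values used to simulate the data, because of the random error added to each case.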

---

## Predicting Gender

- We can now use this model to **predict** someone's rating of masculinity, if we know their rating of femininity

- Take someone who doesn't identify strongly with femininity: `gender_fem` = 3
  
  - What would the model **predict** for this person's masculinity rating?
 

`$$Masculinity_{i} = 8.82 -0.8\times Femininity_{1i}$$`

---

## Predicting Gender

`$Masculinity_{i} = 8.82 -0.8\times Femininity_{1i}$`

- `$Masculinity_{i} = 8.82 -0.8\times 3$`
  
  - `$Masculinity_{i} = 6.42$`

So, someone with femininity = 3 is **predicted** to have a masculinity rating of 6.42

- This is subject to some (unknowable!) degree of error
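The same arithmetic can be written as a small R function. This is only a sketch: `predict_masc()` is a hypothetical helper name, and the *b* values are the rounded estimates from the equation above.

```r
# Rounded estimates taken from the equation on this slide
b0 <- 8.82
b1 <- -0.8

# Hypothetical helper: predicted masculinity for a given femininity rating
predict_masc <- function(fem) {
  b0 + b1 * fem
}

# Prediction for someone with a femininity rating of 3
predict_masc(3)
```

With a fitted model object, `predict()` with a `newdata` data frame does the same job using the unrounded estimates.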
  
---

## Predicting Gender

Someone with a femininity rating of 3 is **predicted** to have a masculinity rating of 6.42

```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>% 
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) + 
  geom_vline(xintercept = 3, linetype = "dashed") +
  geom_hline(yintercept = 6.42, linetype = "dashed") +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-6-1.png)

---

## Interim Summary

- The linear model predicts the outcome *y* based on a predictor *x*

- General form: `$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$`
  
  - `$b_{0}$`, the intercept, is the value of *y* when *x* is 0

- `$b_{1}$`, the slope, is the change in *y* for every unit change in *x*

- The slope, `$b_{1}$`, is the key piece of information, because it represents the relationship between the predictor and the outcome

- Up next: categorical predictors

---


## Words and Colours

In [Tutorial 5](https://and.netlify.app/tutorials/05), we looked at synaesthesia and imagery

- Let's revisit those ideas using the linear model!

- If I wanted to **predict** the next random person's overall imagery score...

- What would be the most sensible *estimate*?

---

## Making Predictions

```r
syn_data %>% 
  ggplot(aes(x = overall_img)) +
  geom_histogram(breaks = syn_data %>% pull(overall_img) %>% unique()) +
  scale_x_continuous(name = "Overall Imagery Score",
                     limits = c(1, 5)) +
  scale_y_continuous(name = "Count") +
  scale_fill_discrete(name = "Synaesthesia") +
  geom_vline(aes(xintercept = mean(overall_img)),
             colour = "purple3",
             linetype = "dashed") + 
  annotate("text", x = mean(syn_data$overall_img) + .1, y = syn_data$overall_img %>% max(table(.)) + 1,
           label = paste0("Mean: ", mean(syn_data$overall_img) %>% round(2)),
           hjust=0, colour = "purple4")
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-7-1.png)

---

## Making Predictions

- Without any other information, the best estimate is the mean of the outcome

- But we *do* have more information!
  
--

- Grapheme-colour synaesthetes score higher than non-synaesthetes on overall imagery on average

- We could make a better **prediction** if we knew whether that person was a synaesthete

- Use the mean score in the synaesthete vs non-synaesthete groups
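One way to see the "best estimate is the mean" idea in R is to fit a linear model with no predictors at all. This is a sketch with made-up scores, not the lecture data. The formula `scores ~ 1` asks `lm()` for an intercept only.

```r
# Made-up imagery scores for six people (not the lecture data)
scores <- c(3.1, 3.4, 2.9, 3.8, 3.6, 3.3)

# A linear model with no predictors, only an intercept
null_model <- lm(scores ~ 1)

# The intercept is the model's single prediction for everyone...
b0 <- coef(null_model)[["(Intercept)"]]

# ...and it is the same as the mean of the outcome
all.equal(b0, mean(scores))
```

Adding a predictor, such as group membership, lets the model make different predictions for different people.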
  
---

## Modeling Imagery

For non-synaesthetes, mean overall imagery = 3.25
  
  - We will treat them as the **baseline** and give them a group code of 0

```r
img_summary %>% 
  ggplot2::ggplot(aes(x = syn, y = mean_img)) +
  geom_errorbar(aes(ymin = mean_img - 2*se_img, ymax = mean_img + 2*se_img), width = .1) +
  geom_point(colour = "black", fill = "orange", pch = 23) +
  scale_y_continuous(name = "Overall Imagery Score",
                     limits = c(3, 4)) +
  labs(x = "Presence of Synaesthesia") +
  geom_label(stat = 'summary', fun = mean, aes(label = paste0("Mean: ", round(..y.., 2))), nudge_x = 0.1, hjust = 0) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-9-1.png)

---

## Modeling Imagery

For synaesthetes, mean overall imagery = 3.59
  
  - We will treat them as the **comparison** group and give them a group code of 1

```r
bracket <- syn_data %>%
 group_by(syn) %>%
 summarise(y = mean(overall_img)) %>%
 mutate(x = rep(2.7, 2),
 y = round(y, 2))

syn_data %>% 
  ggplot(aes(x = syn, y = overall_img)) +
  #geom_line(stat="summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4,
             nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) + 
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name = "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-10-1.png)

---

## Modeling Imagery

We want to write an equation that will give a different prediction depending on whether someone is a synaesthete or not

- `$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$`

- `$y$` = Overall imagery score

- `$x_{1}$` = Synaesthesia (0 = No, 1 = Yes)
  
- `$OverallImagery_{i} = b_{0} + b_{1}Syn_{1i}$`

- How do we find out `$b_{0}$` and `$b_{1}$`?

---

## Estimating the Line

- Where would you draw a line through these dots that best captures where they tend to fall?

```r
bracket <- syn_data %>%
 group_by(syn) %>%
 summarise(y = mean(overall_img)) %>%
 mutate(x = rep(2.7, 2),
 y = round(y, 2))
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-11-1.png)

---

## Estimating the Line

This line is our **linear model**, with the same properties as the last one!

```r
syn_data %>% 
  ggplot(aes(x = syn, y = overall_img)) +
  geom_line(stat="summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4,
             nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) + 
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name = "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-12-1.png)

---

## Modeling Imagery

- The line starts from the **mean** of the non-synaesthete group = 3.25

- This is the **intercept**, `$b_{0}$`

- The predicted value of the outcome when the predictor is 0
  
  - Our predictor is `syn` group, where no synaesthesia = 0

- When we switch from looking at non-synaesthetes to synaesthetes, predicted overall imagery changes by 0.34

- This is the **slope** of the line, `$b_{1}$`
  
  - The change in the outcome for every **unit change** in the predictor
  
  - Here, a "unit change" means switching groups, from 0 (non-syn) to 1 (syn)
  
--

`$$OverallImagery_{i} = 3.25 + 0.34 \times Syn_{1i}$$`

---

## Using `lm()`

```
## 
## Call:
## lm(formula = overall_img ~ syn, data = syn_data)
## 
## Coefficients:
## (Intercept)       synYes  
##      3.2539       0.3361
```

```r
syn_data %>% 
  ggplot(aes(x = syn, y = overall_img)) +
  geom_line(stat="summary", fun = mean, aes(group = NA), colour = "red") +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4,
             nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) + 
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name = "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-14-1.png)
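To check the link between group means and the `lm()` coefficients, here is a small sketch with a made-up data frame (`toy_syn` is hypothetical and its scores are invented, so the numbers differ from the output above). It assumes `syn` is a factor with "No" as the first level, so that "No" is treated as the baseline.

```r
# Made-up data: four non-synaesthetes and four synaesthetes
toy_syn <- data.frame(
  syn = factor(rep(c("No", "Yes"), each = 4), levels = c("No", "Yes")),
  overall_img = c(2.8, 3.0, 3.2, 3.4, 3.4, 3.6, 3.8, 4.0)
)

# Mean imagery score in each group
group_means <- tapply(toy_syn$overall_img, toy_syn$syn, mean)

# Linear model with the categorical predictor
toy_fit <- lm(overall_img ~ syn, data = toy_syn)
b <- coef(toy_fit)

# b0 is the mean of the baseline ("No") group
all.equal(b[["(Intercept)"]], group_means[["No"]])

# b1 is the difference between the two group means
all.equal(b[["synYes"]], group_means[["Yes"]] - group_means[["No"]])
```

The coefficient is labelled `synYes` because R names it after the predictor and the non-baseline level.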

---

## Checking Predictions

If I wanted to **predict** the next random person's overall imagery score...

- First, ask them if they're a synaesthete or not!
  
  - "Yes" = 1, "No" = 0

`$$OverallImagery_{i} = 3.25 + 0.34 \times Syn_{1i}$$`

If yes, then `$Syn_{1i} = 1$`:

- `$OverallImagery_{i} = 3.25 + 0.34 \times 1$`
  
  - `$OverallImagery_{i} = 3.59$`
  
--

If no, then `$Syn_{1i} = 0$`:

- `$OverallImagery_{i} = 3.25 + 0.34 \times 0$`
  
  - `$OverallImagery_{i} = 3.25$`
  
--

So, we can predict imagery score based on group membership, just as we predicted masculinity score based on femininity score earlier!
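The two predictions can be reproduced in one step in R. This is a sketch using the rounded estimates from the equation above. `predict_imagery()` is a hypothetical helper name.

```r
# Hypothetical helper using the rounded b values from the slide
# syn: 0 = non-synaesthete, 1 = synaesthete
predict_imagery <- function(syn) {
  3.25 + 0.34 * syn
}

# Predictions for both groups at once: first "No" (0), then "Yes" (1)
predict_imagery(c(0, 1))
```

Because the predictor can only be 0 or 1, the model can only ever produce these two predictions: the two group means.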

---

## Welcome to the World of `lm()`

- The **l**inear **m**odel (`lm()`) will be our focus from here on out

- If this is unfamiliar to you, it's **highly recommended** that you revise linear equations!
  
  - [Visualisation on the Analysing Data website](https://and.netlify.app/viz/app/?v=reg_line&t=Linear%20equation)
  
  - [Khan Academy intro to linear equations](https://www.khanacademy.org/math/pre-algebra/xb4832e56:functions-and-linear-models/xb4832e56:linear-models/)
  
  - [Learning Statistics with R](https://learningstatisticswithr.com/lsr-0.6.pdf) - see Chapter V, Linear Regression

- Linear models will be crucial for **the rest of your degree**

---

## Summary

- The linear model expresses the relationship between at least one predictor, *x*, and an outcome, *y*

- Linear model equation: `$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$`
  
  - Key for statistical testing is the parameter `$b_{1}$`, which expresses the relationship between *x* and *y*
  
- Used to **predict** the outcome for a given value of the predictor

- Next week: LM2 - significance and model fit

- Don't forget to do the TAP!
