This is the 2022 version of the Analysing Data website. For the current incarnation, click here.
class: middle, inverse, title-slide

# Linear Model 1: A New Equation
### Dr Jennifer Mankin
### 7 March 2022

---

## Overview

- Reminder: the TAP!
- The Linear Model
- What is modeling?
- Model with continuous predictor
- Model with categorical predictor

---

## Reminder: The TAP

The **take-away paper** is currently live!

- See [Take-Away Paper Information](https://canvas.sussex.ac.uk/courses/17281/pages/take-away-paper-information):
  - Download the Rmd document to complete
  - All information on preparing and submitting the assessment
  - All necessary background information, tips, and FAQs
- If you encounter a technical problem, [book onto the help desk](https://canvas.sussex.ac.uk/courses/17281/external_tools/7463)
  - Two additional hours from 2:30-4:30 on Wednesday the 7th

---

## Objectives

After this lecture you will understand:

- What a statistical model is and why models are useful
- The equation for a linear model with one predictor
  - b<sub>0</sub> (the intercept)
  - b<sub>1</sub> (the slope)
- Using the equation to predict an outcome
- How to read scatterplots and lines of best fit

---

## The Linear Model

- Extremely common and fundamental testing paradigm
  - Predict the outcome *y* from one or more predictors (*x*s)
- Our first (explicit) contact with statistical modeling

--

- A **statistical model** is a mathematical expression that captures the relationship between variables
  - All of our test statistics are actually models!

---

## Maps as Models

- A map is a simplified depiction of the world
  - Captures the important elements (roads, cities, oceans, mountains)
  - *Doesn't* capture individual detail (where your gran lives)

--

- Depicts **relationships** between locations and geographical features
- Helps you **predict** what you will encounter in the world
  - E.g. if you keep walking south eventually you'll fall in the sea!
---

## Statistical Models

- A model is a simplified depiction of some relationship
- We want to **predict** what will happen in the world
- But the world is complex and full of noise (randomness)

--

- We can build a model to try to capture the important elements
  - Gather a sample that (we assume) is representative of the population
  - Investigate and quantify the relationships in that sample (i.e. construct a model)
  - Change/adjust the model to see what might happen with different parameters

---

## Statistical Models

- **Why** might it be useful to create a model like this?
- Can you think of any recent examples of such models?
  - [One example of modelling you might all be familiar with!](https://covid19.healthdata.org/global?view=cumulative-deaths&tab=trend)

---

## Predictors and Outcomes

- Now we start assigning our variables roles to play
- The **outcome** is the variable we want to explain
  - Also called the dependent variable, or DV
- The **predictors** are variables that may have a relationship with the outcome
  - Also called the independent variable(s), or IV(s)
- We measure or manipulate the predictors, then quantify the systematic change in the outcome
- NB: **YOU** (the researcher) assign these roles!
---

## General Model Equation

<span style = "font-size:1.8em"> `$$outcome = model + error$$` </span>

- We can use models to **predict** the outcome for a particular case
- This is always subject to some degree of **error**

---

## Linear Model Equation

<span style = "font-size:1.8em"> `$$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$$` </span>

- `\(y_{i}\)`: the predicted value of the outcome
- `\(b_{0}\)`: the intercept
- `\(b_{1}\)`: the slope
- `\(x_{1i}\)`: the predictor
- `\(e_{i}\)`: the error in prediction

<center>You may know her as `\(y = ax+ b\)`!</center>

---

## Linear Model Equation

<span style = "font-size:1.8em"> `$$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$$` </span>

- We will next see:
  - How we can create a line that captures the relationship between those two variables
  - How we can adapt this general LM equation to describe that line

---

## Visualising the Line

- Where would you draw a line through these dots that best captures where they tend to fall?

.codePanel[
```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>%
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-1-1.png)<!-- -->]

---

## Visualising the Line

.codePanel[
```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>%
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-2-1.png)<!-- -->]

---

## Visualising the Line

- The data points tend to be higher up on the left and lower down on the right
- So as the variable on
*x* (here, ratings of femininity) increases...
- The variable on *y* (here, ratings of masculinity) tends to decrease
- This represents a **negative relationship** between *x* and *y*: as one goes up, the other goes down
- Our line captures this by going downwards from left to right

---

## Visualising the Line

- Two key *parameters*: where the line starts, and its slope

.codePanel[
```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>%
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-3-1.png)<!-- -->]

---

## Modeling Gender Ratings

We can make some estimates:

- The line would cross the *y*-axis somewhere between 8 and 9 (close to 9)
  - `\(b_{0} \approx 8.5\)`

--

- Every time we go up one point on the femininity scale, masculinity goes down by a little less than one point
  - `\(b_{1} \approx -0.8\)`

---

## Modeling Gender Ratings

<span style = "font-size:1.8em"> `$$y_{i} = b_{0} + b_{1}x_{1i} + e_{i}$$` </span>

- `\(y_{i}\)` (outcome): Masculinity
- `\(x_{1i}\)` (predictor): Femininity
- `\(b_{0}\)` (intercept): the predicted value of masculinity when femininity is 0
- `\(b_{1}\)` (slope): **change** in masculinity associated with a **unit change** in femininity

--

<span style = "font-size:1.8em"> `$$Masculinity_{i} = b_{0} + b_{1}Femininity_{1i} + e_{i}$$` </span>

---

## Modeling Gender Ratings

How do we get the real numbers?
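In R, estimates of `\(b_{0}\)` and `\(b_{1}\)` come from fitting the model with `lm()`, using a formula of the form `outcome ~ predictor`. A minimal sketch, assuming a simulated stand-in data frame: `toy` and its made-up values are hypothetical and are not the real `gensex` data, so the estimates will only be roughly similar to the eyeballed values above.

```r
# Simulated stand-in for the gender ratings (hypothetical values, NOT the real gensex data)
set.seed(1)
toy <- data.frame(gender_fem = sample(0:9, size = 200, replace = TRUE))

# Build the outcome as intercept + slope * predictor + random error
toy$gender_masc <- 8.8 - 0.8 * toy$gender_fem + rnorm(200, sd = 1)

# Fit the linear model: outcome on the left of ~, predictor on the right
toy_lm <- lm(gender_masc ~ gender_fem, data = toy)

# Estimated intercept (b0) and slope (b1)
coef(toy_lm)
```

Because the simulated outcome includes random error, the estimates land near, but not exactly on, the values used to generate the data.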
--

```
##
## Call:
## lm(formula = gender_masc ~ gender_fem, data = gensex)
##
## Coefficients:
## (Intercept)   gender_fem
##      8.8246      -0.7976
```

Adapt our equation to include the real *b* values:

<span style = "font-size:1.5em"> `$$Masculinity_{i} = 8.82 -0.8\times Femininity_{1i} + e_{i}$$` </span>

---

## Predicting Gender

- We can now use this model to **predict** someone's rating of masculinity, if we know their rating of femininity
  - Someone who doesn't identify strongly with femininity: `gender_fem` = 3
- What would the model **predict** for this person's masculinity rating?

<span style = "font-size:1.8em"> `$$Masculinity_{i} = 8.82 -0.8\times Femininity_{1i}$$` </span>

---

## Predicting Gender

`\(Masculinity_{i} = 8.82 -0.8\times Femininity_{1i}\)`

- `\(Masculinity_{i} = 8.82 -0.8\times 3\)`
- `\(Masculinity_{i} = 6.42\)`

So, someone with femininity = 3 is **predicted** to have a masculinity rating of 6.42

- This is subject to some (unknowable!) degree of error

---

## Predicting Gender

Someone with a femininity rating of 3 is **predicted** to have a masculinity rating of 6.42

.codePanel[
```r
gensex %>%
  mutate(gender = fct_explicit_na(gender)) %>%
  ggplot(aes(x = gender_fem, y = gender_masc)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_vline(xintercept = 3, linetype = "dashed") +
  geom_hline(yintercept = 6.42, linetype = "dashed") +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-6-1.png)<!-- -->]

---

## Interim Summary

- The linear model predicts the outcome *y* based on a predictor *x*
- General form: `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)`
- `\(b_{0}\)`, the intercept, is the value of *y* when *x* is 0
- `\(b_{1}\)`, the slope, is the change in *y* for every unit change in *x*
- The slope, `\(b_{1}\)`, is the key piece of
information, because it represents the relationship between the predictor and the outcome

--

- Up next: categorical predictors

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## Words and Colours

In [Tutorial 5](https://and.netlify.app/tutorials/05), we looked at synaesthesia and imagery

- Let's revisit those ideas using the linear model!

--

- If I wanted to **predict** the next random person's overall imagery score...
- What would be the most sensible *estimate*?

---

## Making Predictions

.codePanel[
```r
syn_data %>%
  ggplot(aes(x = overall_img)) +
  geom_histogram(breaks = syn_data %>% pull(overall_img) %>% unique()) +
  scale_x_continuous(name = "Overall Imagery Score", limits = c(1, 5)) +
  scale_y_continuous(name = "Count") +
  scale_fill_discrete(name = "Synaesthesia") +
  geom_vline(aes(xintercept = mean(overall_img)), colour = "purple3", linetype = "dashed") +
  annotate("text",
           x = mean(syn_data$overall_img) + .1,
           y = syn_data$overall_img %>% max(table(.)) + 1,
           label = paste0("Mean: ", mean(syn_data$overall_img) %>% round(2)),
           hjust=0, colour = "purple4")
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-7-1.png)<!-- -->]

---

## Making Predictions

- Without any other information, the best estimate is the mean of the outcome
- But we *do* have more information!
--

- Grapheme-colour synaesthetes score higher than non-synaesthetes on overall imagery on average
- We could make a better **prediction** if we knew whether that person was a synaesthete
- Use the mean score in the synaesthete vs non-synaesthete groups

---

## Modeling Imagery

For non-synaesthetes, mean overall imagery = 3.25

- We will treat them as the **baseline** and give them a group code of 0

.codePanel[
```r
img_summary %>%
  ggplot2::ggplot(aes(x = syn, y = mean_img)) +
  geom_errorbar(aes(ymin = mean_img - 2*se_img, ymax = mean_img + 2*se_img), width = .1) +
  geom_point(colour = "black", fill = "orange", pch = 23) +
  scale_y_continuous(name = "Overall Imagery Score", limits = c(3, 4)) +
  labs(x = "Presence of Synaesthesia") +
  geom_label(stat = 'summary', fun = mean,
             aes(label = paste0("Mean: ", round(..y.., 2))),
             nudge_x = 0.1, hjust = 0) +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-9-1.png)<!-- -->]

---

## Modeling Imagery

For synaesthetes, mean overall imagery = 3.59

- We will treat them as the **comparison** group and give them a group code of 1

.codePanel[
```r
bracket <- syn_data %>%
  group_by(syn) %>%
  summarise(y = mean(overall_img)) %>%
  mutate(x = rep(2.7, 2), y = round(y, 2))

syn_data %>%
  ggplot(aes(x = syn, y = overall_img)) +
  #geom_line(stat="summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name =
    "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-10-1.png)<!-- -->]

---

## Modeling Imagery

We want to write an equation that will give a different prediction depending on whether someone is a synaesthete or not

- `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)`
  - `\(y\)` = Overall imagery score
  - `\(x_{1}\)` = Synaesthesia (0 = No, 1 = Yes)
- `\(OverallImagery_{i} = b_{0} + b_{1}Syn_{1i}\)`
- How do we find out `\(b_{0}\)` and `\(b_{1}\)`?

---

## Estimating the Line

- Where would you draw a line through these dots that best captures where they tend to fall?

.codePanel[
```r
bracket <- syn_data %>%
  group_by(syn) %>%
  summarise(y = mean(overall_img)) %>%
  mutate(x = rep(2.7, 2), y = round(y, 2))

syn_data %>%
  ggplot(aes(x = syn, y = overall_img)) +
  #geom_line(stat="summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name = "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-11-1.png)<!-- -->]

---

## Estimating the Line

This line is our **linear model**, with the same properties as the last one!
.codePanel[
```r
syn_data %>%
  ggplot(aes(x = syn, y = overall_img)) +
  geom_line(stat="summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name = "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-12-1.png)<!-- -->]

---

## Modeling Imagery

- The line starts from the **mean** of the non-synaesthete group = 3.25
- This is the **intercept**, `\(b_{0}\)`
  - The predicted value of the outcome when the predictor is 0
  - Our predictor is `syn` group, where no synaesthesia = 0

--

- When we switch from looking at non-synaesthetes to synaesthetes, predicted overall imagery changes by 0.34
- This is the **slope** of the line, `\(b_{1}\)`
  - The change in the outcome for every **unit change** in the predictor
  - Here, a "unit change" means switching groups, from 0 (non-syn) to 1 (syn)

--

<span style = "font-size:1.8em"> `$$OverallImagery_{i} = 3.25 + 0.34 \times Syn_{1i}$$` </span>

---

## Using `lm()`

```
##
## Call:
## lm(formula = overall_img ~ syn, data = syn_data)
##
## Coefficients:
## (Intercept)       synYes
##      3.2539       0.3361
```

.codePanel[
```r
syn_data %>%
  ggplot(aes(x = syn, y = overall_img)) +
  geom_line(stat="summary", fun = mean, aes(group = NA), colour = "red") +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary",
             fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust=0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,4)) +
  scale_y_continuous(name = "Overall Imagery Score") +
  labs(x = "Presence of Synaesthesia") +
  cowplot::theme_cowplot()
```

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-14-1.png)<!-- -->]

---

## Checking Predictions

If I wanted to **predict** the next random person's overall imagery score...

- First, ask them if they're a synaesthete or not!
  - "Yes" = 1, "No" = 0

`$$OverallImagery_{i} = 3.25 + 0.34 \times Syn_{1i}$$`

--

If yes, then `\(Syn_{1i} = 1\)`:

- `\(OverallImagery_{i} = 3.25 + 0.34 \times 1\)`
- `\(OverallImagery_{i} = 3.59\)`

--

If no, then `\(Syn_{1i} = 0\)`:

- `\(OverallImagery_{i} = 3.25 + 0.34 \times 0\)`
- `\(OverallImagery_{i} = 3.25\)`

--

So, we can predict imagery score based on group membership, just as we predicted masculinity score based on femininity score earlier!

---

## Welcome to the World of `lm()`

- The **l**inear **m**odel (`lm()`) will be our focus from here on out
- If this is unfamiliar to you, it's **highly recommended** that you revise linear equations!
  - [Visualisation on the Analysing Data website](https://and.netlify.app/viz/app/?v=reg_line&t=Linear%20equation)
  - [Khan Academy intro to linear equations](https://www.khanacademy.org/math/pre-algebra/xb4832e56:functions-and-linear-models/xb4832e56:linear-models/)
  - [Learning Statistics with R](https://learningstatisticswithr.com/lsr-0.6.pdf) - see Chapter V, Linear Regression
- Linear models will be crucial for **the rest of your degree**

---

## Summary

- The linear model expresses the relationship between at least one predictor, *x*, and an outcome, *y*
- Linear model equation: `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)`
- Key for statistical testing is the parameter `\(b_{1}\)`, which expresses the relationship between *x* and *y*
- Used to **predict** the outcome for a given value of the predictor
- Next week: LM2 - significance and model fit
- Don't forget to do the TAP!

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]
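---

The summary points about `\(b_{0}\)`, `\(b_{1}\)`, and prediction can be checked directly for a two-group predictor. This is a minimal sketch under stated assumptions: the data frame `toy` is simulated around made-up group means and is not the real `syn_data`; only the variable names `syn` and `overall_img` are borrowed from the slides. It illustrates that the intercept equals the baseline group mean, the slope equals the difference between group means, and `predict()` applies the fitted equation to new cases.

```r
# Simulated stand-in for the synaesthesia data (hypothetical values, NOT the real syn_data)
set.seed(2022)
toy <- data.frame(
  syn = factor(rep(c("No", "Yes"), times = c(60, 40)), levels = c("No", "Yes"))
)
# Scores scattered around made-up group means
toy$overall_img <- ifelse(toy$syn == "Yes", 3.6, 3.25) + rnorm(nrow(toy), sd = 0.5)

# Mean of the outcome in each group
group_means <- tapply(toy$overall_img, toy$syn, mean)

# Linear model with a categorical predictor; "No" is the baseline (coded 0)
toy_lm <- lm(overall_img ~ syn, data = toy)

# b0 (intercept) matches the mean of the baseline group
coef(toy_lm)[["(Intercept)"]]
group_means[["No"]]

# b1 (slope) matches the difference between the two group means
coef(toy_lm)[["synYes"]]
group_means[["Yes"]] - group_means[["No"]]

# predict() computes b0 + b1 * x for new cases, here one per group
new_people <- data.frame(syn = factor(c("No", "Yes"), levels = c("No", "Yes")))
predict(toy_lm, newdata = new_people)
```

With a single two-level predictor, the two predictions are simply the two group means, which mirrors the hand calculation in the Checking Predictions slide.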