class: middle, inverse, title-slide

# Chi-square

### Dr Jennifer Mankin

### 28 February 2022

---

## Looking Ahead (and Behind)

- Last week: Correlation

--

- This week: Chi-Square ( `\(\chi^2\)` )

--

- Week 6: *t*-test

--

- Week 7: The Linear Model
- Week 8: The Linear Model

---

## Lab Report: Green Study

- Today we will talk about one of the analyses for the lab report
  - `\(\chi^2\)`: Green study (Griskevicius et al., 2010)
  - *t*-test: Red study (Elliott et al., 2020), next week
- We will talk about the lab report in the lectures **and** work on it in the practicals
  - Make sure you come to your registered sessions

---

## Objectives

After this lecture you will understand:

- The concepts behind tests of goodness-of-fit and association
- How to calculate the `\(\chi^2\)` statistic
- How to read tables and figures of counts
- How to interpret and report significance tests of `\(\chi^2\)`
- The relationship between association and causation

---

## Roll of the Dice

<img src="/lectures_assets/06/all_dice.jpg" alt="Photo of two plastic organising boxes full of dice of various shapes and colours, from large 20-sided dice to 4-sided dice" height="500">

---

## Roll of the Dice

- I want to know if my four-sided die (d4) is fair
- If it is, each number should come up with equal probability
  - Four numbers = 100%/4 = 25% probability of rolling each number
- So, if I roll the die 100 times, each number should come up (approximately) 25 times

---

## Roll of the Dice

<br><br>

.pull-left[
![Photo of a four-sided die in a padded box, next to a piece of paper with tally marks counting rolls](/lectures_assets/06/dice.jpg)
]

<br><br>

.pull-right[

| Dice Roll | Observed Count |
|:----------|---------------:|
| 1         |             25 |
| 2         |             29 |
| 3         |             24 |
| 4         |             22 |

]

---

## A Fair Shake?

- These numbers are not *exactly* 25/25/25/25
- But we live in a random universe!
- How different is **different enough** to believe that the die is not actually fair?

![](slides_files/figure-html/unnamed-chunk-2-1.png)

---

### Steps of the Analysis

- Calculate the (standardised) *difference* between observed and expected frequencies

--

- Compare that *test statistic* to its distribution under the null hypothesis

--

- Obtain the probability *p* of encountering a test statistic of the size we have, or larger, if the null hypothesis is true

--

- ?????
- Profit

---

## Step 1: Calculate a Test Statistic

- How different are the *observed* counts from the *expected* counts?

.left[

| Dice Roll | Obs. Count | Exp. Count |
|:----------|-----------:|-----------:|
| 1         |         25 |         25 |
| 2         |         29 |         25 |
| 3         |         24 |         25 |
| 4         |         22 |         25 |

]

--

.center[
`\(\chi^2 = \frac{(25-25)^2}{25} + \frac{(29-25)^2}{25} + \frac{(24-25)^2}{25} + \frac{(22-25)^2}{25}\)`
]

--

.center[
`\(\chi^2 = \frac{0}{25} + \frac{16}{25} + \frac{1}{25} + \frac{9}{25}\)`
]

--

.center[
`\(\chi^2 = 0 + 0.64 + 0.04 + 0.36\)`
]

--

.center[
The total squared (and scaled) difference between *observed* and *expected* counts is the sum of those four numbers, or 1.04
]
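The same arithmetic can be sketched in R; the vector names `observed` and `expected` below are just for illustration, not objects from the lecture code:

```r
# Observed rolls from the table above; expected counts if the die is fair
observed <- c(25, 29, 24, 22)
expected <- c(25, 25, 25, 25)

# Squared difference for each category, scaled by its expected count, then summed
sum((observed - expected)^2 / expected)
# 1.04
```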
---

## Step 2: Compare to the Distribution

- We've calculated a *test statistic* that represents the thing we are trying to test
- Is this test statistic big or small in the grand scheme of things?

--

- Compare our test statistic to the *distribution* of similar statistics
- **IMPORTANT**: These distributions assume that *the null hypothesis is true!*

---

## The Chi-Square (*χ*<sup>2</sup>) Distribution

- Unfortunately test statistics like the one we have are **not** normally distributed
- No problem - we just have to use a different distribution!

--

- Meet the `\(\chi^2\)` distribution
  - The sum of squared normal distributions
  - See [this excellent Khan Academy explainer](https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests/chi-square-goodness-of-fit-tests/v/chi-square-distribution-introduction) for more!

<iframe id="chisq-app" class="app" src="https://and.netlify.app/viz/chisq.html" height=680px></iframe>

---

## Detour: Degrees of Freedom

- Degrees of freedom are calculated differently for different test statistics
  - Important because they determine the distribution's shape and proportions
- At base, they are the number of values that are *free to vary*

--

- Consider our dice example...
  - We know our test statistic is 1.04
  - If we know the first three values (0 + 0.64 + 0.04), the last value **must** be 0.36
  - Alternatively, if we had three random values (e.g. 0.23 + 0.54 + 0.1), the last value cannot be random: it **must** be 0.17 to add up to 1.04
- So, we have **three** degrees of freedom

---

## Step 3: Obtain the Probability *p*

- Look at the distribution for 3 degrees of freedom
- What percentage of the distribution is greater than or equal to 1.04?

<iframe id="chisq-app" class="app" src="https://and.netlify.app/viz/chisq.html" height=680px></iframe>
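This lookup can also be done directly in R with `pchisq()`; a minimal sketch using the statistic and degrees of freedom from our example:

```r
# Proportion of the chi-square distribution (df = 3) at or above our statistic of 1.04
pchisq(1.04, df = 3, lower.tail = FALSE)
# approximately 0.79
```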
---

## Interpreting the Results

- The sum of squared differences between our *expected* and *observed* counts ( `\(\chi^2\)` ) was 1.04
- For a `\(\chi^2\)` distribution with 3 degrees of freedom, this value is **extremely common** under the null hypothesis!
  - If our die **is** fair, our data are extremely likely
- To believe that the die was *not* fair, we would have needed a test statistic of ~7.8 or greater ( `\(\alpha\)` = .05)
- If only there were an easier way to do this...!

--

```r
chisq.test(dice_table$obs_count)
```

```
## 
## Chi-squared test for given probabilities
## 
## data: dice_table$obs_count
## X-squared = 1.04, df = 3, p-value = 0.7916
```

---

## Interim Summary

- The `\(\chi^2\)` test statistic quantifies how different a set of *observed* frequencies is from the *expected* frequencies
- We obtain the probability *p* of finding the test statistic we have calculated (or one even larger) using the distribution of the `\(\chi^2\)` statistic under the null hypothesis, with a given number of degrees of freedom
- Given an `\(\alpha\)` level of .05...
  - If *p* > .05, we conclude that our results are *likely* to occur under the null hypothesis, so we have no evidence against it
  - If *p* < .05, we conclude that our results are sufficiently *unlikely* under the null hypothesis that we have reason to believe it is *not* true

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## More *χ<sup>2</sup>*

- We just saw a *goodness of fit* test
  - Tests whether a sample of data came from a population with a specific distribution
- Next, let's look at a *test of association*, or *independence*
  - Are two categorical variables associated or not?

--

- For your lab reports, you will again write about the Green or Red studies
  - You can freely choose which!
- If you choose the Green study, **this is the test you will use**

---

## Quick Refresher: Variable Types

- **Continuous** data
  - Represent some measurement or score on a scale
  - Examples: ratings of romantic attraction, age in years
  - Answers the question: *how much?*

--

- **Categorical** data
  - Represent membership in a particular group or condition
  - Examples: control vs experimental group, year of uni
  - Answers the question: *which one?*

---

## *χ<sup>2</sup>* Test of Association

- This time we will have two variables, both categorical
- Data: counts of how many observations fall into each combination of categories (sketched below)
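As a rough sketch of what such data look like in R (the variables below are invented for illustration, not the study data): two categorical vectors are cross-tabulated into counts, and that table of counts is what the test works on.

```r
# Two hypothetical categorical variables, one value of each per observation
group   <- c("control", "control", "experimental", "experimental", "control", "experimental")
outcome <- c("yes", "no", "yes", "yes", "no", "no")

# Counts of observations in each combination of categories
table(group, outcome)

# With a real (larger) sample, this count table is what chisq.test() would be given
```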
---

## Sequence-Space Synaesthesia

- Spatial orientation of sequences, such as numbers, months, or days of the week

.pull-left[
<iframe width="560" height="315" src="https://www.youtube.com/embed/BWGNWgBk76k?start=1486" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
]

.pull-right[
![Drawing of a human figure in the centre of a circle made up of coloured segments, each labeled with a month of the year in order](https://images.ctfassets.net/cnu0m8re1exe/0XiQXU9wikCY4saHF2IQy/6eef3d4687e8b297d2252594a18149bf/IVelspu.jpg?w=650&h=433&fit=fill)

[Image Source](https://www.discovermagazine.com/health/the-rare-humans-who-see-time-and-have-amazing-memories)
]

---

## Sequence-Space Synaesthesia

- "Calendars" of spatial orientations of months of the year
- [Brang et al. (2011)](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1111/j.1748-6653.2011.02012.x?casa_token=xDJyLXaLC4QAAAAA%3Ak7PsUINKvtuZev5SHRKG8fG9uwq739qOI_knrdYKwIOUepYVZMwxo1YBhQS559jjAZP5YGspZXUEg5JU): Is the orientation of the calendar related to the synaesthete's handedness?
  - Orientation: months progress clockwise or anticlockwise in space
  - Handedness: left or right handed

--

.pull-left[
- Each synaesthete has one value for orientation and one value for handedness
- Data: counts of *how many synaesthetes* fall into each combination of categories
]

.pull-right[
![Drawing of a human figure in the centre of a circle made up of coloured segments, each labeled with a month of the year in order](https://images.ctfassets.net/cnu0m8re1exe/0XiQXU9wikCY4saHF2IQy/6eef3d4687e8b297d2252594a18149bf/IVelspu.jpg?w=650&h=433&fit=fill)

[Image Source](https://www.discovermagazine.com/health/the-rare-humans-who-see-time-and-have-amazing-memories)
]

---

## Let's Think About This...

- What is the **null hypothesis** in this case?
- What is the **alternative hypothesis**?
- What do you think we will find?

---

## Let's Think About This...

- **Null hypothesis**: Calendar orientation is *not* associated with synaesthete handedness
- **Alternative hypothesis**: Calendar orientation *is* associated with synaesthete handedness
- Prediction from the paper:
  - Right-handed synaesthetes will tend to have a clockwise calendar
  - Left-handed synaesthetes will tend to have an anticlockwise calendar

---

## Visualising the Data

.codePanel[
```r
ggplot(ss_tab, aes(x = handedness, y = n)) +
  geom_bar(aes(fill = orientation), stat = "identity",
           position = position_dodge(0.8), width = 0.7) +
  labs(x = "Handedness", y = "Frequency", fill = "Calendar\nOrientation") +
  scale_y_continuous(limits = c(0, 20)) +
  scale_color_manual(values = c("#009FA7", "#52006F")) +
  scale_fill_manual(values = c("#009FA7", "#52006F"),
                    labels = c("Anticlockwise", "Clockwise")) +
  scale_x_discrete(labels = c("Left", "Right"))
```

![](slides_files/figure-html/unnamed-chunk-6-1.png)
]

- Left-handed synaesthetes have more anticlockwise than clockwise calendars
- Right-handed synaesthetes have the reverse
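The test itself comes from R's `chisq.test()`. Judging by the `data:` line in the output on the next slide, the call was along these lines (a sketch assuming the `seq_space` data frame and column names shown there):

```r
# Test of association between calendar orientation and handedness;
# for a 2x2 table, chisq.test() applies Yates' continuity correction by default
chisq.test(seq_space$orientation, seq_space$handedness)
```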
---

## Test Result

Are these data different enough from the expected frequencies to believe that there may be an **association** between orientation and handedness?

```
## 
## Pearson's Chi-squared test with Yates' continuity correction
## 
## data: seq_space$orientation and seq_space$handedness
## X-squared = 9.7798, df = 1, p-value = 0.001764
```

<br><br><br>

.center[
**What can you conclude from this result?**
]

---

## Test Result

Are these data different enough from the expected frequencies to believe that there may be an **association** between orientation and handedness?

```
## 
## Pearson's Chi-squared test with Yates' continuity correction
## 
## data: seq_space$orientation and seq_space$handedness
## X-squared = 9.7798, df = 1, p-value = 0.001764
```

<br><br><br>

<font size = "+2">
.center[
"There was a significant association between calendar orientation and handedness ( `\(\chi^2\)`(1) = 9.78, *p* = .002)."
]
</font>

---

## Interpreting the Result

.codePanel[
```r
ggplot(ss_tab, aes(x = handedness, y = n)) +
  geom_bar(aes(fill = orientation), stat = "identity",
           position = position_dodge(0.8), width = 0.7) +
  labs(x = "Handedness", y = "Frequency", fill = "Calendar\nOrientation") +
  scale_y_continuous(limits = c(0, 20)) +
  scale_color_manual(values = c("#009FA7", "#52006F")) +
  scale_fill_manual(values = c("#009FA7", "#52006F"),
                    labels = c("Anticlockwise", "Clockwise")) +
  scale_x_discrete(labels = c("Left", "Right"))
```

![](slides_files/figure-html/unnamed-chunk-9-1.png)
]

- Our hypothesis is supported by the data
- The association is in the *direction* we predicted

---

## Expected Frequencies

- Are these data different enough from the **expected frequencies** to believe that there may be an association between orientation and handedness?
- We can get these easily out of R!

--

| Orientation    | Left | Right |
|:---------------|-----:|------:|
| Anti-Clockwise | 3.53 |  8.47 |
| Clockwise      | 6.47 | 15.53 |

--

- One of the **assumptions** of `\(\chi^2\)` is that all expected frequencies are greater than 5
  - Otherwise this test can give you a drastically wrong answer 😱
  - In this case, use Fisher's exact test (`fisher.test()`) instead
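As a hedged sketch of how to pull these out of R (the object name `ss_test` is a placeholder; the `seq_space` columns are those shown in the test output above):

```r
# Save the test result, then inspect the expected frequencies under the null hypothesis
ss_test <- chisq.test(seq_space$orientation, seq_space$handedness)
ss_test$expected

# If any expected frequency is below 5, Fisher's exact test is the safer choice
fisher.test(seq_space$orientation, seq_space$handedness)
```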
---

## Final Overview

- The `\(\chi^2\)` test quantifies the difference between observed and expected frequencies
- Goodness of Fit
  - Tests whether a sample of data came from a population with a specific distribution 🎲
- Test of Association/Independence
  - Tests whether two categorical variables are associated with each other 🌈
- Like with correlation, **association is not causation**

--

- For quizzes/exam:
  - You will **not** be expected to calculate `\(\chi^2\)` by hand!
  - You **will** be expected to read and interpret the output of `chisq.test()` for tests of association
  - More in the tutorial!

---

## Lab Reports

- You can choose either the red or green study to write your report on
  - See [Lab Report Information on Canvas](https://canvas.sussex.ac.uk/courses/9242/pages/Lab%20Report%20Information%20and%20Resources?titleize=0) for more
- If you choose the green study (Griskevicius et al., 2010), you must use and report the results of `\(\chi^2\)`

--

- Choose one of three products to analyse
- Report observed frequencies and `\(\chi^2\)` result
- Include a figure of the results
- Will be covered in depth in the next tutorial and practical!

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

class: last-slide

class: slide-zero
exclude: ![:live]
count: false

---