class: slide-zero
exclude: ![:live]
count: false
<iframe src="https://and.netlify.app/docs/masks" height="100%" width="100%" style="border:none;position:absolute;top:0;bottom:0;left:0;right:0;"></iframe>
---
class: middle, inverse, title-slide

# Best guesses and uncertainty
## Lecture 2
### Dr Milan Valášek
### 31 January 2022

---

## Today

- Point estimates vs interval estimates

- Confidence intervals

- *t*-distribution

---

## What stats is about (yet again)

- We want to know about the world (population)

- We can only get data from samples

- We calculate statistics on samples and use them to *estimate* the values in population

- Statistics is all about *making inferences about populations based on samples*

- If we could measure the entire population, we wouldn't need stats!

---

## Point estimates

- You've heard of the sample mean, median, mode

- These are all point estimates: single numbers that are our best guesses about corresponding *population parameters*

- Measures of spread (<i>SD</i>, `\(\hat{\sigma}^2\)`, <i>etc.</i>) are also point estimates

- Even relationships between variables can be expressed using point estimates
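
---

## Point estimates

A minimal sketch of calculating a few point estimates in R. The `hours` and `scores` vectors are simulated and purely hypothetical, not data from this lecture.

```r
set.seed(1)                                    # make the simulated data reproducible
hours  <- rnorm(50, mean = 10, sd = 2)         # hypothetical predictor
scores <- 50 + 2 * hours + rnorm(50, sd = 5)   # hypothetical outcome

# Point estimates of central tendency
mean(scores)
median(scores)

# Point estimates of spread
sd(scores)
var(scores)          # sample variance, equal to sd(scores)^2

# Point estimate of a relationship between two variables
cor(hours, scores)
```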

---

## Point estimates

.pull-left[
*r* = &minus;.07

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-2-1.png)
]

.pull-right[
*r* = .752

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-3-1.png)
]

---

## Accuracy and uncertainty

- Sample mean `\(\bar{x}\)` is the best estimate `\(\hat{\mu}\)` of population mean but means of almost all samples differ from population mean `\(\mu\)`

- Same is true for *any* point estimate

- *SE* of the _mean_ expresses the uncertainty about the estimates of the population _mean_

- *SE* can be calculated for other point estimates, not just the mean

- We can quantify uncertainty around point estimates using **interval estimates**
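
---

## Accuracy and uncertainty

A rough simulation of the points above, assuming a normally distributed population with made-up parameters (`mu`, `sigma`) and a made-up sample size `n`.

```r
set.seed(2)
mu <- 100; sigma <- 15; n <- 30      # hypothetical population parameters and sample size

# Means of 5000 samples of size n: they scatter around mu but rarely equal it
sample_means <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))
range(sample_means)

# The SD of these sample means is the standard error of the mean...
sd(sample_means)
# ...and it is close to the population SD divided by sqrt(n)
sigma / sqrt(n)

# From a single sample we can only estimate it, using s in place of sigma
one_sample <- rnorm(n, mean = mu, sd = sigma)
se_hat <- sd(one_sample) / sqrt(n)
se_hat
```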

---

## Interval estimates

- In addition to estimating a single value, we can also estimate an interval around it

- <i>e.g.,</i> mean = 4.13 with an interval from &minus;0.2 to 8.46

- Interval estimates communicate the uncertainty around point estimates

- There are different kinds of interval estimates

- Important: **confidence intervals**

---

## Confidence interval

- We can use *SE* and the sampling distribution to calculate a confidence interval (CI) with a certain *coverage*, <i>e.g.,</i> 90%, 95%, 99%...

- For a 95% CI, if we took many samples and built an interval around each sample estimate, 95% of these intervals would contain the value of the population parameter

- Let’s see an example

---

## Confidence interval

- Population of circles of different sizes

<img class="orig-colors" src="/lectures_assets/02/ci_01.png" height="350px">

---

## Confidence interval

- Sample from population, estimate mean size

<img class="orig-colors" src="/lectures_assets/02/ci_02.png" height="350px">

---

## Confidence interval

- Calculate the 95% CI around the mean

<img class="orig-colors" src="/lectures_assets/02/ci_03.png" height="350px">

---

## Confidence interval

- Lather, rinse, repeat...

<img class="gif orig-colors" src="/lectures_assets/02/ci_03.png" gif="/lectures_assets/02/ci_small.gif" height="350px">

---

## Confidence interval

- ~5% don't contain population mean = 95% coverage

<img class="orig-colors" src="/lectures_assets/02/ci_04.png" height="350px">
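
---

## Confidence interval

The same idea can be sketched as a simulation. The population parameters (`mu`, `sigma`), the sample size and the number of repetitions are arbitrary choices for illustration. The intervals come from R's built-in `t.test()`.

```r
set.seed(3)
mu <- 50; sigma <- 10; n <- 30     # hypothetical population parameters and sample size
n_sims <- 5000                     # number of repeated samples

covers <- replicate(n_sims, {
  x  <- rnorm(n, mean = mu, sd = sigma)          # draw one sample
  ci <- t.test(x, conf.level = 0.95)$conf.int    # 95% CI around its mean
  ci[1] <= mu && mu <= ci[2]                     # does it contain the population mean?
})

mean(covers)   # proportion of intervals that contain mu; should be close to 0.95
```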

---

## How is it made?

- Easy if we know sampling distribution of the mean

- 95% of sampling distribution is within &plusmn;1.96 <i>SE</i>

- 95% CI around estimated population mean is mean &plusmn;1.96 <i>SE</i>
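
---

## How is it made?

A short sketch of where the 1.96 comes from. The values of `x_bar` and `se` are made-up numbers, picked so that the result resembles the example interval shown earlier.

```r
# Critical value that cuts off the middle 95% of the standard normal distribution
z_crit <- qnorm(0.975)
z_crit                          # approximately 1.96

# Proportion of a normal distribution within 1.96 SDs of its centre
pnorm(1.96) - pnorm(-1.96)      # approximately 0.95

# 95% CI when the sampling distribution (and so the SE) is known
x_bar <- 4.13                   # hypothetical sample mean
se    <- 2.21                   # hypothetical standard error
ci <- c(lower = x_bar - z_crit * se, upper = x_bar + z_crit * se)
ci
```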

---

## How is it made?

- Sampling distribution of the mean is approximately normal for large enough *N* (as per [CLT](../../01/handout/#the-central-limit-theorem))

- Middle 95% of the sample means lie within &plusmn;1.96 <i>SE</i>

- We use the same 1.96 <i>SE</i> to construct 95% CI around mean

<img class="gif orig-colors" src="/lectures_assets/02/ci_constr.png" gif="/lectures_assets/02/ci_constr01_small.gif /lectures_assets/02/ci_constr02_small.gif /lectures_assets/02/ci_constr03_small.gif" height="350px">

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## How is it made?

- Sampling distribution is, however, not known!

- It can be approximated using the *t*-distribution, based on the sample *SD* (*s*) and sample size (*N*)

---

## <i>t</i>-distribution

- Symmetrical, centred around 0
- Its shape changes based on **degrees of freedom**
- As shape changes, so do proportions (unlike with normal)
- In standard normal, middle 95% of data lie within &plusmn;1.96
- In *t*-distribution, this critical value changes based on <i>df</i>

<iframe id="tdist-app" class="app" src="/viz/tdist/" data-external="1" style="transform:none;" height="385px"></iframe>

---

## <i>t</i>-distribution

- *t*-distribution crops up in many situations

- Always has to do with **estimating sampling distribution from a finite sample**

- How we calculate number of <i>df</i> changes based on context

- Often has to do with *N*, number of estimated parameters, or both
  
  - In the case of sampling distribution of the mean, <i>df = N</i> &minus; 1
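
---

## <i>t</i>-distribution

To see how the critical value depends on <i>df</i>, here is a small sketch using `qt()`. The particular <i>df</i> values are arbitrary picks for illustration.

```r
# Critical values for the middle 95% of t-distributions with different df
dfs    <- c(2, 5, 10, 29, 100, 1000)
t_crit <- qt(0.975, df = dfs)
round(data.frame(df = dfs, t_crit = t_crit), 3)

# For comparison, the critical value of the standard normal distribution
qnorm(0.975)
```

The critical value shrinks towards 1.96 as <i>df</i> grows, so the correction matters most for small samples.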

---

## Back to CI

- 95% CI around estimated population mean is mean &plusmn;1.96 <i>SE</i> **if we know the exact shape of sampling distribution**

- We don't know the shape so we approximate it using the *t*-distribution
  
- We need to replace the 1.96 with the appropriate critical value for a given number of <i>df</i>

- For <i>N</i> = 30, <i>t</i><sub>crit</sub>(<i>df</i>=29) = 2.05

```r
qt(p = 0.975, df = 29)
```

```
## [1] 2.04523
```

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## Back to CI

- 95% CI around the mean for a sample of 30 is `\(\bar{x} \pm 2.05\times SE\)`

- `\(\widehat{SE}=\frac{s}{\sqrt{N}}\)`

- `\(95\%\ CI = Mean\pm2.05\times \frac{s}{\sqrt{N}}\)`

- To construct a 95% CI around our estimated mean, all we need is
  - Estimated mean (<i>i.e.</i> sample mean, because `\(\hat{\mu}=\bar{x}\)`)
  - Sample *SD* (*s*)
  - *N*
  - Critical value for a *t*-distribution with <i>N</i> &minus; 1 <i>df</i>
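
---

## Back to CI

Putting the ingredients together, a minimal sketch on a simulated sample. The data in `x` are hypothetical. The interval reported by `t.test()` is shown only as a cross-check.

```r
set.seed(4)
x <- rnorm(30, mean = 100, sd = 15)   # hypothetical sample of N = 30

n      <- length(x)
x_bar  <- mean(x)                     # estimated mean
s      <- sd(x)                       # sample SD
se_hat <- s / sqrt(n)                 # estimated standard error
t_crit <- qt(0.975, df = n - 1)       # critical value, about 2.05 for df = 29

ci_manual <- c(x_bar - t_crit * se_hat, x_bar + t_crit * se_hat)
ci_manual

# The built-in one-sample t-test reports the same interval
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
ci_builtin
```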

---

## CIs are useful

- Width of the interval tells us about how much we can expect the mean of a different sample of the same size to vary from the one we got

- There's an x% chance that any given x% CI contains the true population mean

- **CAVEAT:** That's not the same as saying that there's an x% chance that the population mean lies within our x% CI! Once our interval is calculated, it either contains the population mean or it doesn't

- CIs can be calculated for *any point estimate*, not just the mean!

---

## Remember this?

.pull-left[
*r* = &minus;.07

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-5-1.png)
]

.pull-right[
*r* = .752

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-6-1.png)
]

---

## Remember this?

.pull-left[
*r* = &minus;.07; 95% CI [&minus;.263, .128]

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-8-1.png)
]

.pull-right[
*r* = .752; 95% CI [.652, .827]

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-9-1.png)
]
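
---

## Remember this?

One way to get a CI around a correlation in R is `cor.test()`. The sketch below uses simulated, hypothetical data and is not necessarily how the intervals on the previous slide were produced.

```r
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 0.75 * x + rnorm(n, sd = sqrt(1 - 0.75^2))   # hypothetical data, population r = .75

res <- cor.test(x, y, conf.level = 0.95)   # Pearson correlation by default
res$estimate                               # point estimate of r
res$conf.int                               # 95% CI around r
```

For Pearson's *r*, `cor.test()` bases the interval on Fisher's *z* transformation, which is why it is not symmetrical around the point estimate.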

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---
 
## Take-home message

- Our aim is to *estimate unknown population characteristics* based on samples
- *Point estimate* is the best guess about a given population characteristic (parameter)
- Estimation is inherently *uncertain*
  - We cannot say with 100% certainty that our estimate is truly equal to the population parameter
- *Confidence intervals* express this uncertainty
  - The wider they are, the more uncertainty there is
  - They have arbitrary *coverage* (often 50%, 90%, 95%, 99%)
- CIs are constructed using the *sampling distribution*
  - True sampling distribution is unknown, but we can approximate it using the *t*-distribution with given *degrees of freedom*
- CIs can be constructed for *any point estimate*
- For a 95% CI, there is a 95% chance that any given CI contains the true population parameter

---

class: last-slide weekend
background-image: url("/lectures_assets/end.jpg")
background-size: cover
