This is the 2022 version of the Analysing Data website.
class: middle, inverse, title-slide

# Best guesses and uncertainty
### Dr Milan Valášek
### 31 January 2022

---

## Today

- Point estimates vs interval estimates
- Confidence intervals
- *t*-distribution

---

## What stats is about (yet again)

- We want to know about the world (population)
- We can only get data from samples
- We calculate statistics on samples and use them to *estimate* the values in population
- Statistics is all about *making inferences about populations based on samples*
- If we could measure the entire population, we wouldn't need stats!

---

## Point estimates

- You've heard of the sample mean, median, mode
- These are all point estimates - single numbers that are our best guesses about corresponding *population parameters*
- Measures of spread (<i>SD</i>, `\(s^2\)`, <i>etc.</i>) are also point estimates
- Even relationships between variables can be expressed using point estimates

---

## Point estimates

.pull-left[
*r* = −.07

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

.pull-right[
*r* = .752

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

---

## Accuracy and uncertainty

- Sample mean `\(\bar{x}\)` is the best estimate `\(\hat{\mu}\)` of population mean, but means of almost all samples differ from population mean `\(\mu\)`
- Same is true for *any* point estimate
- *SE* of the _mean_ expresses the uncertainty about the estimates of the population _mean_
- *SE* can be calculated for other point estimates, not just the mean
- We can quantify uncertainty around point estimates using **interval estimates**

---

## Interval estimates

- In addition to estimating a single value, we can also estimate an interval around it
  - <i>e.g.,</i> mean = 4.13 with an interval from −0.2 to 8.46
- Interval estimates communicate the uncertainty around point estimates
- There are different kinds of interval estimates
  - Important: **confidence intervals**

---

## Confidence interval

- We can
use *SE* and the sampling distribution to calculate a confidence interval (CI) with a certain *coverage*, <i>e.g.,</i> 90%, 95%, 99%...
- For a 95% CI, 95% of these intervals around sample estimates will contain the value of the population parameter
- Let’s see an example

---

## Confidence interval

- Population of circles of different sizes

<img class="orig-colors" src="/lectures_assets/02/ci_01.png" height="350px">

---

## Confidence interval

- Sample from population, estimate mean size

<img class="orig-colors" src="/lectures_assets/02/ci_02.png" height="350px">

---

## Confidence interval

- Calculate the 95% CI around the mean

<img class="orig-colors" src="/lectures_assets/02/ci_03.png" height="350px">

---

## Confidence interval

- Lather, rinse, repeat...

<img class="gif orig-colors" src="/lectures_assets/02/ci_03.png" gif="/lectures_assets/02/ci_small.gif" height="350px">

---

## Confidence interval

- ~5% don't contain population mean = 95% coverage

<img class="orig-colors" src="/lectures_assets/02/ci_04.png" height="350px">

---

## How is it made?

- Easy if we know sampling distribution of the mean
- 95% of sampling distribution is within ±1.96 <i>SE</i>
- 95% CI around estimated population mean is mean ±1.96 <i>SE</i>

---

## How is it made?

- Sampling distribution of the mean is normal (as per [CLT](../../01/handout/#the-central-limit-theorem))
- Middle 95% of the sample means lie within ±1.96 <i>SE</i>
- We use the same 1.96 <i>SE</i> to construct 95% CI around mean

<img class="gif orig-colors" src="/lectures_assets/02/ci_constr.png" gif="/lectures_assets/02/ci_constr01_small.gif /lectures_assets/02/ci_constr02_small.gif /lectures_assets/02/ci_constr03_small.gif" height="350px">

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## How is it made?

- Sampling distribution is, however, not known!
- It can be approximated using the *t*-distribution and *s* and *N*

---

## <i>t</i>-distribution

- Symmetrical, centred around 0
- Its shape changes based on **degrees of freedom**
- As shape changes, so do proportions (unlike with normal)
- In standard normal, middle 95% of data lie within ±1.96
- In *t*-distribution, this critical value changes based on <i>df</i>

<iframe id="tdist-app" class="app" src="/viz/tdist/" data-external="1" style="transform:none;" height="385px">

---

## <i>t</i>-distribution

- *t*-distribution crops up in many situations
- Always has to do with **estimating sampling distribution from a finite sample**
- How we calculate number of <i>df</i> changes based on context
- Often has to do with *N*, number of estimated parameters, or both
- In the case of sampling distribution of the mean, <i>df = N</i> − 1

---

## Back to CI

- 95% CI around estimated population mean is mean ±1.96 <i>SE</i> **if we know the exact shape of sampling distribution**
- We don't know the shape so we approximate it using the *t*-distribution
- We need to replace the 1.96 with the appropriate critical value for a given number of <i>df</i>
- For <i>N</i> = 30, <i>t</i><sub>crit</sub>(<i>df</i>=29) = 2.05

```r
qt(p = 0.975, df = 29)
```

```
## [1] 2.04523
```

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## Back to CI

- 95% CI around the mean for a sample of 30 is `\(\bar{x} \pm 2.05\times SE\)`
- `\(\widehat{SE}=\frac{s}{\sqrt{N}}\)`
- `\(95\%\ CI = Mean\pm2.05\times \frac{s}{\sqrt{N}}\)`
- To construct a 95% CI around our estimated mean, all we need is
  - Estimated mean (<i>i.e.</i> sample mean, because `\(\hat{\mu}=\bar{x}\)`)
  - Sample *SD* (*s*)
  - *N*
  - Critical value for a *t*-distribution with <i>N</i> − 1 <i>df</i>

---

## CIs are useful

- Width of the interval tells us about how much we can expect the mean of a different sample
of the same size to vary from the one we got
- There's an x% chance that any given x% CI contains the true population mean
  - **CAVEAT:** That's not the same as saying that there's an x% chance that the population mean lies within our x% CI!
- CIs can be calculated for *any point estimate*, not just the mean!

---

## Remember this?

.pull-left[
*r* = −.07

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
]

.pull-right[
*r* = .752

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---

## Remember this?

.pull-left[
*r* = −.07; 95% CI [−.263, .128]

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

.pull-right[
*r* = .752; 95% CI [.652, .827]

![](data:image/png;base64,#slides_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

---

exclude: ![:live]

.pollEv[
<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe>
]

---

## Take-home message

- Our aim is to *estimate unknown population characteristics* based on samples
- *Point estimate* is the best guess about a given population characteristic (parameter)
- Estimation is inherently *uncertain*
  - We cannot say with 100% certainty that our estimate is truly equal to the population parameter
- *Confidence intervals* express this uncertainty
  - The wider they are, the more uncertainty there is
  - They have arbitrary *coverage* (often 50%, 90%, 95%, 99%)
- CIs are constructed using the *sampling distribution*
  - True sampling distribution is unknown, but we can approximate it using the *t*-distribution with given *degrees of freedom*
- CIs can be constructed for *any point estimate*
- For a 95% CI, there is a 95% chance that any given CI contains the true population parameter

---

class: last-slide weekend
background-image: url("/lectures_assets/end.jpg")
background-size: cover
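---

## Sketch: building a 95% CI in R

The "Back to CI" recipe (sample mean, *s*, *N*, and a critical *t* value with <i>N</i> − 1 <i>df</i>) can be sketched in a few lines. This is only an illustration: `t_ci()` is a hypothetical helper, not a function from the lecture or from base R, and the sample is simulated with made-up population values.

```r
# Hypothetical helper: t-based confidence interval around a sample mean
t_ci <- function(x, level = 0.95) {
  n      <- length(x)
  m      <- mean(x)                  # point estimate of the population mean
  se     <- sd(x) / sqrt(n)          # estimated SE = s / sqrt(N)
  # critical value for the middle `level` of a t-distribution with N - 1 df
  t_crit <- qt(p = 1 - (1 - level) / 2, df = n - 1)
  c(lower = m - t_crit * se, mean = m, upper = m + t_crit * se)
}

set.seed(1)                          # arbitrary seed, for reproducibility only
x <- rnorm(30, mean = 100, sd = 15)  # simulated sample, N = 30 (made-up values)

t_ci(x)                # 95% CI built by hand
t.test(x)$conf.int     # base R's one-sample t-test reports the same interval
```

Changing `level` (for example `t_ci(x, level = 0.99)`) swaps in a different critical value, so the interval gets wider as the coverage goes up.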
---
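## Sketch: checking coverage by simulation

The "lather, rinse, repeat" idea can be tried out directly: draw many samples from a population whose mean we know, build a 95% CI around each sample mean, and count how many intervals contain the population mean. The population values, sample size, and number of repetitions below are made up for illustration.

```r
set.seed(2022)    # arbitrary seed, for reproducibility only
mu     <- 50      # made-up population mean
sigma  <- 10      # made-up population SD
n      <- 30      # size of each sample
n_sims <- 10000   # number of repeated samples

# For each simulated sample, record whether its 95% CI contains mu
covered <- replicate(n_sims, {
  x      <- rnorm(n, mean = mu, sd = sigma)
  se     <- sd(x) / sqrt(n)
  t_crit <- qt(p = 0.975, df = n - 1)
  lower  <- mean(x) - t_crit * se
  upper  <- mean(x) + t_crit * se
  lower <= mu && mu <= upper
})

mean(covered)     # proportion of intervals that contain mu
```

The proportion should land close to .95, and roughly 5% of the intervals miss the population mean. Any single interval either contains the population mean or it doesn't; the 95% describes the procedure over repeated samples.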