class: center, middle, inverse, title-slide # Intro to Data ## DATA 606 - Statistics & Probability for Data Analytics ### Jason Bryer, Ph.D. ### February 3, 2021 --- # Announcements * DAACS * I have not sent emails yet to get access to your DAACS results. If you completed it, your are good. * The feedback is on the DAACS website and is tailored to your responses. It is worth taking a few minutes to see what your strengts and areas of improvement are. * Lab 1 and homework 1 is due Sunday. * Labs - When submitting the labs, you can submit just add your answers to the existing document (i.e. you can leave all the text there so you have all the content together). * Link to RStudio cheat sheets: https://rstudio.com/resources/cheatsheets/ --- # Familiarity with Statistical Topics <img src="images/hex/likert.png" class="title-hex"><img src="images/hex/googlesheets4.png" class="title-hex"> ```r likert(stats.results) %>% plot(center = 2.5) ``` <img src="01-Intro_to_Data_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # Math Anxiety Survey Scale <img src="images/hex/likert.png" class="title-hex"><img src="images/hex/googlesheets4.png" class="title-hex"> ```r likert(mass.results) %>% plot() ``` <img src="01-Intro_to_Data_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- # Sampling vs. Census A census involves collecting data for the entire population of interest. This is problematic for several reasons, including: * It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And these difficult-to-find people may have certain characteristics that distinguish them from the rest of the population. * Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure. * Taking a census may be more complex than sampling. Sampling involves measuring a subset of the population of interest, usually randomly. --- # Sampling Bias * **Non-response**: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population. * **Voluntary response**: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population. * **Convenience sample**: Individuals who are easily accessible are more likely to be included in the sample. --- # Simple Random Sampling Randomly select cases from the population, where there is no implied connection between the points that are selected. .center[![Simple Random Sample](images/srs.png)] --- # Stratified Sampling *Strata* are made up of similar observations. We take a simple random sample from each stratum. .center[![](images/stratified.png)] --- # Cluster Sampling *Clusters* are usually not made up of homogeneous observations so we take random samples from random samples of clusters. .center[![](images/cluster.png)] --- # Observational Studies vs. Experiments * **Observational study**: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely “observe”, and can only establish an association between the explanatory and response variables. * **Experiment**: Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables. <center><img src='images/correlation.png' alt='Correlation'><br /><font size='-2'>Source: [XKCD 552 http://xkcd.com/552/](http://xkcd.com/552/)</font></center> <br /> <center><b><font size="+4">Correlation does not imply causation!</font></b></center> --- # Principles of experimental design 1. **Control**: Compare treatment of interest to a control group. 2. **Randomize**: Randomly assign subjects to treatments, and randomly sample from the population whenever possible. 3. **Replicate**: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study. 4. **Block**: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups. Difference between blocking and explanatory variables * Factors are conditions we can impose on the experimental units. * Blocking variables are characteristics that the experimental units come with, that we would like to control for. * Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling. --- # More experimental design terminology... * **Placebo**: fake treatment, often used as the control group for medical studies * **Placebo effect**: experimental units showing improvement simply because they believe they are receiving a special treatment * **Blinding**: when experimental units do not know whether they are in the control or treatment group * **Double-blind**: when both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group --- # Random assignment vs. random sampling .center[ <img src='images/random_sample_assignment.png' width='900'> ] --- # Causality .center[ <img src='images/Causality.png' width='900'> ] --- # Randomized Control Trials .pull-left[ <img src="01-Intro_to_Data_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ] --- # Randomized Control Trials .pull-left[ <img src="01-Intro_to_Data_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="01-Intro_to_Data_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] --- # What if we too a lot of random samples? .code80[ ```r mean_differences <- numeric(500) for(i in 1:length(mean_differences)) { thedata$RCT_Assignment <- sample(c('placebo', 'treatment'), nrow(thedata), replace = TRUE) thedata$RCT_Value <- as.numeric(apply(thedata, 1, FUN = function(x) { return(x[x['RCT_Assignment']]) })) tab.out <- describeBy(thedata$RCT_Value, group = thedata$RCT_Assignment, mat = TRUE, skew = FALSE) mean_differences[i] <- diff(tab.out$mean) } ggplot() + geom_histogram(aes(x = mean_differences), bins = 20, fill = 'grey70') + geom_vline(xintercept = mean(mean_differences), color = 'red', alpha = 0.5) + geom_vline(xintercept = pop.sd * pop.es, color = 'blue', alpha = 0.5) ``` <img src="01-Intro_to_Data_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> ] --- class: middle, left # One Minute Paper Complete the one minute paper: https://forms.gle/gY9SeBCPggHEtZYw6 1. What was the most important thing you learned during this class? 2. What important question remains unanswered for you?