Polling and Sampling

In our discussion of random variables, we started with games of chance because we can easily think about all possible outcomes and the probabilities associated with them. We looked at flipping a coin, rolling fair dice, and playing roulette.

Now, we are going to see how random variables relate to gathering information about large populations from small samples. Drawing conclusions about a population from a sample is known as inference.

In general, we take a sample to find out about a larger population. We usually don't have the resources to gather information on everyone in the population, so we select a small sample and use it to make inferences about the whole.

Below are some definitions that will be useful when thinking about sampling:

Population: the whole class of individuals about whom the investigator wants to generalize.
Sample: the part of the population the investigator examines.
Inferences: generalizations about the population that come from the sample.
Parameters: numerical facts about the population.
Statistics: estimates of the parameters computed from the sample.
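
To make the vocabulary concrete, here is a minimal Python sketch. The population (100,000 people, 55% of whom answer "yes" to some question) and the sample size of 1,000 are made-up numbers chosen for illustration, not data from a real poll.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Population: a hypothetical 100,000 people; 1 = "yes", 0 = "no".
population = [1] * 55_000 + [0] * 45_000

# Parameter: a numerical fact about the population (the true share of "yes").
parameter = sum(population) / len(population)

# Sample: the part of the population we actually examine.
sample = rng.sample(population, 1_000)

# Statistic: the estimate of the parameter computed from the sample.
statistic = sum(sample) / len(sample)

print(f"parameter (true share): {parameter:.3f}")
print(f"statistic (estimate):   {statistic:.3f}")
```

In a real poll the parameter is unknown; the sketch can print it only because the population here is invented. The inference is the step of treating the statistic as our best guess at the parameter.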

With sampling and polling, here's the Main Idea:

The more closely the sample resembles the population, the more accurate our inferences will be.

There are many ways to select a sample of people to participate in your poll or survey. Consider the following methods of sample selection.

The researcher can hand-pick the sample to resemble the population on all relevant characteristics. This method is prone to bias because the researcher decides whom to survey, and people have biases they are not even aware of. A hand-picked sample can also match only the characteristics the researcher thought of, so this method is not the best.
The researcher can post the survey publicly (usually online) and allow anyone to respond. This is not a good method because the respondents select themselves. People who choose to respond generally have a reason for doing so, often a strong opinion on the topic, and those reasons can skew the results.
The researcher can randomly select the sample so that everyone in the population has an equal chance of being chosen. Random sampling (also known as probability sampling) is by far the best way to get a sample that resembles the population as closely as possible.
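
The difference between the second and third methods can be sketched with a small simulation. Everything here is hypothetical: the population of 50,000 people, the 40% who favor a proposal, and the assumed response rates (10% for those in favor, 2% for everyone else) are invented to illustrate self-selection, not measured values.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: True = favors the proposal (40% of 50,000 people).
population = [i < 20_000 for i in range(50_000)]

def share(sample):
    """Fraction of a sample that favors the proposal."""
    return sum(sample) / len(sample)

def simple_random_sample(pop, n, rng):
    """Every person has an equal chance of being chosen."""
    return rng.sample(pop, n)

def voluntary_response_sample(pop, rng, p_favor=0.10, p_other=0.02):
    """Open survey: each person decides whether to respond.
    Assumption: people who favor the proposal respond more often."""
    return [x for x in pop if rng.random() < (p_favor if x else p_other)]

random_estimate = share(simple_random_sample(population, 2_000, rng))
voluntary_estimate = share(voluntary_response_sample(population, rng))

print(f"true share:               {share(population):.2f}")
print(f"random sample estimate:   {random_estimate:.2f}")
print(f"voluntary sample estimate: {voluntary_estimate:.2f}")
```

Under these assumptions the random sample should land near the true share, while the open survey overstates support because supporters are over-represented among the people who chose to respond.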

Random Sampling is Best!

Just as randomized controls are best for experiments, random samples are best for surveys. Essentially, we want our sample statistics to be as close to the population parameters as possible. That way, our inferences will be as accurate as possible. However, statistics is never having to say you're certain! Inferences are not 100% accurate because of chance error. We can reduce error and bias (but not completely eliminate them) by doing certain things. For example...

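The more people we sample, the better! With a random sample, a larger sample size means the sample statistics tend to land closer to the population parameters. This reduces chance error, though it does nothing to fix a biased selection method.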
We can make sure to randomly draw our sample from the population that we are interested in. This reduces bias!
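
The effect of sample size on chance error can be sketched by repeating a simulated poll many times and measuring how much the estimates vary. The true share of 50%, the sample sizes, and the 200 repetitions are arbitrary choices for illustration, and each respondent is drawn independently, which assumes a population much larger than the sample.

```python
import random
import statistics

def poll(n, true_share, rng):
    """Share of 'yes' answers among n independently drawn respondents."""
    return sum(rng.random() < true_share for _ in range(n)) / n

def spread_of_estimates(n, true_share, rng, repeats=200):
    """Standard deviation of the estimate across repeated polls of size n."""
    estimates = [poll(n, true_share, rng) for _ in range(repeats)]
    return statistics.pstdev(estimates)

rng = random.Random(2)  # fixed seed so the sketch is repeatable
spreads = {n: spread_of_estimates(n, 0.5, rng) for n in (100, 400, 1600)}

for n, spread in spreads.items():
    print(f"sample size {n:>5}: typical chance error about {spread:.3f}")
```

The spread should shrink as the sample grows, roughly halving each time the sample size is quadrupled. Note that every poll here is a random sample; a larger sample drawn by a biased method would stay biased.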

Note that even though random sampling eliminates selection bias, it does NOT eliminate all types of bias. For instance, it does not eliminate response bias, which arises when people answer inaccurately (for example, because of how a question is worded)!

Summary

Random selection is the method most likely to make the sample resemble the population because it eliminates selection bias. With enough subjects, random differences average out, not only on the characteristics that the researcher has identified as relevant but on all characteristics, including hidden ones that the researcher might not realize are important.
