Statistical Accuracy and Sample Size

Ealier this year we studied properties of sample's and of statistics computed from them (see the overview of Phase 1 and the sequence of lessons in Phase 2). A question on which we spent quite a bit of time was, "How large must we make our random samples to have confidence that, over the long run, they accurately reflect the population parameter?"

In this most recent phase of the teaching experiment (Phase 3) we examined data as if it were the entire population. When we examined data from DUI arrest rates by state and by region, we thought of the data as if it were all there was to be collected on the subject -- as a population instead of as a sample. When you developed weighted scoring schemes for Places Rated data (to determine your "America's Best City") we did not think of these cities as a sample of all cities in the U.S. Rather, we thought of them as all the cities we would work with -- as a population.

This week we will again investigate the relationship between statistical accuracy and sample size. As a reminder, the idea of statistical accuracy involves two things -- a sampling process and variability among sample statistics. A sampling process is accurate if

when the sampling process is repeated a large number of times and a statistic is calculated for each sample, the statistics are centered around the population parameter
when the sampling process is repeated a large number of times and a statistic is calculated for each sample, variation among the sample statistics is acceptably small.

We will examine samples taken from 100,000 voters in San Diego County and compare the variation among the statistics calculated from the samples and their proximity to the population parameter.

First we will discuss the difference between different types of sampling procedures, some of which we think will produce samples that resemble their populations and some of which we are fairly certain they don't (e.g., Online Polls and Ann Landers' requests for opinions).

Second, we will examine data on 100,000 San Diego County voters (NOTE: Load this only in the computer lab -- the data file is 10.6 megabytes and may take a long time to download over the internet. Also, click here if you need to see how to configure your browser to pass Data Desk files straight on to data desk).

Third, we will take many samples of various sizes and keep track of each size's distribution, to get a better understanding of the accuracy we can expect when taking samples of that size.

Postscript: After you generated your samples and completed your worksheets we entered the data you collected into a Data Desk file and generated graphs that depicted the accuracy of the graphs. We noticed that larger sample sizes tended to be more accurate (centered on the population statistic with less variability than smaller sample sizes). But we also noticed that samples of size 3000 are hardly more accurate than samples of size 1500. So, if we had to pay for these samples, using samples of size 1500 over the long run would end up being about as accurate as using samples of size 3000 and cost a lot less to collect. Also, it could be that, if optimal accuracy doesn't matter, using smaller sample sizes might be just as useful but a lot cheaper than larger sample sizes.