Wednesday, April 10, 2013

A Consumer's Guide to Statistics: Sampling Bias

Over the next while - probably a long while, since I have yet to even think of all the topics I'm going to cover - I'm going to write a series of articles aimed at helping normal people understand statistics and, more importantly, when they're not to be trusted.

According to my studies, statisticians are not normal people, so this will not be aimed at them. Results are inconclusive as to whether or not they are people at all, however, so I will still try to not offend their sensibilities too much.

What's A Sample?

As you might have guessed from the title, this article is all about samples. So, what's a sample?

Formally, the sample is the subset of the population from which you have gathered your data for a given statistical experiment. Sometimes the sample is the population, but this is usually not the case, mostly for silly reasons like 'expenses' and 'logistics.'


That definition is probably not very useful to you right off the bat, so we'll break it down with an example so we can all understand what it is we're talking about.

Say you want to construct an experiment to determine how much the average college student spends on food in a given week.

Here your population is 'all college students.' This is the total group from which you could collect data for your experiment.

Now, let's say you hand out a survey to collect data. Everyone who takes your survey is part of your sample - the total group from which you have actually collected data for your experiment.

Essentially it boils down to a population being the group of all things that you want to know something about, while the sample is a subset of that group from which you have gathered some data.
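The distinction can be sketched in a few lines of Python. The college, the spending numbers, and the sample size here are all invented purely for illustration:

```python
import random

# Hypothetical population: weekly food spending (in dollars) for
# every student at an imaginary 10,000-student college.
random.seed(0)
population = [random.gauss(75, 20) for _ in range(10_000)]

# A sample: a randomly chosen subset we actually collect data from.
sample = random.sample(population, 100)

print(len(population))  # 10000 - everyone we want to know about
print(len(sample))      # 100   - everyone we actually surveyed
```

Every member of the sample comes from the population, but the sample is a far smaller group.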

Sampling Error

So you have your question from above, you've handed out some surveys, gotten back your responses, tallied your results, performed some fancy black magic, and, quite suddenly, have a number that represents exactly what the average college student spends on food each week!

Right? I mean, it has to be. You did some statistics on it, you did them all correctly, and you have data, so your number has to be valid!

Unfortunately, statistics isn't quite so simple. Because you couldn't possibly have handed out your survey to every college student - or even every college student in America - you have some amount of sampling error.

Sampling error occurs to some degree whenever you have a sample that is not the entirety of your population.

Think of it like this - you have some object you want to see, but you can only look at certain parts of the object. If you could look at all of it, it would be quite clear what the object is, but since you can only look at parts of it, you are uncertain as to the true nature of the object, and the less of it you can see, the more uncertain you become.

Unfortunately, there are very few instances where no sampling error occurs, as there are very few interesting statistical problems where one could feasibly collect information on the entirety of the relevant population.

For instance, in the above problem you would have to not only get your survey to every college student, you would also have to ensure responses, which can sometimes run you more than $100 per survey in incentives.
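A quick simulation makes the idea concrete. The population and its spending figures below are made up, but the pattern is general: the sample mean tends to stray further from the true population mean as the sample shrinks.

```python
import random
import statistics

random.seed(42)

# Hypothetical population of 10,000 students' weekly food spending.
population = [random.gauss(75, 20) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Sampling error: the gap between a sample's mean and the population
# mean. Smaller samples tend to produce larger gaps.
errors = {}
for n in (10, 100, 1000):
    sample = random.sample(population, n)
    errors[n] = abs(statistics.mean(sample) - true_mean)
    print(f"n={n:5d}  sample mean is off by {errors[n]:.2f}")
```

Run it a few times with different seeds and the exact numbers change, but the tiny samples are consistently the least trustworthy.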

Sampling Bias

But wait, there's more!

Sampling error is largely intrinsic, and is difficult to eliminate even in well-designed experiments. It is so expected that much of modern statistics is built around calculating and mitigating it.

Sampling bias, on the other hand, is a result of poor experimental design, and can cause some quite massive shifts in your results, even if your sample is a very large portion of the population.

In order for a sample to not be biased, the selection of the sample must be entirely random, and the random selection process must include the entire population. In the above example, if you hand out surveys only to college students at a particular school and treat the sample as representative of the population of all college students, you will be sorely disappointed at the accuracy of your results.

Here are some good examples of common sampling bias problems with surveys...

If you send out a survey but provide no incentive to complete it, you are often causing sampling bias - those who are truly interested in the survey's topic, or in the actions that might be taken as a result, are more likely to complete it.

If you let individuals choose whether or not they are part of the sample, you have a similar problem - those who are truly interested are more likely to partake.

If you open participation to only a subset of the population, you again have a similar problem.

There are, obviously, cases where these problems are largely mitigated because of what you want out of the survey - advertising research is a great example of this, as you are mostly concerned with people who are interested in your product - but these and many similar experimental design flaws cause many of the problems with 'poor' statistics that you see every day.

My Favorite Example

You're watching television. Say Fox, or NBC. It's election season. A poll result comes up, and you glance over it. On Fox, the Republican presidential candidate is in the lead, while NBC has the Democratic candidate for president ahead by a large margin.

Do you trust these statistics?

The answer? Probably not. Not even a little bit. They may be right by coincidence, but polls put up on television - especially on larger stations - tend to be polls collected by taking calls from viewers, not from attempting to make or find a representative sample.

Since Fox is, largely, watched by conservatives, you'd expect its callers to be more in favor of the Republican candidate. Similarly, NBC's viewers are mostly Democrats, and they would be expected to favor the Democratic candidate.

Move over to, say, a Gallup poll, and you see a similar but much closer result - the Democratic candidate is in the lead by a small margin.

Gallup polls - and many other political polls - have repeatedly been shown to be very, very accurate. How do they manage it?

Well, they take a representative sample. While they may not take a completely random sample, they have, over the years, determined what the composition of the sample for such a poll needs to be so that they can accurately predict the results of an election.
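The basic reweighting idea can be sketched in a few lines. The respondent counts and the 50/50 electorate split below are invented, and this is only a toy version of the technique (statisticians call it post-stratification), not Gallup's actual methodology:

```python
import statistics

# Hypothetical call-in poll: Democrats phoned in far more often.
# support = 1 means the respondent backs the Democratic candidate.
dem_callers = [1] * 720 + [0] * 80   # 800 Democratic respondents, 90% support
rep_callers = [1] * 40 + [0] * 160   # 200 Republican respondents, 20% support

# The raw result is skewed by who happened to call in.
raw = dem_callers + rep_callers
print(f"raw support:      {statistics.mean(raw):.0%}")  # 76%

# Reweight each group to its (assumed) 50/50 share of the electorate.
weighted = 0.5 * statistics.mean(dem_callers) + 0.5 * statistics.mean(rep_callers)
print(f"weighted support: {weighted:.0%}")              # 55%
```

Each group's opinion is measured separately, then recombined in the proportions the population is known to have, which undoes much of the bias in who responded.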

Final Thoughts

If you look at some statistics, one of your very first thoughts should be "What sort of sample did they take?" Once you can determine if the sample is even remotely representative, you can decide what degree of accuracy you can give to the numbers you're being shown.

Alternatively, if you're performing some black magic of your own, don't take a small sample and attempt to make it representative of a large population unless you're very certain of yourself. Taking our example from above, an easier population to deal with would've been 'students at the local university.'

And now I'm done. Hopefully you all learned a little something, and if you didn't, I can't be certain that you're a normal person. If any of you have ideas or suggestions as to what other things I should be writing about in this series, I'd love to hear them and, as always, you can follow over to the right and put your comments down below.

EDIT: Second part of this series is over here.
