by Lynne B. Hare
One day during my early years as a statistician in the food industry, I glanced up to see Ralph in my office doorway. It was obvious his boss had sent him. Statisticians can always (well, given the nature of uncertainty, I should say almost always) tell when someone has been sent: The ears are down and the tail is between the legs. Avoiding eye contact, Ralph asked how many samples I usually took for tomato sauce.
Nothing in my statistical education prepared me for that question. I must admit I sat there, mouth agape, for what must have seemed like eternity. Finally, I asked him what he wanted to know about his tomato sauce. He explained that the production department had a warehouse full of tomato sauce and his staff thought some of it might contain accidental inclusions. I thought, what are the chances I have eaten an accidental inclusion? But instead I asked what he meant. He said it was possible some onions got into the tomatoes.
Somewhat relieved, I thought this situation would give the marketing department some ideas about product line extensions. But Ralph probably wanted to know how many samples he had to take to learn how bad the problem was. He probably also wanted to know how much onion was in the cans that contained onions. There were other things he might have wanted to know, such as how he should sample to isolate product suspected to contain onions, but let's stick to the first two questions.
Percentage With Onions
Here's our situation: What percentage of the tomato sauce cans have onion in them? How many samples should Ralph take to find out? It depends. Dang! It depends on how precisely Ralph wants to estimate the percentage. Ralph might say he wants to know exactly. We know better: There's no such thing as exactly in sampling. Unless Ralph wants to open all the cans, he'll never know exactly.
If you read your favorite statistics textbook, you will learn you might be safe assuming a binomial distribution for an onion, no-onion situation. Further, you will learn the normal distribution is a good approximation to the binomial distribution for reasonably large sample sizes, say 50 and above. An approximate 95% confidence interval for the proportion, p, of cans with onion in them is
$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
where n is the number of samples taken. Because n is in the denominator, Ralph can make the width of the confidence interval as small as he wants. He should recognize, however, decreased widths are increasingly expensive in cost and effort.
This expression generalizes to
$$p \pm Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
where $Z_{\alpha/2}$ is the standard normal deviate corresponding to $\alpha/2$. In this usage, $\alpha$ is the two-tailed probability that the confidence interval fails to cover the true proportion of cans with onions. If Ralph wants higher confidence, $\alpha$ decreases; if he wants lower confidence, it increases. Table 1 lists some Z values that correspond to popular confidence figures. Everybody's favorite seems to be 95% confidence. Some people don't care what they say as long as they can be 95% confident of it.
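For readers who would rather compute a Z value than look it up, here is a minimal Python sketch using only the standard library. The function name `z_for_confidence` and the three confidence levels are illustrative choices, not necessarily the ones listed in Table 1.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-tailed standard normal deviate for a given confidence level.

    `confidence` is a fraction such as 0.95; alpha is the probability
    that the interval misses, split equally between the two tails.
    """
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Illustrative levels only (an assumption, not a copy of Table 1).
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_for_confidence(level):.3f}")
```

For 95% confidence this gives 1.960 to three decimals, which is the 1.96 used in the interval above.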
But back to Ralph's question. He just wants to know how many samples he has to take. If he has no idea about the proportion of cans containing onions, he should guess a value of 0.5 for p. This is because the expression under the radical is at its highest numerical value when p = 0.5. Therefore, the uncertainty is at its highest. Then, if he wants the half-width, W, of the 95% confidence interval to be 1%, or 0.01, he should set
$$0.01 = 1.96\sqrt{\frac{0.5(1-0.5)}{n}}$$
The half-width is a good idea. It is used so people like Ralph can say, "Yeah, we know it's plus or minus 1%," with a great deal of authority.
Solving for n and using a generalized form, we have
$$n = \frac{Z_{\alpha/2}^{2}\,p(1-p)}{W^{2}}$$
Ralph tumbles these numbers and is shocked to discover he would have to take 9,604 samples of tomato sauce. He has an accusing look on his face as if this were somehow part of a statistical plot to get people to do more work. It is strange how objectives can quickly change when sample sizes get large. Ralph wonders what I can do to help reduce this unrealistically high result. I explain none of this is my fault, but I'll be happy to discuss some trade-offs. Well, that's better.
Is it likely the proportion of cans with onions in them is as high as 0.5, or 50%? No. What is the highest you think it might be? Well, it couldn't be higher than 0.1 or 10%. OK, in that case we have
$$n = \frac{1.96^{2}(0.1)(1-0.1)}{0.01^{2}} = 3{,}457.44$$
rounded up to 3,458 because it is hard to look at a fraction of a can. Ralph breathed a sigh of relief. It is still a lot of work, but at least it's not as bad as 9,604.
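Ralph's two sample sizes can be checked with a few lines of Python. This is a minimal sketch under the normal approximation described above. The function name `sample_size_for_proportion` and the small rounding tolerance are assumptions made for illustration.

```python
import math


def sample_size_for_proportion(p, half_width, z=1.96):
    """Smallest whole n with z * sqrt(p * (1 - p) / n) <= half_width."""
    n_exact = (z ** 2) * p * (1.0 - p) / half_width ** 2
    # Subtract a tiny tolerance so floating-point noise on a result that
    # is exactly a whole number does not push the ceiling up by one can.
    return math.ceil(n_exact - 1e-9)


# Worst case: no idea about p, so guess 0.5.
print(sample_size_for_proportion(0.5, 0.01))
# Ralph's belief that p cannot exceed 0.1.
print(sample_size_for_proportion(0.1, 0.01))
```

The two calls reproduce the 9,604 and 3,458 cans discussed above. Loosening the half-width from 0.01 to 0.02 cuts either figure to roughly a quarter, which is the kind of trade-off worth discussing with Ralph.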
Not to rain on his parade, but I wonder how he knows it can't be higher than 10%. Sometimes the things we know aren't so.
Percentage of Onions in Cans Containing Them
Suppose Ralph's question is different. Suppose he has already isolated a large collection of cans that contain onions and he wants to estimate the percentage of onions, by weight, in the cans that actually contain onions. How many cans should the technician weigh? Again, it depends. Dang, again!
For large sample sizes and assuming onion amounts follow a normal, or bell-shaped, distribution, the confidence interval about the true mean, $\mu$, estimated by the sample mean, $\bar{x}$, is
$$\bar{x} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}}$$
where s is the standard deviation of onion amounts taken over all the cans in the sample, n is the number of cans sampled and $Z_{\alpha/2}$ is the constant corresponding to the desired confidence level as shown in Table 1 (p. 73). The sample standard deviation, s, is actually an estimate of the true standard deviation, $\sigma$. When sample sizes are small, the Student's t deviate, corresponding to n - 1 degrees of freedom, is used in place of Z. Again, consult your favorite statistics text for the details.
The answer to Ralph's question depends on the standard deviation. Of course, if he knows that, he probably also knows roughly the amount of onion in the cans containing it. Still, sometimes people do a little prework and, sure enough, Ralph has some data we can use to calculate an estimate of the standard deviation. It is around 16 grams.
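Next, I want to know how closely Ralph needs to estimate the mean amount. He says ±4 grams with 95% confidence (of course). Setting W as the half-width of the confidence interval around the estimate of the mean gives
$$W = Z_{\alpha/2}\frac{s}{\sqrt{n}}$$
and a little algebra shows
$$n = \left(\frac{Z_{\alpha/2}\,s}{W}\right)^{2}$$
So, for Ralph's estimate, we have
$$n = \left(\frac{1.96 \times 16}{4}\right)^{2} = 61.47$$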
or 62 cans.
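The same arithmetic for the mean can be sketched in Python. This assumes the large-sample Z interval above rather than the t deviate, and the function name `sample_size_for_mean` is hypothetical.

```python
import math


def sample_size_for_mean(s, half_width, z=1.96):
    """Smallest whole n with z * s / sqrt(n) <= half_width.

    Uses the large-sample normal deviate; for small n the t deviate
    would give a somewhat larger answer.
    """
    n_exact = (z * s / half_width) ** 2
    # A tiny tolerance guards against floating-point noise before rounding up.
    return math.ceil(n_exact - 1e-9)


# Ralph's numbers: s of about 16 grams, half-width of 4 grams, 95% confidence.
print(sample_size_for_mean(16, 4))
```

With Ralph's figures this prints 62. Because n grows with the square of s/W, halving the half-width to ±2 grams would roughly quadruple the number of cans to weigh.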
How To Sample
Ralph isn't very impressed with the formulas or the calculations. While he never says so, I can tell he is thinking, "Well, of course you can do that. You're a statistician, and that's what you're supposed to do. That's why I was sent, er, why I came to see you. And besides, there's part of this that doesn't make any sense. You never said where the samples should come from, and I have to believe that makes a big difference." Right, it does make a big difference.
Hand someone a box, tell him to go get some samples, and he'll get the samples closest to him. Those samples are not likely to contain the same proportion of onions as the entire production. To be fair, shouldn't we have the same number of samples from each of the three shifts? Or the same number from each pallet? The greatest concern is the samples must represent the segment of production whose disposition will be decided. As always, the key is to take the right amount of the right kind of data.
One way to guarantee representation in the long run is to sample using a table of random numbers. Select, for example, a seven-digit random number. Let the first three digits denote the pallet number, the fourth digit the layer on the pallet, the fifth digit the shipper on the layer and the sixth and seventh digits the can within the shipper. Upon hearing this, Ralph says, "In the long run, we're all dead. Suppose your random numbers want all my samples to come from the same area? It could happen, you know. How do I explain that to the boss?" You don't.
A systematic random sample might be more suitable. Deliberately choose every fourth pallet in production order, for example, and then use a random number table to help choose the layer, shipper and can.
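The systematic scheme just described can be sketched in Python. Everything about the warehouse layout here is an assumption for illustration: the counts of pallets, layers, shippers and cans are made up, the function name `systematic_sample` is hypothetical, and choosing the first pallet at random within the first block of four is an added convention rather than part of Ralph's plan.

```python
import random


def systematic_sample(n_pallets, layers, shippers, cans, step=4, seed=None):
    """Pick one can from every `step`-th pallet in production order.

    Returns a list of (pallet, layer, shipper, can) tuples, all 1-based.
    The layer, shipper and can within each chosen pallet are random.
    """
    rng = random.Random(seed)
    # Assumption: start at a random pallet within the first block.
    start = rng.randrange(step)
    picks = []
    for pallet_index in range(start, n_pallets, step):
        picks.append((
            pallet_index + 1,
            rng.randint(1, layers),
            rng.randint(1, shippers),
            rng.randint(1, cans),
        ))
    return picks


# Hypothetical layout: 40 pallets, 5 layers, 6 shippers per layer, 24 cans each.
for pick in systematic_sample(40, 5, 6, 24, step=4, seed=1):
    print(pick)
```

This spreads the sample evenly across production order, so the random numbers can no longer put every sample in one corner of the warehouse. To reach a target sample size, take more than one can per chosen pallet or shrink the step.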
"Would that do, Ralph?"
"Yeah. Why didn't you tell me that when I asked you how many samples you usually take for tomato sauce?"
As a general rule of thumb, it is safe to use the normal approximation to the binomial distribution when np ≥ 5 and n(1 - p) ≥ 5.
LYNNE B. HARE is program director of technology guidance at Kraft Foods Research in East Hanover, NJ. He received a doctorate in statistics from Rutgers University, New Brunswick, NJ. Hare is a past chairman of ASQ's Statistics Division and an ASQ Fellow.