Why does increasing the sample size of coin flips not improve the normal curve approximation?
$begingroup$
I'm reading the Statistics (Freedman, Pisani, Purves) book and I'm trying to reproduce an example where a coin is tossed, say, 50 times, the number of heads is counted, and this is repeated, say, 1,000 times.
First, I kept the number of tosses (sample size) at 1,000 and increased the number of repetitions. The more repetitions, the better the data fit the normal curve.
Next, I kept the number of repetitions fixed at 1,000 and increased the sample size. The larger the sample size, the worse the normal curve seemed to fit the data. This seems to contradict the book's example, which approximates the normal curve better as the sample size increases.
I then increased the sample size again, this time with a larger, fixed number of repetitions (10,000). The result also seems to contradict the book.
Any ideas what I'm doing wrong?
Code and graphs below.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_hist(num_repetitions, num_tosses):
    # Each row is one repetition; each entry is one toss (0 = tails, 1 = heads)
    tosses = np.random.randint(0, 2, size=[num_repetitions, num_tosses])
    sums = np.apply_along_axis(lambda a: np.sum(a == 1), 1, tosses)  # heads per repetition
    xmin, xmax = min(sums), max(sums)
    lnspc = np.linspace(xmin, xmax, len(sums))
    m, s = stats.norm.fit(sums)  # get mean and standard deviation
    pdf_g = stats.norm.pdf(lnspc, m, s)  # now get theoretical values in our interval
    bins = np.arange(xmin, xmax + 2) - 0.5  # one unit-width bin per count, including xmax
    step = max(1, int((xmax - xmin)/5))  # avoid a zero step when the range is small
    fig, ax = plt.subplots()
    _ = ax.hist(sums, bins, edgecolor='black', linewidth=1.2, density=True)
    _ = ax.plot(lnspc, pdf_g, label="Norm", color='red')
    _ = ax.set_xticks(bins[::step] + 0.5)
    _ = ax.set_title('{:,} tosses - {:,} repetitions'.format(num_tosses, num_repetitions))

1. Experiment with increasing number of repetitions (fixed sample size of 1000)
plot_hist(1000, 1000)
plot_hist(10000, 1000)
plot_hist(100000, 1000)
2. Experiment with increasing sample size (fixed at 1000 repetitions)
plot_hist(1000, 100)
plot_hist(1000, 1000)
plot_hist(1000, 10000)
3. Experiment with increasing sample size (fixed at 10,000 repetitions)
plot_hist(10000, 100)
plot_hist(10000, 1000)
plot_hist(10000, 10000)
plot_hist(10000, 100000)
normal-distribution normal-approximation
$endgroup$
edited 42 mins ago by osuka_
asked 4 hours ago by Chris Snow
$begingroup$
You need a better histogram plotter--this one is particularly poor and it's creating artifacts. Better yet, compare distributions with a probability plot rather than a histogram.
$endgroup$
– whuber♦
2 hours ago
$begingroup$
This is useful to know and has increased my knowledge today. Thanks!
$endgroup$
– Chris Snow
1 hour ago
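The probability-plot suggestion in the comment above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `normal_probplot_r` and the fixed seed are made up here, and the head counts are drawn directly with `rng.binomial` rather than by summing a matrix of individual tosses. It uses `scipy.stats.probplot`, which pairs the ordered data with theoretical normal quantiles and reports the correlation `r` of the fitted line.

```python
import numpy as np
from scipy import stats

def normal_probplot_r(num_repetitions, num_tosses, seed=0):
    """Return the probability-plot correlation of simulated head counts against a normal."""
    rng = np.random.default_rng(seed)
    # Number of heads in num_tosses fair tosses, repeated num_repetitions times
    # (assumption: drawn directly from a binomial instead of tossing coin by coin).
    heads = rng.binomial(num_tosses, 0.5, size=num_repetitions)
    # probplot returns the plotted points and the least-squares line through them;
    # r close to 1 means the points lie close to a straight line.
    (osm, osr), (slope, intercept, r) = stats.probplot(heads, dist="norm")
    return r

for n in (100, 1000, 10000):
    print(n, round(normal_probplot_r(1000, n), 4))
```

To actually draw the plot, `stats.probplot(heads, dist="norm", plot=plt)` can be used with `matplotlib.pyplot` imported as `plt`. Unlike the histogram, the picture does not depend on any choice of bins.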
2 Answers
$begingroup$
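In the second experiment, by increasing the number of tosses, you increase the number of bins a single repetition can fall into. While the first case of experiment 2 has at most about 100 bins that can be filled, the last case has about 10,000. You increased the "resolution" of your experiment by a factor of 100 (i.e., one bin in your first case is now represented by roughly 100 bins in your last one). The same 1,000 repetitions are therefore spread over many more bins, so each bin holds fewer counts and the histogram looks noisier. (The bins that actually get occupied grow more slowly, roughly with the square root of the number of tosses, but the effect is the same.) You would expect to need correspondingly more data to fill your bins.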
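A rough back-of-the-envelope check of this argument, as a sketch only: the function name `expected_peak_count` is hypothetical, and the calculation assumes a fair coin, unit-width bins (as in the question's code), and the normal density as a stand-in for the binomial probabilities.

```python
import math

def expected_peak_count(num_repetitions, num_tosses):
    """Approximate expected number of repetitions landing in the most likely unit-width bin."""
    sd = math.sqrt(num_tosses) / 2  # SD of the head count for a fair coin
    # Normal density at the mean, times the bin width of 1
    peak_prob = 1 / (sd * math.sqrt(2 * math.pi))
    return num_repetitions * peak_prob

for n in (100, 1000, 10000):
    print(n, round(expected_peak_count(1000, n), 1))
```

With 1,000 repetitions this gives roughly 80, 25, and 8 expected counts in the tallest bin for 100, 1,000, and 10,000 tosses. With only about 8 counts at the peak and fewer elsewhere, bin-to-bin noise dominates the picture.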
$endgroup$
edited 1 hour ago by Chris Snow
answered 2 hours ago by Ulfgard
$begingroup$
Ah, makes sense. By setting the number of bins to 35 (_ = ax.hist(sums, bins=35, edgecolor='black', linewidth=1.2, density=True)), the third experiment now approximates the normal curve.
$endgroup$
– Chris Snow
1 hour ago
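The effect described in this comment can also be checked numerically without plotting. The sketch below is an assumption-laden illustration: `fit_error` is a made-up helper, the seed is arbitrary, and the mean absolute gap between the histogram density and the fitted normal density at the bin centres is only a crude measure of fit. It compares unit-width bins with 35 bins for 1,000 repetitions of 10,000 tosses.

```python
import numpy as np
from scipy import stats

def fit_error(heads, bins):
    """Mean absolute gap between histogram density and fitted normal pdf at bin centres."""
    density, edges = np.histogram(heads, bins=bins, density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    m, s = stats.norm.fit(heads)  # mean and standard deviation of the data
    return float(np.mean(np.abs(density - stats.norm.pdf(centres, m, s))))

rng = np.random.default_rng(0)
# Head counts for 1,000 repetitions of 10,000 fair tosses (drawn directly as binomials)
heads = rng.binomial(10_000, 0.5, size=1_000)
# One bin per integer count, as in the question's code
unit_edges = np.arange(heads.min(), heads.max() + 2) - 0.5

print("unit-width bins:", fit_error(heads, unit_edges))
print("35 bins:        ", fit_error(heads, 35))
```

The data are the same in both cases; only the binning changes how closely the histogram follows the curve.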
$begingroup$
You can think of an individual coin flip as an independent Bernoulli trial. Each trial gives you either heads or tails (success or failure). If you repeat this, say, 100,000 times, the proportion of heads will be very close to 0.5 if the coin is fair.
Now if you increase the number of trials to 1,000 and keep the number of repetitions at 1, you will get a single sequence of 1,000 successes/failures and cannot say much about the probability of observing about 500 heads unless you increase the number of repetitions. As the number of repetitions increases, the histogram of head counts approximates the normal distribution better and better.
For me it is easier to think of the trials not as “tosses” or “sample sizes” but as separate coins, and of the repetitions as the number of flips of each of those coins. It then also makes intuitive sense that increasing the number of coins (or trials) while keeping the number of repetitions constant makes the data fit the normal distribution worse.
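It may help to separate two things here: how well the normal curve approximates the exact distribution of the head count (which depends on the number of tosses), and how well a finite set of repetitions displays that distribution (which depends on the number of repetitions). The following sketch looks only at the first. The function name `max_cdf_gap` is hypothetical; it compares the exact Binomial(n, 0.5) CDF with a normal CDF of mean n/2 and standard deviation sqrt(n)/2, using a continuity correction, with no simulation involved.

```python
import numpy as np
from scipy import stats

def max_cdf_gap(num_tosses):
    """Largest gap between the exact Binomial(n, 0.5) CDF and its normal approximation."""
    k = np.arange(num_tosses + 1)
    mean, sd = num_tosses / 2, np.sqrt(num_tosses) / 2
    exact = stats.binom.cdf(k, num_tosses, 0.5)
    # Continuity correction: evaluate the normal CDF half a unit above each integer.
    approx = stats.norm.cdf(k + 0.5, loc=mean, scale=sd)
    return float(np.max(np.abs(exact - approx)))

for n in (100, 1000, 10000):
    print(n, max_cdf_gap(n))
```

The gap shrinks as the number of tosses grows, which is the sense in which the book's statement holds; any extra raggedness in a simulated histogram comes from having too few repetitions per bin.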
$endgroup$
edited 2 hours ago
answered 3 hours ago by Stefan