Why does increasing the sample size of coin flips not improve the normal curve approximation?












2












$begingroup$


I'm reading the Statistics (Freedman, Pisani, Purves) book and I'm trying to reproduce an example where a coin is tossed, say, 50 times, the number of heads is counted, and this is repeated, say, 1,000 times.




  1. First, I kept the number of tosses (sample size) at 1,000 and increased the repetitions. The more repetitions, the better the data fit the normal curve.


  2. Next, I kept the number of repetitions fixed at 1,000 and increased the sample size. The larger the sample size, the worse the normal curve seemed to fit the data. This seems to contradict the book example, which approximates the normal curve better as the sample size increases.


  3. I wanted to see what would happen if I increased the sample size with a larger number of repetitions, fixed at 10,000. This also seems to contradict the book.



Any ideas what I'm doing wrong?



Code and graphs below.



%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def plot_hist(num_repetitions, num_tosses):
    # Each row is one repetition; each entry is one toss (1 = heads, 0 = tails)
    tosses = np.random.randint(0, 2, size=[num_repetitions, num_tosses])
    sums = tosses.sum(axis=1) # number of heads in each repetition

    xmin, xmax = sums.min(), sums.max()
    lnspc = np.linspace(xmin, xmax, len(sums))

    m, s = stats.norm.fit(sums) # get mean and standard deviation
    pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval

    # One bin per integer count; + 2 so that the last edge lies beyond xmax
    bins = np.arange(xmin, xmax + 2) - 0.5
    step = max(1, int((xmax - xmin)/5)) # avoid a zero slice step for narrow ranges

    fig, ax = plt.subplots()
    _ = ax.hist(sums, bins, edgecolor='black', linewidth=1.2, density=True)
    _ = ax.plot(lnspc, pdf_g, label="Norm", color='red')
    _ = ax.set_xticks(bins[::step] + 0.5)
    _ = ax.set_title('{:,} tosses - {:,} repetitions'.format(num_tosses, num_repetitions))


1. Experiment with increasing number of repetitions (fixed sample size of 1000)

plot_hist(1000, 1000)

[Figure: 1,000 tosses - 1,000 repetitions]

plot_hist(10000, 1000)

[Figure: 1,000 tosses - 10,000 repetitions]

plot_hist(100000, 1000)

[Figure: 1,000 tosses - 100,000 repetitions]

2. Experiment with increasing sample size (fixed at 1000 repetitions)

plot_hist(1000, 100)

[Figure: 100 tosses - 1,000 repetitions]

plot_hist(1000, 1000)

[Figure: 1,000 tosses - 1,000 repetitions]

plot_hist(1000, 10000)

[Figure: 10,000 tosses - 1,000 repetitions]

3. Experiment with increasing sample size (fixed at 10,000 repetitions)

plot_hist(10000, 100)

[Figure: 100 tosses - 10,000 repetitions]

plot_hist(10000, 1000)

[Figure: 1,000 tosses - 10,000 repetitions]

plot_hist(10000, 10000)

[Figure: 10,000 tosses - 10,000 repetitions]

plot_hist(10000, 100000)

[Figure: 100,000 tosses - 10,000 repetitions]










$endgroup$








  • 2




    $begingroup$
    You need a better histogram plotter--this one is particularly poor and it's creating artifacts. Better yet, compare distributions with a probability plot rather than a histogram.
    $endgroup$
    – whuber
    2 hours ago










  • $begingroup$
    This is useful to know and has increased my knowledge today. Thanks!
    $endgroup$
    – Chris Snow
    1 hour ago
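
The probability plot suggested in the first comment above could be sketched roughly as follows. This is only an illustration, assuming NumPy and SciPy are available; the helper names `head_counts` and `normal_probplot_r` are made up here, and the head counts are drawn directly from a binomial distribution, which is equivalent to summing fair coin flips but is not what the question's code does. `scipy.stats.probplot` returns the ordered data against normal quantiles together with the correlation coefficient `r` of the fitted line; values of `r` near 1 indicate that the points lie close to a straight line.

```python
import numpy as np
from scipy import stats


def head_counts(num_repetitions, num_tosses, seed=0):
    """Number of heads in each repetition of num_tosses fair coin flips."""
    rng = np.random.default_rng(seed)
    # Assumption: sample the counts directly instead of summing single flips
    return rng.binomial(num_tosses, 0.5, size=num_repetitions)


def normal_probplot_r(sums):
    """Correlation coefficient of the normal probability plot of sums."""
    (_quantiles, _ordered), (_slope, _intercept, r) = stats.probplot(sums, dist="norm")
    return r


for num_tosses in (100, 1000, 10000):
    r = normal_probplot_r(head_counts(1000, num_tosses))
    print(f"{num_tosses:>6,} tosses, 1,000 repetitions: r = {r:.4f}")
```

To draw the plot rather than only compute `r`, `stats.probplot` also accepts a `plot` argument, for example a Matplotlib axes object.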
















normal-distribution normal-approximation






Question: asked 4 hours ago by Chris Snow, edited 42 mins ago by osuka_








2 Answers
4












$begingroup$

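In the second experiment, by increasing the number of tosses, you increase the number of bins a single repetition can fall into. The first case of experiment 2 has at most 101 bins that can be filled (0 to 100 heads), while the last has 10,001. You increased the "resolution" of your experiment by a factor of 100 (i.e. one bin in the first case is now represented by roughly 100 in the last). In practice the observed counts cluster within a few standard deviations of the mean, and the standard deviation, sqrt(n)/2, grows only tenfold (from 5 to 50), so roughly ten times as many bins are actually occupied. Either way, you need correspondingly more data to fill the bins; with the repetitions held fixed, each bin receives fewer observations and the histogram looks noisier.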
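
A minimal sketch of this effect, assuming NumPy is available and drawing the head counts from a binomial distribution rather than summing individual flips (the function name `mean_count_per_bin` is made up for illustration). With 1,000 repetitions, one bin per integer leaves fewer observations in each bin as the number of tosses grows, while a fixed number of bins (35 is just an example value) keeps the average per bin constant.

```python
import numpy as np


def mean_count_per_bin(num_repetitions, num_tosses, num_bins=None, seed=0):
    """Average number of observations per histogram bin of the head counts.

    num_bins=None mimics the question's plot: one bin per integer between
    the smallest and largest observed count.
    """
    rng = np.random.default_rng(seed)
    sums = rng.binomial(num_tosses, 0.5, size=num_repetitions)
    if num_bins is None:
        # Edges at half-integers, so each integer count gets its own bin
        bins = np.arange(sums.min(), sums.max() + 2) - 0.5
    else:
        bins = num_bins
    counts, _edges = np.histogram(sums, bins=bins)
    return counts.mean()


for num_tosses in (100, 1000, 10000):
    per_integer = mean_count_per_bin(1000, num_tosses)
    fixed = mean_count_per_bin(1000, num_tosses, num_bins=35)
    print(f"{num_tosses:>6,} tosses: {per_integer:6.1f} per integer bin, "
          f"{fixed:5.1f} per bin with 35 bins")
```

The sparser the bins, the larger the relative sampling noise in each bar, which is why the per-integer histogram looks ragged even though the underlying distribution is close to normal.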






$endgroup$













  • $begingroup$
    Ah, makes sense. By setting the number of bins to 35 (_ = ax.hist(sums, bins=35, edgecolor='black', linewidth=1.2, density=True)) the third experiment now approximates the normal curve.
    $endgroup$
    – Chris Snow
    1 hour ago



















3












$begingroup$

You can think of an individual coin flip as an independent Bernoulli trial. One trial gives you either heads or tails, that is, success or failure. If you repeat this, say, 100,000 times, the proportion of heads will be very close to 0.5 if the coin is fair.



Now if you increase the number of trials to 1,000 and keep the repetitions at 1, you get a single sequence of 1,000 successes/failures and cannot say much about the probability of observing, on average, 500 heads unless you increase the number of repetitions of that set of independent trials. As the number of repetitions increases, you get a better and better approximation to the normal distribution.



For me it is easier to think of the trials not as “tosses” or “sample sizes” but as separate coins, and of the repetitions as the number of flips of each of those coins. Then it also makes intuitive sense that increasing the number of coins (or trials) while keeping the number of repetitions constant makes the data's approximation to the normal distribution worse.
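
The Bernoulli-trial picture from the first paragraph of this answer can be sketched as below. This is only an illustration under the assumption of a fair coin simulated with NumPy; the function name `proportion_of_heads` and the flip counts are chosen here for the example and do not come from the answer.

```python
import numpy as np


def proportion_of_heads(num_flips, seed=0):
    """Fraction of heads in num_flips independent fair-coin Bernoulli trials."""
    rng = np.random.default_rng(seed)
    flips = rng.integers(0, 2, size=num_flips)  # 1 = heads, 0 = tails
    return flips.mean()


for num_flips in (10, 1_000, 100_000):
    p_hat = proportion_of_heads(num_flips)
    print(f"{num_flips:>7,} flips: proportion of heads = {p_hat:.4f}")
```

With more flips the printed proportion settles near 0.5, which is the behaviour the answer describes for a fair coin.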






$endgroup$













First answer: answered 2 hours ago by Ulfgard, edited 1 hour ago by Chris Snow












Second answer: answered 3 hours ago by Stefan, edited 2 hours ago





























