Last week’s series of posts on boys, girls and population ratios drew an astonishing 850 or more comments, and they’re still coming in. There’s a lot of stuff worth reading in those comments, but the conversation — which is spread out over four threads and often involves several overlapping subconversations — has become difficult to follow. Fortunately, one of our more insightful commenters has volunteered to write a guest post summarizing what he sees as some of the most important discoveries to come out of those discussions. He has signed his guest post “Tom M”, but in the comments, he is simply “Tom”.
Without further ado:
Steve Landsburg posed a problem in this post at his blog entitled Are You Smarter than Google?. Here was the problem statement:
There’s a certain country where everybody wants to have a son. Therefore each couple keeps having children until they have a boy; then they stop. What fraction of the population is female?
Well, of course, you can’t know for sure, because, by some extraordinary coincidence, the last 100,000 families in a row might have gotten boys on the first try. But in expectation, what fraction of the population is female? In other words, if there were many such countries, what fraction would you expect to observe on average?
As we’ve worked on that problem, one of the points that came up was something called the “extra half boy.”
In this post I’m going to try to summarize what we’ve said about that critter and some related issues.
A key reference, that everybody should read, is What is the expected proportion of girls?, by ‘Thomas Bayes’.
2.1. Flip a coin. The flip yields either ‘G’ or ‘B’ with equal probability.
2.2. A family
Flip a single coin repeatedly, until a B comes up exactly once, then stop. Record the entire string of results in order.
Call that string a ‘completed family.’
Examples of completed families: B, GB, GGB, GGGGGB.
Examples of strings that are not completed families: G, GGG (no B yet), BB, BGB (more than one B), GBG (continues past the B).
2.3. A population of N children
Sometimes we may be given a string of flips without information about families. So long as the string terminates in a B, we can always assign the flips to completed families sequentially.
Example: the string GGBGBBGB (N=8) splits into 4 completed families: GGB GB B GB
When the string terminates in a G, we can never decompose it, in the order given, into completed families.
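The sequential assignment described in 2.3 can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_families` is made up for this example, and it assumes flips are written as a string of 'G' and 'B' characters.

```python
def split_into_families(flips):
    """Assign a string of 'G'/'B' flips to completed families, in order."""
    if not flips or set(flips) - {"G", "B"}:
        raise ValueError("expected a non-empty string of 'G' and 'B'")
    if not flips.endswith("B"):
        # A trailing G belongs to a family that has not had its boy yet.
        raise ValueError("a string ending in G leaves the last family incomplete")
    families, current = [], ""
    for flip in flips:
        current += flip
        if flip == "B":          # every B closes a family
            families.append(current)
            current = ""
    return families

print(split_into_families("GGBGBBGB"))  # ['GGB', 'GB', 'B', 'GB']
```

The number of families k is just the length of the returned list, which equals the number of Bs in the string.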
2.4. A ‘completed country’ is a group of completed families
Examples of completed countries:
B (k=1, N=1)
B B (k=2, N=2)
GGGGGGB GGB (k=2, N=10)
GB B GGGGGB B (k=4, N=10)
In order to calculate national statistics, we need to define some set of completed countries. We can define and index sets of completed countries in different ways. Two of those ways are described below.
2.5. The ‘ensemble of k-family completed countries’
This is the set of all possible completed countries that consist of k completed families.
Example: the set of possible countries for k=1: B, GB, GGB, GGGB, …
This ensemble, like every ensemble of k-family countries for k>0, is infinite.
Note that N varies over this ensemble.
2.6. The ‘ensemble of N-child completed countries’
This is the set of all completed countries that have the same N.
Example: the set of all 8 possible countries for N=4: BBBB, BBGB, BGBB, BGGB, GBBB, GBGB, GGBB, GGGB.
For finite N, every ensemble of N-person countries is finite. Here it has 8 members.
Note that k (the number of Bs) varies over this ensemble.
3. Single-family completed countries
Steve Landsburg covered this in the post A Big Answer on his blog. I’ll repeat some of that here.
3.1. The possible bit strings are the ensemble of 1-family countries
k=1 (one family –> one B)
B N=1. 1 case. Probability 1/2.
GB N=2. 1 case. Probability 1/4.
GGB N=3. 1 case. Probability 1/8.
GGGB N=4. 1 case. Probability 1/16.
GGGGB N=5. 1 case. Probability 1/32.
GGGGGB N=6. 1 case. Probability 1/64.
The single B can only come at the end of the string, so for each N there is only one possible string.
3.2. Statistics for the ensemble of 1-family countries
E(B) = 1
E(G) = 1
E(G/(G+B)) = 1 – ln(2) ~ 0.307.
[See A Big Answer ].
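As a quick numerical check of the value above, the series can be summed directly. This is a rough sketch, not taken from A Big Answer: the function name and the truncation point `max_children` are choices made for this example. A family of N children has N-1 girls and probability 2^(-N).

```python
import math

def single_family_girl_fraction(max_children=200):
    """Sum 2**-N * (N-1)/N over single-family countries with N children."""
    return sum(2.0**-n * (n - 1) / n for n in range(1, max_children + 1))

# The truncated sum and the closed form 1 - ln(2) should agree closely.
print(single_family_girl_fraction(), 1 - math.log(2))
```

The terms shrink like 2^(-N), so truncating at a few hundred children loses nothing visible in double precision.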
4. Multiple-family completed countries for fixed k
Here we’re looking at the ensemble of k-family countries.
4.1. The ensemble of 2-family completed countries
B B N=2. 1 case. Probability 1/4.
GB B, B GB N=3. 2 cases. Probability 1/8 each.
GGB B, GB GB, B GGB N=4. 3 cases. Probability 1/16 each.
There are infinitely many cases. For each value of N there are N-1 cases.
One B comes at the end of the string. The second B comes anywhere else in the string.
4.2. Statistics for the ensemble of 2-family completed countries
E(B) = 2
E(G) = 2
E(N) = 4
E(G/(G+B)) ~ 0.386294
This result is due to Anshuman. Please see his comment: he provides the analytical treatment as well as the numerical result.
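For readers who want to see where the number comes from, here is one way to recover it. It is not necessarily the route Anshuman took. It uses only the case counts from section 4.1 (N-1 strings of probability 2^(-N) each, with G = N-2) and the standard series for ln 2.

```latex
\begin{align*}
P(N) &= (N-1)\,2^{-N}, \qquad G = N-2, \qquad N \ge 2,\\
E\!\left[\frac{G}{G+B}\right]
  &= \sum_{N\ge 2} \frac{(N-1)(N-2)}{N}\,2^{-N}
   = \sum_{N\ge 2} \left(N - 3 + \frac{2}{N}\right) 2^{-N}\\
  &= \frac{3}{2} - \frac{3}{2} + 2\left(\ln 2 - \frac{1}{2}\right)
   = 2\ln 2 - 1 \approx 0.386294.
\end{align*}
```

The last line uses the sums of N*2^(-N) and 3*2^(-N) over N >= 2, which are both 3/2, and the sum of 2^(-N)/N over N >= 1, which is ln 2.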
Notice that the approximate formula
E(G/(G+B)) ~ 1/2 – 1/(2*E(N)) = 1/2 – 1/8 = 0.375
is already a half-decent approximation to the exact value, even for k=2.
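To see how the approximation behaves for other k, the exact series can be summed numerically. This is a sketch under stated assumptions: the function name and the cutoff `max_children` are invented for this example, and it assumes that a k-family country with N children has probability C(N-1, k-1)/2^N and E(N) = 2k.

```python
from math import comb, log

def girl_fraction_k_families(k, max_children=400):
    """E(G/(G+B)) over the ensemble of k-family completed countries."""
    # C(N-1, k-1) strings of length N, each with probability 2**-N,
    # and each containing N-k girls.
    return sum(comb(n - 1, k - 1) / 2**n * (n - k) / n
               for n in range(k, max_children + 1))

for k in (1, 2, 5, 20):
    exact = girl_fraction_k_families(k)
    approx = 0.5 - 1 / (4 * k)      # 1/2 - 1/(2*E(N)) with E(N) = 2k
    print(k, round(exact, 6), round(approx, 6))
```

The cutoff of 400 children is far out in the tail for these small k; a larger k would need a larger cutoff.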
If instead we only consider the first N-1 flips, Anshuman’s result is modified to
E(G/(G+B)) | first N-1 flips = (1/4)*1*0 + (1/8)*2*(1/2) + (1/16)*3*(2/3) + (1/32)*4*(3/4) + (1/64)*5*(4/5) + …
That simplifies to the familiar series:
E(G/(G+B)) for the first N-1 births with k=2
= 0/4 + 1/8 + 2/16 + 3/32 + 4/64 + 5/128 + … = 1/2
That is, if we ignore the lastborn child, E(B/(B+G))=1/2. That is true for any k>1.
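The claim for general k>1 can be checked numerically. This is a hedged sketch: the function name and the cutoff are made up here, and it assumes there are C(N-1, k-1) equally likely strings of length N, each with k-1 boys and N-k girls among its first N-1 births.

```python
from math import comb

def girl_fraction_before_last_birth(k, max_children=400):
    """E(G/(G+B)) over the first N-1 births of a k-family country, k > 1."""
    if k < 2:
        raise ValueError("with k=1 and N=1 there are no earlier births")
    return sum(comb(n - 1, k - 1) / 2**n * (n - k) / (n - 1)
               for n in range(k, max_children + 1))

for k in (2, 3, 10):
    print(k, girl_fraction_before_last_birth(k))
```

For k=2 the summand reduces to (N-2)/2^N, the series written out above.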
4.3. The ensemble of k-family completed countries
Begin with the ensemble for the 2-family case, in section 4.1. In this case also, one B comes at the end of the string. Another k-1 Bs come anywhere else. The remaining symbols are Gs. So for each N there are (N-1 k-1) possible strings. Here the notation (a b), a choose b, is the binomial coefficient a!/((a-b)!b!).
4.4. Statistics for the k-family case
Thomas Bayes has given statistics in his report. (Please keep in mind that at publication time Thomas was using the symbol K for the population and N for the number of families. That is the reverse of the convention used here.)
E(B) = E(G) = k.
Over the first N-1 children only, following the argument in section 4.2 above, the formula for general k>1 is
E(G/(G+B)) = sum from N=k to infinity, 2^(-N)(N-1 k-1)(N-k)/(N-1) = 1/2.
Here (a b) is the binomial coefficient “a choose b” again, and (N-k)/(N-1) is the fraction of girls among the first N-1 children.
In his piece What is the expected proportion of girls?, Thomas estimates the expected proportion of females in a country where every family wants a boy: roughly, E(G/(G+B)) ~ 1/2 – 1/(2*E(N)). (This is an approximation, better for larger k.)
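Because E(N) = 2k, the correction 1/(2*E(N)) equals 1/(4k), which is half a child as a fraction of the expected population. That is why we’ve called this effect an “extra half boy.”
It’s interesting that, because E(B) = E(G) for an ensemble of k-family countries, the “extra half boy” appears only in the expected ratio E(G/(G+B)), not in the expectations E(B) and E(G) themselves, nor in the ratio of those expectations.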
Nevertheless the impact on the G/(G+B) ratio is entirely associated with the last child. Calculated over the first N-1 children only, E(G/(G+B)) = 1/2 exactly.
Jonathan Campbell pointed out something very important here. That last equation implies that the statistics for the first N-1 children in the population are not those of random coin flips. Here’s the argument:
We know that for all N children E(B)=E(G)=k.
Since the last child is a boy, for the first N-1 children we have E(B) = k-1 and E(G) = k.
For N-1 random flips, with E(N-1) = 2k-1, the expected number of Bs and Gs would be k – 1/2 each.
So for fixed k the last flip is fixed to B, but the first N-1 flips are not completely independent.
We’ve excluded sequences of N flips for which B is not equal to k.
In fact, if we know both k and N, the allowable sequences are exactly the (N-1 k-1) ways to place k-1 boys into N-1 slots in the birth sequence. That set of outcomes is drastically reduced from the complete set we’d get from N-1 coin flips.
You have to be careful thinking about cases where k and N are both fixed. Those ensembles do not resemble strings of coin flips very much at all.
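Jonathan’s point can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration: the function names, the trial count, the seed, and the use of Python’s `random` module as the fair coin. It builds k-family countries and averages the B and G counts over the first N-1 births.

```python
import random

def simulate_country(k, rng):
    """Return the birth string of one k-family completed country."""
    births = []
    for _ in range(k):
        # Each family keeps flipping until its first B.
        while rng.random() < 0.5:
            births.append("G")
        births.append("B")
    return "".join(births)

def first_births_averages(k, trials=20000, seed=0):
    """Average B and G counts over the first N-1 births of each country."""
    rng = random.Random(seed)
    boys = girls = 0
    for _ in range(trials):
        head = simulate_country(k, rng)[:-1]   # drop the final B
        boys += head.count("B")
        girls += head.count("G")
    return boys / trials, girls / trials

print(first_births_averages(3))
```

With k=3 the boy average is exactly 2 (that is, k-1) in every run, and the girl average should land near 3 (that is, k), rather than at the k – 1/2 each that unconstrained coin flips would give.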
5. The ensemble of completed countries for a fixed N
This is a different way of looking at the problem. The first comment I’ve found suggesting this idea is from Neil in a discussion with Phil Birnbaum.
Instead of fixing the number of families (k), we fix the total population (N). We simply have N-1 random coin flips, followed by a single B due to the national stopping rule.
The number of families is determined by the outcome of the coin flips as described in section 2.3 above.
In this case, by inspection of the ensemble,
E(B) = E(G) + 1
E(B) = N/2 + 1/2
E(G) = N/2 – 1/2
E(k) = N/2 + 1/2 = E(B)
E(G/(G+B)) = 1/2 – 1/(2*N)
A terminal half boy appears, along with a missing half girl.
The half-couple appears in the expectation values for B and G in this case, not only in the expected ratio.
This isn’t surprising: we have a sequence of otherwise random births, terminating with a B.
In effect, by imposing the constraint that completed families must terminate with a B, we put in half a boy and took out half a girl.
Example: for N=4,
BBBB k=4 G/(G+B)=0
BBGB k=3 G/(G+B)=1/4
BGBB k=3 G/(G+B)=1/4
BGGB k=2 G/(G+B)=1/2
GBBB k=3 G/(G+B)=1/4
GBGB k=2 G/(G+B)=1/2
GGBB k=2 G/(G+B)=1/2
GGGB k=1 G/(G+B)=3/4
E(B) = 2.5
E(G) = 1.5
E(k) = E(B) = 2.5
E(G/(G+B)) = 3/8
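The N=4 figures can be reproduced by brute-force enumeration. This is a minimal sketch: the function name is made up for this example, and it assumes every one of the 2^(N-1) strings is equally likely, since each is N-1 fair flips followed by the forced B.

```python
from fractions import Fraction
from itertools import product

def n_child_ensemble_stats(n):
    """Exact E(B), E(G), E(G/(G+B)) over all n-child completed countries."""
    # N-1 free flips, then the terminal B.
    countries = ["".join(p) + "B" for p in product("BG", repeat=n - 1)]
    total = len(countries)
    e_boys = Fraction(sum(c.count("B") for c in countries), total)
    e_girls = Fraction(sum(c.count("G") for c in countries), total)
    e_ratio = sum(Fraction(c.count("G"), n) for c in countries) / total
    return e_boys, e_girls, e_ratio

print(n_child_ensemble_stats(4))  # (Fraction(5, 2), Fraction(3, 2), Fraction(3, 8))
```

Because the number of families k equals the number of Bs, E(k) is the first value returned. Exact fractions avoid any rounding question, at the cost of enumerating 2^(N-1) strings, so this is only practical for small N.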
This case is very easy to perform calculations on.
As Neil put it in a comment,
The point is the probabilities of the different subsequences are determined exactly as you would get by flipping a fair coin. If that was all there was to it, Lubos would be right, the expected girl proportion is exactly equal to 50%—but there is also the boy on the decision coin, which is why Steve is right, it is less than 50%.
(Note that we’re still talking only about the single-generation case here, however.)
If we want to talk about all possible completed countries, the fixed-N case gives us an easy way to do that.
For every N, E(B)-E(G)=1. So if we average over all N we will get E(B)-E(G)=1 independent of how we weight that average. In that sense, we might say that completed single-generation countries average a half boy more (and a half girl less) than we would get from a totally random sequence of births.
Gambling analogies haven’t taken up much of the discussion.
But Henry, in this comment, pointed out a parallel to a well-known and widely-analyzed gambling “system.” It serves as a check that we haven’t got a moneymaking bonanza here, which would be a bad sign.
Henry adds some proofreading a few comments later.
In the single-generation case, it seems clear that the expected ratio of girls is somewhat less than 1/2.
The simplest way to see that may be to consider the set of all completed countries that have N children (Section 5 above).
For any N we get E(B)-E(G)=1.
For every N, E(G/(G+B)) is less than 1/2; in fact E(G/(G+B))=1/2 – 1/(2*N).
For every N the shortfall 1/(2*N) is due entirely to the action of the stopping rule; the remaining births are random coin flips.
When we consider more than one generation, I don’t understand clearly whether the answer to Steve’s problem is 1/2 or something less. It still seems possible that in the absence of a termination, there might be no terminal B and no deviation from 1/2. We certainly haven’t proven that, but I don’t think we’ve disproved it so far.
Thomas Bayes in What is the expected proportion of girls? provided an outline for a solution that would come out to less than 1/2.
Neil’s interesting argument here is based on the idea that the stopping rule in the problem statement can lead to extinction for some countries in the ensemble. Since most of the ensemble, in fact, consists of families with fewer girls than boys, how can we expect them to produce enough new couples to replace the original parental generation?
Extinction by means of excess males seems to correspond to E(G/(G+B))<1/2, though I haven't seen details.
I hope they're right. The more I work on this problem, the more eagerly I begin to look forward to the extinction of these miserable countries and their misguided inhabitants!
8. Acknowledgements and additional references
Again, a key reference, that everybody should read, is What is the expected proportion of girls?, by ‘Thomas Bayes’. In that document, Thomas outlines his solution to Steve’s problem and covers some of this same material from a different perspective.
(One note: at publication time Thomas used the symbol K for the population and N for the number of families. That is exactly the reverse of the convention I use here. I’m sticking with my own convention, for compatibility with Steve’s posts and the majority of comments. But I do apologize for the inconsistency.)
Most of the material here was either posted by Steve Landsburg on his blog or worked out among commenters there, including, in no very reliable order, Thomas Bayes, Jonathan Campbell, Neil, Jonatan, Phil Birnbaum, Pietro Poggi-Corradini, Anshuman and Henry. I apologize to the folks I’ve inevitably missed. Not all these people may agree with everything I’ve written!
When I cite comments, I try to hit the key ones.
But of course other related points, corrections, refinements, rants, etc., are often nearby in the comment stream.