Numbers and letters for words
A little while ago I built a Twitter bot called AcroPoll. It is a fairly simple bot: all it does is generate a short string of characters randomly selected from the alphabet and tweet it. It’s part of a game my friends and I play, where someone makes up an acronym and the others rattle off words that fit. We found that the bot was picking Z, X and Y just as frequently as the other letters. This posed a problem, since everyone “knows” there are fewer words that start with Z, X and Y than with the rest. Right? Well, sure, if you throw Q in there as well.
With the help of some friends I’ve compiled a count of words by first letter, along with each letter’s share of the total. (Of course, this is for US English.)
s : 31675 10.856%
c : 25994 8.909%
p : 23936 8.204%
a : 17704 6.068%
m : 17330 5.940%
d : 16463 5.643%
b : 16076 5.510%
r : 15406 5.280%
t : 15127 5.185%
h : 11510 3.945%
e : 11457 3.927%
f : 10441 3.579%
i : 10346 3.546%
g : 9899 3.393%
u : 9272 3.178%
l : 9263 3.175%
o : 9092 3.116%
n : 7445 2.552%
w : 6584 2.257%
v : 4747 1.627%
k : 4484 1.537%
j : 3041 1.042%
z : 1464 0.502%
q : 1446 0.496%
y : 1206 0.413%
x : 355 0.122%
What I find interesting is the 3% range: eight letters (H, E, F, I, G, U, L and O) all land between 3% and 4%. It makes sense when you think about it, but actually seeing it is something else.
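For anyone who wants to build a similar tally, here is a rough sketch. It is not how the numbers above were compiled. It assumes PHP (to match the `$alphabet` snippets in this post), and the `countByFirstLetter` function and the tiny word list are made up for illustration; a real run would read a full dictionary file.

```php
<?php
// Hypothetical helper: tally words by their first letter (case-insensitive).
function countByFirstLetter(array $words): array
{
    $counts = [];
    foreach ($words as $word) {
        $first = strtolower(substr(trim($word), 0, 1));
        // Skip blank entries and anything that does not start with a-z.
        if (!preg_match('/^[a-z]$/', $first)) {
            continue;
        }
        $counts[$first] = ($counts[$first] ?? 0) + 1;
    }
    arsort($counts); // most common first letter first
    return $counts;
}

// Tiny stand-in word list (an assumption, not the real data set).
$words = ['zebra', 'sun', 'sand', 'cat', 'Salt', 'cup'];
$counts = countByFirstLetter($words);
$total = array_sum($counts);

// Print in the same "letter : count percent" layout as the table above.
foreach ($counts as $letter => $count) {
    printf("%s : %d %.3f%%\n", $letter, $count, 100 * $count / $total);
}
```

The percentage is simply each letter's count divided by the total number of words counted.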
I started out with an alphabet much like this:
$alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
But after reworking based on the above data I’ve come up with this:
$alphabet = "SSSSSCCCCCPPPPPAAAAAMMMMDDDDBBBBRRRRTTTTEEEHHHFFFIIIGGGUUULLLOOONNWWVVKKJJZQYX";
I started at the bottom and worked my way up from X to S. Each group of letters with similar percentages gets one more copy than the group below it. So Q only gets one letter in my modified alphabet due to the lack of words that start with it, while T gets four because it has more. I did group everything above 6% together because I didn’t want S to come up too often, even though more words start with it.
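To see what the repeated letters do to the odds, here is a small sketch. It assumes PHP and is not the bot's actual code; `makeAcronym` is a hypothetical helper name. A letter's chance of being picked is its number of copies divided by the length of the string, which is 78 characters here.

```php
<?php
// The weighted alphabet from the post: more copies means a higher chance.
$alphabet = "SSSSSCCCCCPPPPPAAAAAMMMMDDDDBBBBRRRRTTTTEEEHHHFFFIIIGGGUUULLLOOONNWWVVKKJJZQYX";

// Hypothetical helper: build an acronym by picking random positions in the string.
function makeAcronym(string $alphabet, int $length): string
{
    $acronym = '';
    $last = strlen($alphabet) - 1;
    for ($i = 0; $i < $length; $i++) {
        $acronym .= $alphabet[random_int(0, $last)];
    }
    return $acronym;
}

// count_chars() in mode 1 returns byte value => number of occurrences.
$copies = count_chars($alphabet, 1);
$total = strlen($alphabet);

// Show the resulting chance for a few letters.
foreach (['S', 'T', 'Q'] as $letter) {
    $n = $copies[ord($letter)];
    printf("%s: %d of %d (%.1f%%)\n", $letter, $n, $total, 100 * $n / $total);
}

// Print one random four-letter acronym.
echo makeAcronym($alphabet, 4), "\n";
```

With the plain 26-letter alphabet every letter had a 1 in 26 chance (about 3.8%). Here S rises to 5 in 78 (about 6.4%) and Q drops to 1 in 78 (about 1.3%), which is still more generous to Q than its 0.496% share of words.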

Sorry I didn’t have time to help you more with this. A few years ago, I actually did do a lot of study on letter frequencies, but I couldn’t find a quick ref to relative number of words starting with each letter, but I remembered that S is far and away the most common.
Expect to see more Esperanto Acropolls, from me, since the incompatible letters (Q,W,X,Y) are all towards the infrequent end of the list!
Comment by Jon Zuck — October 5, 2008 @ 11:48 am