Andrew Jaswa

Numbers and letters for words

A little while ago I built a Twitter bot called AcroPoll. It’s a fairly simple bot: all it does is generate a short string of characters picked at random from the alphabet and tweet it. It’s part of a game my friends and I play where someone makes up an acronym and the others rattle off words that fit. We found that the bot picked Z, X and Y just as often as the other letters. This posed a problem, since everyone “knows” there are fewer words that start with Z, X and Y than with the rest. Right? Well, sure, if you throw Q in there too.

With the help of some friends I’ve compiled a count of words by starting letter. (Of course, this is for US English.)
s : 31675 10.856%
c : 25994 8.909%
p : 23936 8.204%
a : 17704 6.068%
m : 17330 5.940%
d : 16463 5.643%
b : 16076 5.510%
r : 15406 5.280%
t : 15127 5.185%
h : 11510 3.945%
e : 11457 3.927%
f : 10441 3.579%
i : 10346 3.546%
g : 9899 3.393%
u : 9272 3.178%
l : 9263 3.175%
o : 9092 3.116%
n : 7445 2.552%
w : 6584 2.257%
v : 4747 1.627%
k : 4484 1.537%
j : 3041 1.042%
z : 1464 0.502%
q : 1446 0.496%
y : 1206 0.413%
x : 355 0.122%
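
For anyone who wants to check the percentages, here is a minimal Python sketch that recomputes them from the counts above. It is not the script used to build the table; the names `counts` and `share` are made up for illustration.

```python
# Word counts per starting letter, copied from the table above.
counts = {
    "s": 31675, "c": 25994, "p": 23936, "a": 17704, "m": 17330,
    "d": 16463, "b": 16076, "r": 15406, "t": 15127, "h": 11510,
    "e": 11457, "f": 10441, "i": 10346, "g": 9899, "u": 9272,
    "l": 9263, "o": 9092, "n": 7445, "w": 6584, "v": 4747,
    "k": 4484, "j": 3041, "z": 1464, "q": 1446, "y": 1206, "x": 355,
}

total = sum(counts.values())


def share(letter):
    """Percentage of all counted words that start with `letter`."""
    return 100 * counts[letter] / total


# Print the table, most common starting letter first.
for letter, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{letter} : {n} {share(letter):.3f}%")

# For comparison: the chance of each letter when all 26 are equally likely.
print(f"uniform pick: {100 / 26:.3f}%")
```

A plain 26-letter alphabet gives every letter about a 3.8% chance, which is far too generous to X and far too stingy to S.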

What I find interesting is how crowded the 3% range is: E, H, F, I, G, U, L and O all fall between 3% and 4%. It makes sense when you think about it, but actually seeing it is something else.

I started out with an alphabet much like this:
$alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
But after reworking based on the above data I’ve come up with this:
$alphabet = "SSSSSCCCCCPPPPPAAAAAMMMMDDDDBBBBRRRRTTTTEEEHHHFFFIIIGGGUUULLLOOONNWWVVKKJJZQYX";

I started at the bottom and worked my way up from X to S, giving each new group of letters one more copy than the group below it. So Q gets only one letter in my modified alphabet because so few words start with it, while T gets four because it has more. I did group everything above 6% together because I didn’t want S to come up too often, even though more words start with it.
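
To see what the repeated letters do to the odds, here is a rough Python sketch of picking from the reworked alphabet (written without spaces). It is not the bot's actual code; `make_acronym` and `letter_odds` are hypothetical names, and the four-letter length is just an example.

```python
import random
from collections import Counter

# The reworked alphabet from the post: more copies of a letter
# means a higher chance of that letter being picked.
ALPHABET = (
    "SSSSSCCCCCPPPPPAAAAAMMMMDDDDBBBBRRRRTTTT"
    "EEEHHHFFFIIIGGGUUULLLOOONNWWVVKKJJZQYX"
)


def make_acronym(length, rng=random):
    """Pick `length` characters; every position in ALPHABET is equally likely."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def letter_odds():
    """Chance of each letter: its number of copies divided by the string length."""
    copies = Counter(ALPHABET)
    return {letter: n / len(ALPHABET) for letter, n in copies.items()}


odds = letter_odds()
print(make_acronym(4))  # a random four-letter acronym
print(f"S: {odds['S']:.1%}  T: {odds['T']:.1%}  X: {odds['X']:.1%}")
```

The string is 78 characters long, so S comes up 5 times in 78 (about 6.4%), T 4 in 78 (about 5.1%) and X 1 in 78 (about 1.3%). That is a lot flatter than the real word counts, which is the point of capping the top group.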

category code, data
October 4, 2008