We’ve all seen tag clouds. I even considered putting one on my site, but I realized that tag clouds don’t really look nice. The other day I discovered Wordle.
Wordle is a toy for generating “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes.
I started playing about with it and was quite impressed. I put in the RSS feed for this site and my delicious bookmarks.
andrewjaswa.com

ajaswa’s delicious

The way Wordle displays the collected words is rather interesting, more so than a straight tag cloud. Now to create something like this on the fly and embed it on a site…
February 8, 2009
A little bit ago I built a Twitter bot called AcroPoll. It's a fairly simple bot: all it does is generate a short string of characters randomly selected from the alphabet and tweet it. It’s part of a game my friends and I play where someone makes up an acronym and the others rattle off words that fit. We were finding that it picked Z, X, and Y just as frequently as the other letters. This posed a problem, since everyone “knows” there are fewer words that start with Z, X, and Y than with the rest. Right? Well, sure, if you throw Q in there too.
With the help of some friends I’ve compiled a count of words by first letter. (Of course, this is for US English.)
s : 31675 10.856%
c : 25994 8.909%
p : 23936 8.204%
a : 17704 6.068%
m : 17330 5.940%
d : 16463 5.643%
b : 16076 5.510%
r : 15406 5.280%
t : 15127 5.185%
h : 11510 3.945%
e : 11457 3.927%
f : 10441 3.579%
i : 10346 3.546%
g : 9899 3.393%
u : 9272 3.178%
l : 9263 3.175%
o : 9092 3.116%
n : 7445 2.552%
w : 6584 2.257%
v : 4747 1.627%
k : 4484 1.537%
j : 3041 1.042%
z : 1464 0.502%
q : 1446 0.496%
y : 1206 0.413%
x : 355 0.122%
I think what I find interesting is how crowded the 3% range is: eight letters (H, E, F, I, G, U, L, and O) all land there. It makes sense when you think about it, but actually seeing it is something else.
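For anyone who wants to build a table like this themselves, here is a minimal sketch in Python (not the language the bot is written in). The sample word list is made up for illustration. A real run would feed in a full dictionary file, one word per line, and which dictionary you use is an assumption that will shift the numbers a bit.

```python
from collections import Counter


def first_letter_counts(words):
    """Count words by first letter, ignoring case and anything not starting with a-z."""
    counts = Counter()
    for word in words:
        word = word.strip().lower()
        if word and "a" <= word[0] <= "z":
            counts[word[0]] += 1
    return counts


def format_table(counts):
    """Return 'letter : count percent%' lines, most common letter first."""
    total = sum(counts.values())
    return [
        f"{letter} : {n} {100 * n / total:.3f}%"
        for letter, n in counts.most_common()
    ]


# Hypothetical sample data; swap in the lines of a real word list.
sample = ["apple", "Avocado", "banana", "cherry", "cat", "cow", "42nd"]

for line in format_table(first_letter_counts(sample)):
    print(line)
```

The "42nd" entry is skipped because it doesn't start with a letter, so the percentages are taken over the six remaining words.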
I started out with an alphabet much like this:
$alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
But after reworking based on the above data I’ve come up with this:
$alphabet = "SSSSSCCCCCPPPPPAAAAAMMMMDDDDBBBBRRRRTTTTEEEHHHFFFIIIGGGUUULLLOOONNWWVVKKJJZQYX";
I started at the bottom and worked my way up from X to S. Each step up in group level earned roughly one more vote. So Q gets only one letter in my modified alphabet because so few words start with it, while T gets four because it has more. I did group everything above 6% together because I didn’t want S to come up too often, even though more words start with it.
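To see what the repeated letters actually do to the odds, here is a rough sketch in Python (again, not the bot's own code). The constant is the reworked alphabet written as one unbroken string. The function names and the acronym length of 4 are my own placeholders, not anything AcroPoll is known to use.

```python
import random
from collections import Counter

# The reworked alphabet: each letter repeated once per "vote" (78 characters total).
WEIGHTED_ALPHABET = (
    "SSSSSCCCCCPPPPPAAAAAMMMMDDDDBBBBRRRRTTTTEEEHHHFFFIIIGGG"
    "UUULLLOOONNWWVVKKJJZQYX"
)


def make_acronym(length, alphabet=WEIGHTED_ALPHABET, rng=random):
    """Pick `length` characters uniformly from the string, so repeats raise a letter's odds."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def letter_odds(alphabet=WEIGHTED_ALPHABET):
    """Return each letter's chance of being picked on a single draw."""
    total = len(alphabet)
    return {letter: n / total for letter, n in Counter(alphabet).items()}


odds = letter_odds()
print(make_acronym(4))  # hypothetical length; output is random
print(f"S: {odds['S']:.1%}  T: {odds['T']:.1%}  Q: {odds['Q']:.1%}")
```

With the plain 26-letter alphabet every letter has a 1 in 26 chance, about 3.8%. With the weighted string, S rises to 5 in 78 (about 6.4%) and X, Y, Z, and Q each drop to 1 in 78 (about 1.3%), so the rare letters still show up, just not as often.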
October 4, 2008