I’ve downloaded all of my past LJ entries (since May 2003) to my desktop (you can download yours here). The files total 1.2MB of text. I’m going to do some text analysis on them, just for fun (no laughing!). I wrote a small Python script to parse the files and count words. Here are my most common words, with counts:
the: 6314
a: 6013
i: 5701
to: 4854
of: 3198
and: 3067
it: 2244
in: 2058
that: 1812
for: 1782
is: 1514
my: 1489
This list is somewhat misleading because I haven’t stripped out the HTML yet, so the counts for “a” and “i” are higher than they should be.
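For illustration, here is a minimal sketch of how a count like this could be done with the HTML stripped first. It is not the script that produced the numbers above. The names `TextExtractor`, `strip_html`, and `count_words` and the sample string are made up for the example, and it assumes the inflation comes from tags such as `<a>` and `<i>` being counted as words.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps the text between tags and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for each run of text outside a tag.
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of an HTML fragment, without markup."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def count_words(text):
    """Count lowercase words made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Made-up sample entry, not taken from the real journal files.
sample = '<a href="http://example.com">I</a> wrote <i>a</i> script, and a test.'

print(count_words(sample).most_common(3))              # tags counted as words
print(count_words(strip_html(sample)).most_common(3))  # tags removed first
```

On this sample, the raw count sees “a” four times and “i” three times, while the stripped count sees them two times and once.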
This is just a proof of concept… I’ve got a lot of ideas for more interesting things to do with the data. Any ideas for things to explore? Word bigrams are at the top of my list at the moment, but parsing that well is a bit more complicated than simple word extraction.
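As a rough sketch of the bigram idea, the snippet below counts adjacent word pairs. The function name `bigram_counts` and the sample text are invented for the example. One of the complications is that pairs should not run across sentence boundaries, so this version assumes a sentence ends at `.`, `!`, or `?`, which will mis-split abbreviations and similar cases.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs, never pairing words across sentences."""
    counts = Counter()
    # Naive sentence split on ., ! and ? (an assumption; real text needs more care).
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        # zip(words, words[1:]) yields each word together with its successor.
        counts.update(zip(words, words[1:]))
    return counts


# Made-up sample text, not taken from the real journal files.
sample = "I wrote a script. I wrote a parser! The script works."

for pair, n in bigram_counts(sample).most_common(3):
    print(" ".join(pair), n)
```

Because of the sentence split, “script” followed by “I” is not counted as a pair here, even though the two words are adjacent in the raw text.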
April 18, 2007