tag cloud a mailing list

From: Matthew Keys 
Does anyone know how to create a tag cloud based on the body of an email? The Google gods point me in the direction of Outlook plugins, but I'm looking for something more Linux CLI scriptable; maybe something that could parse through an exported mailbox/folder.

=============================================================== From: Sean Brewer ------------------------------------------------------ If you can export the e-mails easily, the general algorithm is something like this:
1. Tokenize the words in the e-mail body.
2. Remove stop words (a, an, the, etc. You can find word lists, and libraries like NLTK have them built in).
3. Use a stemming algorithm to reduce word tokens to their, I think the correct vocabulary is, free morpheme (e.g. convert the token word "passing" to "pass").
4. Rank by frequency of result.
That should get you in the neighborhood.
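
The four steps above can be sketched in Python roughly as follows. This is only an illustration, not anyone's posted code: the stop word list is a tiny made-up sample, and `naive_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemmer such as the ones NLTK ships.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption); real lists, such as
# the one bundled with NLTK, are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in",
              "is", "it", "for", "on", "that", "this", "was", "with"}


def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def naive_stem(word):
    """Step 3: crude suffix stripping (hypothetical stand-in for a stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Do not turn "pass" into "pas".
        if suffix == "s" and word.endswith("ss"):
            continue
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def rank_terms(text, top_n=10):
    """Steps 1-4: tokenize, drop stop words, stem, rank by frequency."""
    kept = [t for t in tokenize(text) if t not in STOP_WORDS]
    stems = [naive_stem(t) for t in kept]
    return Counter(stems).most_common(top_n)


if __name__ == "__main__":
    sample = "Passing the pass: she passed and he passes."
    for term, count in rank_terms(sample, top_n=3):
        print(count, term)
```

The `(term, count)` pairs returned by `rank_terms` are what a cloud renderer would then scale into font sizes.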

=============================================================== From: Matt Keys ------------------------------------------------------ Thanks for the clues! It looks like python may be the winner this time.

=============================================================== From: Dan Lyke ------------------------------------------------------ Someone made a comment about word clouds on Facebook yesterday, and being the smartass that I am I couldn't resist: perl -e 'while (<>) { $c{$

=============================================================== From: Sean Brewer ------------------------------------------------------ Actually, you want to do something called lemmatisation, not stemming. They are related, but stemming does something slightly different; lemmatisation does what I described. I can probably whip up a dirty example with python and nltk.
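
A toy contrast between the two ideas, in plain Python rather than NLTK. Both helpers and the `LEMMAS` table are hypothetical and only meant to show the difference: a stemmer chops suffixes by rule and may produce a non-word, while a lemmatizer looks up the dictionary form.

```python
# Toy lemma table (an assumption): a real lemmatizer consults a full
# dictionary, and usually needs the part of speech as well.
LEMMAS = {"studies": "study", "ran": "run", "mice": "mouse"}


def toy_stem(word):
    """Strip a known suffix by rule; the result need not be a real word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Map a word to its dictionary form if the table knows it."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ("studies", "ran", "mice"):
        print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

The stemmer leaves irregular forms such as "ran" untouched, which is one reason lemmatised counts tend to group words more cleanly.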

=============================================================== From: Sean Brewer ------------------------------------------------------ I ran across this: https://github.com/larsmans/weighwords It might make what you want to do even easier.

=============================================================== From: Matt Keys ------------------------------------------------------ I ran across a few like that, too. I'm a bit confused as to the difference between a word cloud and a tag cloud. I'm guessing tag clouds presume that you've attached some form of tag to an example text, which the code would use to sort upon, whereas with word clouds you just point the code at a pile of text that has not been tagged/grouped?

=============================================================== From: Matt Keys ------------------------------------------------------ cat infile | tr -cs "[:alnum:]" "\n" | sort | uniq -c | sort -rn

=============================================================== From: Sean Brewer ------------------------------------------------------ Yeah, I think that's the difference. Code for the word cloud makes a cloud of the most commonly used words.

=============================================================== From: Sean Brewer ------------------------------------------------------ Yeah you could do that, but you still have to do extra processing if you want anything useful.

=============================================================== From: Sean Brewer ------------------------------------------------------ I forgot to add that you could use all that stuff to find a probable topic of a conversation, which is basically a tag. I thought that might be the direction you were heading. I could be wrong.

=============================================================== From: Sean Brewer ------------------------------------------------------ Here's an example of what I'm thinking: https://gist.github.com/4324904 It's in ruby, though. I found a neat stemmer/lemmatizer algorithm and an implementation in ruby, but not in python. Here's example output: https://gist.github.com/4324904#comment-657626

=============================================================== From: Matt Keys ------------------------------------------------------ Nice start! I started working on it in python, pointing it at an mbox source, but I keep getting hung up on the method of extraction. I can't decide if I should focus on the subject or the body... or maybe I should focus on both? The subject is usually pretty condensed to begin with and I'm thinking that'd be the smarter place to start... but it wouldn't be as thorough. The body throws in problems like possible multipart messages, strange encodings, etc. It would be interesting to see different results using a chugalug export of maybe the month of December.
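
One possible way to pull both the subject and the body out of an mbox file, using only Python's standard `mailbox` and `email` modules. This is a sketch under assumptions: the function names are made up for illustration, only `text/plain` parts are kept, and undecodable bytes are replaced rather than reported.

```python
import mailbox
from email.header import decode_header, make_header


def decoded_subject(msg):
    """Return the Subject header with RFC 2047 encoded words decoded."""
    raw = msg.get("Subject", "")
    return str(make_header(decode_header(raw)))


def plain_text_body(msg):
    """Join the text/plain parts of a message, skipping attachments."""
    chunks = []
    for part in msg.walk():  # also works for non-multipart messages
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name Python does not know
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def iter_subject_and_body(mbox_path):
    """Yield (subject, body) for every message in the mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            yield decoded_subject(msg), plain_text_body(msg)
    finally:
        box.close()
```

Keeping subject and body separate makes it easy to build one cloud from each and compare them, or to weight subject words more heavily.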

=============================================================== From: Mike Harrison ------------------------------------------------------ Laughing.. because I had thought the same thing. Add chugalugextract.txt to: http://chugalug.org and you can have a copy of the MySQL table that my bot attempts to extract from email and parse into the data that makes the Chugalug website. A little over 2k primary messages with replies.

=============================================================== From: Matt Keys ------------------------------------------------------ That'll certainly work for test data, thanks! I think this may be a good match for the Splunk for IMAP app :)