[Chugalug] tag cloud a mailing list

Matt Keys mk6032 at yahoo.com
Wed Dec 19 11:27:11 UTC 2012

Nice start! I started working on it in Python, pointing it at an mbox 
source, but I keep getting hung up on the method of extraction. I can't 
decide whether I should focus on the subject, the body, or both. The 
subject is usually pretty condensed to begin with, so I'm thinking 
that'd be the smarter place to start, but it wouldn't be as thorough. 
The body throws in problems like possible multipart messages, strange 
encodings, etc. It would be interesting to compare the results from 
each approach using a chugalug export of, say, the month of December.
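
A minimal sketch of the extraction step, assuming Python 3 and only the
standard library (mailbox and email). The function names decode_subject,
plain_text_body, and iter_messages are made up for illustration, and the
choice to keep only text/plain parts and fall back to UTF-8 on an unknown
charset is an assumption, not something settled in this thread:

```python
import mailbox
from email.header import decode_header, make_header


def decode_subject(msg):
    """Return the Subject header as a plain string, decoding RFC 2047 words."""
    raw = msg.get("Subject", "")
    try:
        return str(make_header(decode_header(raw)))
    except (LookupError, UnicodeDecodeError):
        # Unknown or broken charset label: keep the raw header text.
        return str(raw)


def plain_text_body(msg):
    """Join every text/plain part, skipping HTML parts and attachments."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Charset name Python does not know: assume UTF-8 (a guess).
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def iter_messages(mbox_path):
    """Yield (subject, body) for each message in an mbox file."""
    for msg in mailbox.mbox(mbox_path):
        yield decode_subject(msg), plain_text_body(msg)
```

Returning the subject and body separately keeps the subject-vs-body
question open: either one, or both concatenated, can be fed to the same
word-counting code and the results compared.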

On 12/17/2012 11:07 PM, Sean Brewer wrote:
> Here's an example of what I'm thinking: https://gist.github.com/4324904
> It's in ruby, though. I found a neat stemmer/lemmatizer algorithm and 
> an implementation in ruby, but not in python.
> Here's example output: https://gist.github.com/4324904#comment-657626
> On Sat, Dec 15, 2012 at 8:06 PM, Sean Brewer <seabre986 at gmail.com 
> <mailto:seabre986 at gmail.com>> wrote:
>     I forgot to add that you could use all that stuff to find a
>     probable topic of a conversation, which is basically a tag. I
>     thought that might be the direction you were heading. I could be
>     wrong.
>     On Sat, Dec 15, 2012 at 1:00 PM, Sean Brewer <seabre986 at gmail.com
>     <mailto:seabre986 at gmail.com>> wrote:
>         Yeah, I think that's the difference. Code for the word cloud
>         makes a cloud for most commonly used words.
>         On Sat, Dec 15, 2012 at 4:34 AM, Matt Keys <mk6032 at yahoo.com
>         <mailto:mk6032 at yahoo.com>> wrote:
>             I ran across a few like that, too. I'm a bit confused as
>             to the difference between a word cloud and a tag cloud.
>             I'm guessing tag clouds presume that you've attached some
>             form of tag to an example text, which the code would use
>             to sort upon whereas word clouds you just point the code
>             to a pile of text that has not been tagged/grouped?
>             On 12/15/2012 03:44 AM, Sean Brewer wrote:
>>             I ran across this: https://github.com/larsmans/weighwords
>>             It might make what you want to do even easier.
>>             On Sat, Dec 15, 2012 at 2:14 AM, Sean Brewer
>>             <seabre986 at gmail.com <mailto:seabre986 at gmail.com>> wrote:
>>                 Actually, you want to do something called
>>                 lemmatization, not stemming. They are related,
>>                 but stemming does something slightly different.
>>                 Lemmatization does what I described.
>>                 I can probably whip up a dirty example with python
>>                 and nltk.
>>                 On Fri, Dec 14, 2012 at 10:14 AM, Sean Brewer
>>                 <seabre986 at gmail.com <mailto:seabre986 at gmail.com>> wrote:
>>                     If you can export the e-mails easily, the general
>>                     algorithm is something like this:
>>                     1. Tokenize the words in the e-mail body.
>>                     2. Remove stop words (a, an, the, etc. You can
>>                     find word lists, and libraries like NLTK have
>>                     them built in).
>>                     3. Use a stemming algorithm to reduce word tokens
>>                     to their (I think the correct term is) free
>>                     morpheme (e.g. convert the token "passing"
>>                     to "pass").
>>                     4. Rank by frequency of result.
>>                     That should get you in the neighborhood.
>>                     On Fri, Dec 14, 2012 at 9:11 AM, Matthew Keys
>>                     <mk6032 at yahoo.com <mailto:mk6032 at yahoo.com>> wrote:
>>                         Does anyone know how to create a tag cloud
>>                         based on the body of an email? The Google
>>                         gods point me in the direction of Outlook
>>                         plugins, but I'm looking for something more
>>                         Linux CLI scriptable; maybe something that
>>                         could parse through an exported mailbox/folder.
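
For reference, the four steps quoted above (tokenize, remove stop words,
stem, rank by frequency) can be sketched with only the Python standard
library. Everything named here is hypothetical: STOP_WORDS is a tiny
illustrative list, and crude_stem is a toy suffix stripper standing in
for a real stemmer or lemmatizer (NLTK's PorterStemmer or
WordNetLemmatizer would be the obvious replacements):

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption); a real run would use a
# fuller list such as the one that ships with NLTK.
STOP_WORDS = frozenset(
    "a an and are as at be but by for from i in is it of on or that the "
    "this to was with you".split()
)

# A word is a run of letters, optionally followed by one apostrophe group.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def crude_stem(token):
    """Step 3 stand-in: strip a few common suffixes.

    This is a toy, not a real stemmer or lemmatizer.
    """
    if token.endswith("ss"):
        return token  # leave words like "pass" alone
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def rank_terms(text, top_n=10):
    """Run steps 1-4 and return the top_n (term, count) pairs."""
    tokens = tokenize(text)                             # step 1
    kept = [t for t in tokens if t not in STOP_WORDS]   # step 2
    stems = [crude_stem(t) for t in kept]               # step 3
    return Counter(stems).most_common(top_n)            # step 4
```

The (term, count) pairs are the raw material for a cloud: the count can
be mapped to a font size by whatever renders it.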
