<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Nice start! I started working on it in
Python, pointing it at an mbox source, but I keep getting hung up on
the method of extraction. I can't decide whether to focus on the
subject, the body, or both. The subject is usually pretty condensed
to begin with, so I'm thinking that'd be the smarter place to
start, but it wouldn't be as thorough. The body throws in problems
like multipart messages and strange encodings. It would be
interesting to compare the results of each approach on a chugalug
export of, say, the month of December.<br>
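<div>A minimal sketch of pulling both the subject and the body out of an mbox file with Python's standard <code>mailbox</code> and <code>email</code> modules. The function names are hypothetical, and falling back to UTF-8 when a part has no usable charset is an assumption rather than a rule. Only <code>text/plain</code> parts are kept, so HTML-only messages would come back with an empty body.</div>
<pre>
```python
import mailbox
from email.header import decode_header, make_header


def subject_text(msg):
    """Decode an RFC 2047 encoded Subject header into plain text."""
    return str(make_header(decode_header(msg.get("Subject", ""))))


def body_text(msg):
    """Join every text/plain part, decoding with its declared charset."""
    chunks = []
    for part in msg.walk():  # walk() also yields the message itself
        if part.get_content_type() != "text/plain":
            continue
        raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if raw is None:
            continue
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        try:
            chunks.append(raw.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label
            chunks.append(raw.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def subjects_and_bodies(path):
    """Yield (subject, body) for each message in an mbox file."""
    for msg in mailbox.mbox(path):
        yield subject_text(msg), body_text(msg)
```
</pre>
<div>Running the same ranking over the subjects alone and over the bodies alone would make the two approaches easy to compare.</div>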
<br>
On 12/17/2012 11:07 PM, Sean Brewer wrote:<br>
</div>
<blockquote
cite="mid:CANEHAucoY5gJLPu6xh8DdxosScYDko3tfLBfuQqpVxYMsZk0Lw@mail.gmail.com"
type="cite">Here's an example of what I'm thinking: <a
moz-do-not-send="true" href="https://gist.github.com/4324904">https://gist.github.com/4324904</a>
<div><br>
</div>
<div>It's in ruby, though. I found a neat stemmer/lemmatizer
algorithm and an implementation in ruby, but not in python.</div>
<div><br>
</div>
<div>Here's example output: <a moz-do-not-send="true"
href="https://gist.github.com/4324904#comment-657626">https://gist.github.com/4324904#comment-657626</a><br>
<br>
<div class="gmail_quote">On Sat, Dec 15, 2012 at 8:06 PM, Sean
Brewer <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:seabre986@gmail.com" target="_blank">seabre986@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">I forgot
to add that you could use all that stuff to find the probable
topic of a conversation, which is basically a tag. I
thought that might be the direction you were heading. I
could be wrong.
<div class="HOEnZb">
<div class="h5"><br>
<br>
<div class="gmail_quote">
On Sat, Dec 15, 2012 at 1:00 PM, Sean Brewer <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:seabre986@gmail.com" target="_blank">seabre986@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Yeah, I think that's the difference. The code for the
word cloud just makes a cloud of the most commonly used
words.
<div>
<div><br>
<br>
<div class="gmail_quote">On Sat, Dec 15, 2012 at
4:34 AM, Matt Keys <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:mk6032@yahoo.com"
target="_blank">mk6032@yahoo.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div>I ran across a few like that, too.
I'm a bit confused about the difference
between a word cloud and a tag cloud.
I'm guessing a tag cloud presumes that
you've attached some form of tag to the
text, which the code then uses for
sorting, whereas for a word cloud you
just point the code at a pile of text
that has not been tagged or grouped?
<div>
<div><br>
<br>
On 12/15/2012 03:44 AM, Sean Brewer
wrote:<br>
</div>
</div>
</div>
<div>
<div>
<blockquote type="cite">I ran across
this: <a moz-do-not-send="true"
href="https://github.com/larsmans/weighwords"
target="_blank">https://github.com/larsmans/weighwords</a>
<div><br>
</div>
<div>It might make what you want to
do even easier.<br>
<br>
<div class="gmail_quote">On Sat,
Dec 15, 2012 at 2:14 AM, Sean
Brewer <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:seabre986@gmail.com"
target="_blank">seabre986@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">Actually,
you want to do something
called lemmatization, not
stemming. They are related,
but stemming does
something slightly different.
Lemmatization does what I
described.
<div> <br>
</div>
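<div>A toy illustration of the distinction, not a real implementation: the stemmer below just chops a suffix, while the lemmatizer looks the word up in a small table. The <code>LEMMAS</code> table and both function names are hypothetical stand-ins for what a library such as NLTK would provide.</div>
<pre>
```python
# Hypothetical lookup table standing in for a real dictionary such as WordNet.
LEMMAS = {"passing": "pass", "passed": "pass", "studies": "study", "better": "good"}


def stem(word):
    """Stemming: chop a known suffix, even if the result is not a word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Lemmatization: map a word to its dictionary form, if we know it."""
    return LEMMAS.get(word, word)


# Print each word next to its stem and its lemma for comparison.
for word in ("passing", "studies", "better"):
    print(word, stem(word), lemmatize(word))
```
</pre>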
<div>I can probably whip up a
dirty example with python
and nltk.
<div>
<div><br>
<br>
<div class="gmail_quote">On
Fri, Dec 14, 2012 at
10:14 AM, Sean Brewer
<span dir="ltr"><<a
moz-do-not-send="true" href="mailto:seabre986@gmail.com" target="_blank">seabre986@gmail.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<div>If you can
export the e-mails
easily, the general
algorithm is
something like
this:</div>
<div>
<div>1. Tokenize
the words in the
e-mail body.</div>
<div>2. Remove
stop words (a,
an, the, etc.
You can find
word lists, and
libraries like
NLTK have them
built in)</div>
<div>3. Use a
stemming
algorithm
to reduce word
tokens to what
I think is
correctly called
their
free morpheme
(e.g. convert
the token
"passing" to
"pass")</div>
<div>4. Rank by
frequency of
result.</div>
<div><br>
</div>
<div>That should
get you in the
neighborhood.</div>
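<div>The four steps above can be sketched in Python with only the standard library. The stop list and the suffix stripper here are deliberately tiny, hypothetical stand-ins; a real run would more likely use NLTK's stop words and a proper stemmer or lemmatizer.</div>
<pre>
```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "it"}

# Crude suffix stripping, a stand-in for a real stemmer such as Porter.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Step 1: lowercase the text and pull out runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Step 3: strip one common suffix, keeping a stem of 3+ letters."""
    if word.endswith("ss"):  # leave words like "pass" alone
        return word
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) not in (0, 1, 2):
            return stem
    return word


def rank_terms(text, top=10):
    """Steps 1-4: tokenize, drop stop words, stem, rank by frequency."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(crude_stem(w) for w in words).most_common(top)
```
</pre>
<div>For example, <code>rank_terms("Passing the pass and passed a test", top=2)</code> counts "passing", "pass" and "passed" together as one term.</div>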
<br>
<div
class="gmail_quote">
<div>On Fri, Dec
14, 2012 at
9:11 AM,
Matthew Keys <span
dir="ltr"><<a
moz-do-not-send="true" href="mailto:mk6032@yahoo.com" target="_blank">mk6032@yahoo.com</a>></span>
wrote:<br>
</div>
<blockquote
class="gmail_quote"
style="margin:0
0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<div>
<div>
<div
style="font-size:12pt;font-family:times
new roman,new
york,times,serif">Does anyone know how to create a tag cloud based on
the body of an
email? The
Google gods
point me in
the direction
of Outlook
plugins, but
I'm looking
for something
more Linux CLI
scriptable,
maybe
something that
could parse
through an
exported
mailbox/folder.<br>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>