Open sourcing our email signature parsing library

chatmasta · on July 25, 2014

This is cool. It's great to see so many small-to-medium-size projects implementing machine learning algorithms. It wasn't that long ago that this was far beyond the reach of an average programmer. Libraries like numpy/scikit, pyml, and their counterparts in other languages have made this far more prevalent.

I recently helped my girlfriend (bioinformatics PhD! /brag) implement an SVM in Python for feature selection of 20k genes, to determine which could be used to classify a tumorous cell. I was amazed that with less than 20 lines of Python, she had a fast, functional SVM that could classify with ~80% accuracy. That's damn impressive.

In general the scientific community is not filled with the highest quality programmers, so it's great that we are seeing development in easily accessible ML toolkits, because they enable sub-par scientific programmers to employ modern ML algorithms without needing to know the details of how to implement them.

Tl;dr Cool project, glad to see machine learning libraries taking off, will lead to curing cancer :)

mqsiuser · on July 25, 2014

> In general the scientific community is not filled with the highest quality programmers

Some will think differently. Peer groups tend to think they are the greatest (academia, doing a PhD... I know of some with very good self-esteem).

wiml · on July 24, 2014

The example with dashes made me chuckle, because there actually is a standard for setting off a signature --- two dashes with one trailing space. Searching for this works tolerably well in my experience; if it's in a message and not more than X lines from the end, it's a pretty reliable indicator.

Where Mailgun's library looks really useful to me is in parsing HTML mail (ironic that a supposed semantic markup language makes this problem harder to solve, but that's the net for you...)
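A minimal sketch of the delimiter heuristic described above: accept a "-- " line as the start of a signature only if it sits within the last few lines of the message. The function name and the `max_lines_from_end` default of 10 are assumptions for illustration (the comment only says "X lines"); this is not Mailgun's implementation.

```python
def find_signature_start(lines, max_lines_from_end=10):
    """Return the index of the "-- " delimiter line, or None if not found."""
    # Scan backwards so the delimiter closest to the end wins.
    for i in range(len(lines) - 1, -1, -1):
        # Stop once we are further from the end than the allowed window.
        if len(lines) - 1 - i > max_lines_from_end:
            break
        if lines[i] == "-- ":
            return i
    return None


message = "Hi,\n\nSee you Monday.\n-- \nAlice\nalice@example.com"
lines = message.split("\n")
idx = find_signature_start(lines)
body = lines[:idx] if idx is not None else lines
```

A delimiter that appears far above the end of the message is ignored, which is what keeps the heuristic from cutting long bodies by accident.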

michaelmior · on July 25, 2014

How is this a standard? I agree that it's fairly common, but I see a lot of different variations just in my own inbox.

LukeShu · on July 25, 2014

It is literally a "proposed standard": http://tools.ietf.org/html/rfc3676#section-4.3

e12e · on July 25, 2014

Indeed. Checking for a signature in email is pretty easy[1]: look for eol-dash-dash-space-eol ("^-- $"), if found, cut there. If not found, the sender and/or email client can't be trusted to compose proper email -- don't attempt automatic signature stripping, and forward the whole (probably top-posted) mess.

I'm only half-joking.

[1] Because, if "proper" quoting is used and the client can't properly strip signatures, at least those included signatures will be "^>+ -- $", not "^-- $". Now, assuming a proper email client/user, those sigs should've been stripped anyway... But such an assumption is likely to lead to tears and unhappiness...

It even works for "proper" top-posters (as if there was such a thing) -- because the last reply will come first, followed by a dash-dash-space delimited signature, followed by all the stuff you'd typically strip out in a reply.
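The approach above could be sketched roughly like this in Python. It assumes the message is plain text with line endings already normalized to "\n"; `strip_signature` is a hypothetical helper name, and cutting at the first match is a choice made here to fit the top-posting case described above.

```python
import re

# A line consisting of exactly dash, dash, space.
# With re.MULTILINE, ^ and $ match at the start and end of every line.
SIG_DELIMITER = re.compile(r"^-- $", re.MULTILINE)


def strip_signature(text):
    """Cut the message at the first "-- " line; return it unchanged if none."""
    match = SIG_DELIMITER.search(text)
    if match is None:
        # No delimiter found: do not guess, forward the whole message.
        return text
    return text[:match.start()]
```

Because the pattern is anchored at the start of the line, a quoted signature such as "> -- " is left alone, and so is a bare "--" without the trailing space.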

Brad2earth · on July 25, 2014

My team has parsed over a billion emails in the past 3 years, auto-updating our clients' address books.

The "-- " is indeed the standard and the most common ever since Usenet in '94, but of course we've built a ton of variation into our algorithms to handle everything else you might see.

Feel free to check out our infographic on what you'll find in the average professional's email signature: http://www.evercontact.com/blog/infographic-the-anatomy-of-a...

8_hours_ago · on July 25, 2014

FYI: the infographic shows the delimiter as "--" instead of "-- "

e12e · on July 25, 2014

As I mentioned elsewhere, that's more than a small nitpick, because something like:

    XI
    --
    Chapter XI...

Is perfectly fine in text-email -- so adding the space on the end is very useful -- as you'd rarely need to escape "-- " on a line by itself -- not so for just "--".
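To make the difference concrete, here is a small sketch comparing the two delimiter choices on a message shaped like the example above. `cut_at` is a hypothetical helper, and the sample lines are made up for illustration.

```python
def cut_at(lines, delimiter):
    """Return the lines before the first line equal to `delimiter`."""
    if delimiter in lines:
        return lines[:lines.index(delimiter)]
    return lines


text_lines = ["XI", "--", "Chapter XI...", "-- ", "e12e"]

# Bare "--" cuts at the sub-header underline and loses the body.
print(cut_at(text_lines, "--"))   # ['XI']

# "-- " only matches the real signature delimiter.
print(cut_at(text_lines, "-- "))  # ['XI', '--', 'Chapter XI...']
```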

BorisMelnik · on July 25, 2014

figure dash, em dash, en dash or horizontal bar?

wiml · on July 25, 2014

Hyphen/minus. ASCII 0x2D or EBCDIC 0x60.

byoung2 · on July 25, 2014

It would be nice if email just had a separate signature field (e.g. subject, body, signature)

orliesaurus · on July 24, 2014

A little gem that solves a royal pain in the back, one that anyone working with email data can relate to.

wyred · on July 25, 2014

Should we be concerned that they are looking at our email contents in order to build this system?

  We did a lot of research, looked at all the variations of
  email that passes through Mailgun

codezero · on July 25, 2014

Probably not. There are enough public mailing lists to use as training sets, I think. They could have used company mail to train on as well.

djyaz1200 · on July 24, 2014

We use Mailgun and their parsing tool @ sendsmart.com and we LOVE IT! Thanks Mailgun!!

aantix · on July 24, 2014

It's fantastic that they are open sourcing this. I've tried stripping out email quotations in the past and it's definitely a hard problem.

ape4 · on July 24, 2014

I guess there is the "standard" of two dashes before the sig

e12e · on July 25, 2014

To clarify, the nice thing about dash-dash-space, rather than simply dash-dash ("-- " vs "--"), is that the former will very rarely need escaping, because you could conceivably delimit parts of a message by two dashes, like:

    IX
    --
    Chapter IX.

Since that "--" line has no trailing space, everything below your "sub-header" won't be cut.

aeroevan · on July 24, 2014

two dashes and a space before your ~/.signature :)