This is cool. It's great to see so many small-to-medium-sized projects implementing machine learning algorithms. It wasn't that long ago that this was far beyond the reach of an average programmer. Libraries like numpy/scikit, pyml, and their counterparts in other languages have made it far more common.
I recently helped my girlfriend (bioinformatics PhD! /brag) implement an SVM in Python for feature selection of 20k genes, to determine which could be used to classify a tumorous cell. I was amazed that with fewer than 20 lines of Python, she had a fast, functional SVM that could classify with ~80% accuracy. That's damn impressive.
In general the scientific community is not filled with the highest quality programmers, so it's great that we are seeing development in easily accessible ML toolkits, because they enable sub-par scientific programmers to employ modern ML algorithms without needing to know the details of how to implement them.
Tl;dr Cool project, glad to see machine learning libraries taking off, will lead to curing cancer :)
The example with dashes made me chuckle, because there actually is a standard for setting off a signature --- two dashes with one trailing space. Searching for this works tolerably well in my experience; if it's in a message and not more than X lines from the end, it's a pretty reliable indicator.
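For what it's worth, here is a rough sketch of that heuristic in Python. The function name and the `max_sig_lines` cutoff (standing in for the "X" above) are my own assumptions, not anything from Mailgun's library:

```python
def split_signature(body, max_sig_lines=10):
    """Split body at the last "-- " line, but only if it sits near the end.

    max_sig_lines is an assumed cutoff standing in for "X lines from the end".
    Returns (text, signature); signature is "" when nothing was cut.
    """
    lines = body.splitlines()
    # Walk backwards so the last delimiter in the message wins.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i] == "-- ":
            signature = lines[i + 1:]
            if len(signature) <= max_sig_lines:
                return "\n".join(lines[:i]), "\n".join(signature)
            break  # delimiter is too far from the end; treat it as content
    return body, ""
```

A bare "--" line is left alone, and so is a "-- " line with more than `max_sig_lines` lines after it.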
Where Mailgun's library looks really useful to me is in parsing HTML mail (ironic that a supposed semantic markup language makes this problem harder to solve, but that's the net for you...)
Indeed. Checking for a signature in email is pretty easy[1]: look for eol-dash-dash-space-eol ("^-- $"), if found, cut there. If not found, the sender and/or email client can't be trusted to compose proper email -- don't attempt automatic signature stripping, and forward the whole (probably top-posted) mess.
I'm only half-joking.
[1] Because, if "proper" quoting is used and the client can't properly strip signatures -- at least those included signatures will be "^>+ -- $", not "^-- $". Now assuming a proper email client/user, those sigs should've been stripped anyway... But such an assumption is likely to lead to tears and unhappiness anyway...
It even works for "proper" top-posters (as if there was such a thing) -- because the last reply will come first, followed by a dash-dash-space delimited signature, followed by all the stuff you'd typically strip out in a reply.
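A minimal sketch of the "cut at the delimiter or leave it alone" rule, in Python. The names are made up for illustration, and it assumes line endings have already been normalized to "\n":

```python
import re

# A line consisting of exactly dash-dash-space.
SIG_DELIMITER = re.compile(r"^-- $", re.MULTILINE)


def strip_signature(body):
    """Return (text, found). Cut at the first "-- " line if there is one.

    Cutting at the first match is what makes the top-posting case work:
    the new reply comes first, then its signature, then the quoted mess.
    """
    match = SIG_DELIMITER.search(body)
    if match is None:
        # No delimiter: don't guess, pass the message through untouched.
        return body, False
    return body[:match.start()], True
```

Quoted signatures such as "> -- " don't match, because the pattern is anchored at the start of the line.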
My team has parsed over a billion emails in the past 3 years to auto-update our clients' address books.
The "-- " is indeed the standard and has been the most common form ever since Usenet in '94, but of course we've built a ton of variation into our algorithms to handle everything else you might see.
As I mentioned elsewhere, that's more than a small nitpick, because something like:
XI
--
Chapter XI...
is perfectly fine in plain-text email, so the trailing space is very useful: you'd rarely need to escape "-- " on a line by itself, which is not true of just "--".
To clarify, the nice thing about dash-dash-space, rather than simply dash-dash ("-- " vs "--") is that the former will very rarely need escaping (because you could conceivably delimit parts of a message by two dashes, like:
IX
--
Chapter IX.
Because that "--" has no trailing space, it isn't treated as a signature delimiter, so everything below your "sub-header" won't be cut).
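To make the difference concrete, here's a tiny Python check run on the chapter example with a real signature tacked on the end (the helper name and the sample message are just for illustration):

```python
def is_sig_delimiter(line):
    """True only for dash-dash-space; a bare "--" does not count."""
    return line == "-- "


message = "IX\n--\nChapter IX.\n-- \nJane"

# Indexes of the lines that would be treated as a signature delimiter.
delimiters = [
    i for i, line in enumerate(message.split("\n")) if is_sig_delimiter(line)
]
```

Only the "-- " line before "Jane" is picked up; the "--" under "IX" is left alone.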