The example with dashes made me chuckle, because there actually is a standard for setting off a signature --- two dashes with one trailing space. Searching for this works tolerably well in my experience; if it's in a message and not more than X lines from the end, it's a pretty reliable indicator.
Where Mailgun's library looks really useful to me is in parsing HTML mail (ironic that a supposed semantic markup language makes this problem harder to solve, but that's the net for you...)
Indeed. Checking for a signature in email is pretty easy[1]: look for eol-dash-dash-space-eol ("^-- $"); if found, cut there. If not found, the sender and/or email client can't be trusted to compose proper email -- don't attempt automatic signature stripping, and forward the whole (probably top-posted) mess.
I'm only half-joking.
[1] Because, if "proper" quoting is used and the client can't properly strip signatures -- at least those included signatures will be "^>+ -- $", not "^-- $". Now assuming a proper email client/user, those sigs should've been stripped anyway... But such an assumption is likely to lead to tears and unhappiness...
It even works for "proper" top-posters (as if there was such a thing) -- because the last reply will come first, followed by a dash-dash-space delimited signature, followed by all the stuff you'd typically strip out in a reply.
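A rough sketch of the "cut at the first dash-dash-space line, otherwise leave it alone" rule described above, in Python. The function name strip_signature and the CRLF normalization are my own assumptions for illustration, not anything taken from Mailgun's library:

```python
import re

# A line consisting of exactly dash, dash, space. MULTILINE makes ^ and $
# match at the start and end of every line, not just the whole string.
SIG_DELIMITER = re.compile(r"^-- $", re.MULTILINE)


def strip_signature(body: str) -> str:
    """Cut the body at the first "-- " line; if there is none, don't cut."""
    # Assumption: normalize CRLF first so "$" lines up with each line end.
    text = body.replace("\r\n", "\n")
    match = SIG_DELIMITER.search(text)
    if match is None:
        # No delimiter found: don't guess, pass the whole message through.
        return text
    return text[:match.start()]
```

A quoted signature line such as "> -- " starts with ">" and so never matches, which is the point of the footnote: only the sender's own delimiter triggers the cut. Note that this only works if nothing along the way has trimmed trailing whitespace from the message.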
My team has parsed over a billion emails in the past 3 years to auto-update our clients' address books.
The "-- " is indeed the standard, and the most common delimiter ever since Usenet in '94, but of course we've built a ton of variation into our algorithms to handle everything else you might see.
As I mentioned elsewhere, that's more than a small nitpick, because something like:
XI
--
Chapter XI...
Is perfectly fine in text-email -- so adding the space on the end is very useful -- as you'd rarely need to escape "-- " on a line by itself -- not so for just "--".
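To make the difference concrete, here is a small Python check run against the chapter example above. The looser "^--$" pattern is a hypothetical stand-in for a matcher that ignores the trailing space; neither pattern is claimed to be what any particular library uses:

```python
import re

# The text from the example: a bare "--" line used as a divider, not a sig.
text = "XI\n--\nChapter XI...\n"

bare = re.compile(r"^--$", re.MULTILINE)     # hypothetical: no trailing space
strict = re.compile(r"^-- $", re.MULTILINE)  # dash-dash-space

bare_hit = bare.search(text) is not None      # True: would cut "Chapter XI..."
strict_hit = strict.search(text) is not None  # False: text is left alone
```

The bare pattern treats the divider as a signature delimiter and would throw away everything after it; the strict one does not.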