I've not had a formal CS education. So the first time I had to write a mini comp...

kenkeenan · on Aug 26, 2011

Agreed. I don't have a CS background either and despite reading a few textbooks, I could never grok how one would write a parser for a language. After reading Crenshaw's paper, I was able to put together a quick & dirty scripting language for a text processing program I was writing.

Crenshaw's approach does have downsides; as noted in the article, there's no internal representation of the program so optimisations like short-circuit evaluation of logical operations are difficult, if not impossible...

praptak · on Aug 26, 2011

"I decided to do it by hand, for its educational value, rather than use a parser generator."

My experience was quite different. I am glad for having started with a parser generator - this got the tedious part out of the way so that I could think of how to generate code.

I later learned the theory behind various kinds of parsers, but I think that a beginner in compilers can safely leave it out.

perlgeek · on Aug 26, 2011

One of the very interesting lessons from Crenshaw's "Let's build a compiler" is that writing a simple, recursive descending parser isn't hard at all.

silentbicycle · on Aug 26, 2011

Niklaus Wirth has been saying that for quite a while. :)

His _Compiler Construction_ book is short, sweet, and has a free PDF online (http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruc...). Uses Oberon.

WalterBright · on Aug 26, 2011

I find it easier to write a parser by hand than use a parser generator. For most languages (C++ is a notable exception) writing a lexer & parser is a weekend job, and is far and away the easiest part.

pja · on Aug 26, 2011

C++ looks like it ought to be OK to parse, but turns out to be completely evil. (flamebait mode on) Bit like the rest of the language really. (flamebait apologia: I spend most of my time coding C++; sometimes I even enjoy it...)

WalterBright · on Aug 26, 2011

Just lexing C++ is a major project, because of the 'phases of translation', the preprocessor, the preprocessor tokens, etc., and then trying to make it fast.

sb · on Aug 26, 2011

I agree and there used to be (maybe it's still the case) problems with parser generators if you wanted to have good error recovery and reporting to the user. It's also very telling that sometimes--contrary to what people might expect--parsers pose a substantial problem in production systems: http://cacm.acm.org/magazines/2010/2/69354-a-few-billion-lin...:

Law: You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

buff-a · on Aug 26, 2011

Fair point. But I have been in the situation where I've got a RDP that outputs what I originally wanted, but now I want, e.g., to output an intermediate form, and I'm fucked.

So writing it isn't all that hard. But changing it after you've written it is a recipe for disaster. I did that a couple of times and looked for A Better Way.

A tool like ANTLR, OTOH, makes re-purposing (and initial development and testing) easy, but at the expense of runtime performance.

Perhaps the lesson is, "write a few RDPs to get the hang of it and then make life easy on yourself"?

kenjackson · on Aug 26, 2011

The parser is the easy part about writing a compiler, unless you design a language that is difficult to compile. Optimizing the codegen is the hard part. That's where you have to think really carefully about throughput versus code quality.

Locke1689 · on Aug 26, 2011

I wouldn't call writing a parser easy, but I would call it relatively straight-forward. Parsing code can be quite messy. My master's advisor likes to say that writing a parser is a character-building exercise and nothing but.

By the way, if you decide to write a parser, learn about Packrat parsing[1] and ignore everything else (especially stuff from the Pascal days). Back when memory was scarce it was important to build multistage compilers that optimized for low memory usage (because you had to keep the structures in memory). Nowadays that's not even a concern and packrat parsing is simpler and will be faster than almost every cute technique from 1970.

[1] http://en.wikipedia.org/wiki/Packrat_parsing

timtadh · on Aug 26, 2011

Hate to disagree with you. But having written many parsers (and even a few parser generators) it really isn't that difficult if you understand the theory. People encounter problems usually because they don't understand the theory and don't know how to write a grammar.

I have never used a PEG based parsing system so I can't speak to it. The wikipedia page seems to be riddled with inaccuracies. For instance it strongly indicates that context free grammars often end up with unresolvable ambiguities eg.

> "The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.)"

Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et all use it as an example for how to eliminate ambiguity from a language[pages 211-212 in the second edition. section 4.3.2 "Eliminating Ambiguity"].

Perhaps you could explain why you think PEG and Packrat are superior to CFG and standard parsing tools (eg. recursive descent or LALR and GLR parser generators).

Locke1689 · on Aug 26, 2011

Hate to disagree with you. But having written many parsers (and even a few parser generators) it really isn't that difficult if you understand the theory. People encounter problems usually because they don't understand the theory and don't know how to write a grammar.

From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.

Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et all use it as an example for how to eliminate ambiguity from a language[pages 211-212 in the second edition. section 4.3.2 "Eliminating Ambiguity"].

I agree that this is a bad example but you should remember from the literature that CFG is undecideably ambiguous. It's of course resolvable by GLR, but that's not the point.

For more information I'd recommend you read these papers, starting with Ford's master's thesis. [1] Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.

[1] http://pdos.csail.mit.edu/~baford/packrat/

silentbicycle · on Aug 26, 2011

The LPEG paper (http://www.inf.puc-rio.br/~roberto/docs/peg.pdf) is also pretty good. (LPEG is a PEG parsing library for Lua.)

timtadh · on Aug 26, 2011

> For more information I'd recommend you read these papers, starting with Ford's master's thesis. [1] Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.

Thanks for the link. I will definitely check it out.

> From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.

It is not clear what you think is annoying (eg. learning the theory or writing a parser). I take it from later on in the post you believe writing the parser is annoying. I don't think writing parsers is in general annoying, but there are some languages which are annoying to write a parser for.

I can't agree with your advisor's statement about it being only a "character building" exercise. There are times when you need to write a parser. However, I have to admit if I can get around it I will. For instance last year instead of writing a Java parser I hooked the OpenJDK parser and serialized out an AST. Much easier and a lot less code than a parser. I would always recommend doing something like this for an existing language. It is too easy to mess up a corner case in the grammar.

tl;dr : I think we mostly agree, I just haven't read up on PEG's and Packrat.

kragen · on Aug 26, 2011

My limited experience with PEG parsing is that, in its simplest incarnation, it's not all that fast. Packrat parsing supposedly actually makes it slower in the normal case (where you don't backtrack much).

I still highly recommend it, though. It's very simple indeed (my self-compiling PEG parser generator is 66 lines of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md) and very expressive.

Locke1689 · on Aug 26, 2011

The production ANTLR and GLR parsers are very fast, but I think packrat is just easier to get right if you're just starting out.

joeyespo · on Aug 26, 2011

Yes, you're correct.

I think the point of this though is to first enable people to write compilers without the overwhelming complexity. And only dive in deeper after they outgrow their toy language and its inefficiencies.