I've not had a formal CS education. So the first time I had to write a mini compiler, I decided to do it by hand, for its educational value, rather than use a parser generator. Google got me Crenshaw's paper (Lets build a compiler), and I remember it was very simple to follow his code to write my own. So yes, much recommended.
Agreed. I don't have a CS background either and despite reading a few textbooks, I could never grok how one would write a parser for a language. After reading Crenshaw's paper, I was able to put together a quick & dirty scripting language for a text processing program I was writing.
Crenshaw's approach does have downsides; as noted in the article, there's no internal representation of the program so optimisations like short-circuit evaluation of logical operations are difficult, if not impossible...
"I decided to do it by hand, for its educational value, rather than use a parser generator."
My experience was quite different. I am glad for having started with a parser generator - this got the tedious part out of the way so that I could think of how to generate code.
I later learned the theory behind various kinds of parsers, but I think that a beginner in compilers can safely leave it out.
I find it easier to write a parser by hand than use a parser generator. For most languages (C++ is a notable exception) writing a lexer & parser is a weekend job, and is far and away the easiest part.
C++ looks like it ought to be OK to parse, but turns out to be completely evil. (flamebait mode on) Bit like the rest of the language really. (flamebait apologia: I spend most of my time coding C++; sometimes I even enjoy it...)
Just lexing C++ is a major project, because of the 'phases of translation', the preprocessor, the preprocessor tokens, etc., and then trying to make it fast.
I agree and there used to be (maybe it's still the case) problems with parser generators if you wanted to have good error recovery and reporting to the user. It's also very telling that sometimes--contrary to what people might expect--parsers pose a substantial problem in production systems: http://cacm.acm.org/magazines/2010/2/69354-a-few-billion-lin...:
Law: You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.
Fair point. But I have been in the situation where I've got a RDP that outputs what I originally wanted, but now I want, e.g., to output an intermediate form, and I'm fucked.
So writing it isn't all that hard. But changing it after you've written it is a recipe for disaster. I did that a couple of times and looked for A Better Way.
A tool like ANTLR, OTOH, makes re-purposing (and initial development and testing) easy, but at the expense of runtime performance.
Perhaps the lesson is, "write a few RDPs to get the hang of it and then make life easy on yourself"?
The parser is the easy part about writing a compiler, unless you design a language that is difficult to compile. Optimizing the codegen is the hard part. That's where you have to think really carefully about throughput versus code quality.
I wouldn't call writing a parser easy, but I would call it relatively straight-forward. Parsing code can be quite messy. My master's advisor likes to say that writing a parser is a character-building exercise and nothing but.
By the way, if you decide to write a parser, learn about Packrat parsing[1] and ignore everything else (especially stuff from the Pascal days). Back when memory was scarce it was important to build multistage compilers that optimized for low memory usage (because you had to keep the structures in memory). Nowadays that's not even a concern and packrat parsing is simpler and will be faster than almost every cute technique from 1970.
Hate to disagree with you. But having written many parsers (and even a few parser generators) it really isn't that difficult if you understand the theory. People encounter problems usually because they don't understand the theory and don't know how to write a grammar.
I have never used a PEG based parsing system so I can't speak to it. The wikipedia page seems to be riddled with inaccuracies. For instance it strongly indicates that context free grammars often end up with unresolvable ambiguities eg.
> "The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.)"
Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et all use it as an example for how to eliminate ambiguity from a language[pages 211-212 in the second edition. section 4.3.2 "Eliminating Ambiguity"].
Perhaps you could explain why you think PEG and Packrat are superior to CFG and standard parsing tools (eg. recursive descent or LALR and GLR parser generators).
Hate to disagree with you. But having written many parsers (and even a few parser generators) it really isn't that difficult if you understand the theory. People encounter problems usually because they don't understand the theory and don't know how to write a grammar.
From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.
Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et all use it as an example for how to eliminate ambiguity from a language[pages 211-212 in the second edition. section 4.3.2 "Eliminating Ambiguity"].
I agree that this is a bad example but you should remember from the literature that CFG is undecideably ambiguous. It's of course resolvable by GLR, but that's not the point.
For more information I'd recommend you read these papers, starting with Ford's master's thesis. [1] Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.
> For more information I'd recommend you read these papers, starting with Ford's master's thesis. [1] Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.
Thanks for the link. I will definitely check it out.
> From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.
It is not clear what you think is annoying (eg. learning the theory or writing a parser). I take it from later on in the post you believe writing the parser is annoying. I don't think writing parsers is in general annoying, but there are some languages which are annoying to write a parser for.
I can't agree with your advisor's statement about it being only a "character building" exercise. There are times when you need to write a parser. However, I have to admit if I can get around it I will. For instance last year instead of writing a Java parser I hooked the OpenJDK parser and serialized out an AST. Much easier and a lot less code than a parser. I would always recommend doing something like this for an existing language. It is too easy to mess up a corner case in the grammar.
tl;dr : I think we mostly agree, I just haven't read up on PEG's and Packrat.
My limited experience with PEG parsing is that, in its simplest incarnation, it's not all that fast. Packrat parsing supposedly actually makes it slower in the normal case (where you don't backtrack much).
I think the point of this though is to first enable people to write compilers without the overwhelming complexity. And only dive in deeper after they outgrow their toy language and its inefficiencies.