OT: Parsing

From: Marshall <marshall.spight_at_gmail.com>
Date: 29 Mar 2007 11:55:45 -0700
Message-ID: <1175194544.998893.253380_at_l77g2000hsb.googlegroups.com>

On Mar 29, 9:44 am, "Aloha Kakuikanu" <aloha.kakuik..._at_yahoo.com> wrote:
> On Mar 29, 10:28 am, "Marshall" <marshall.spi..._at_gmail.com> wrote:
> > I think the current situation with regards to BNF not being a
> > widely held technical skill has a lot to do with the historical
> > situation regarding transformation into LALR etc. That is
> > not only "labor intensive, manual" but also cognitively difficult.
> > Consider trying to explain to someone the difference between
> > shift-reduce conflicts and reduce-reduce conflicts, for example.
> This is again declarative versus procedural. BNF is declarative
> description of the language, while automata is procedural. Automaton
> processes one symbol of the at a time, and I never has been able keep
> my mind focused on the intricacies of state management and
> backtracking. BNF describes what rules a whole set of strings obeys
> to.

Really? I have found backtracking (at least in parsers) to be relatively easy to think about. I am speaking of a system in which one writes a grammar in EBNF and a backtracking parser is generated for it. I am not saying it ever surprises me, but I've generally found it easier to work with than other parser solutions.

Backtracking parsers also have the advantage of unlimited lookahead. Not that you necessarily want to make regular use of that, but occasionally a rule with some modest amount of lookahead is very, very convenient, compared to having to have the first token in every rule of a production be determining.

> > Lately we hear a lot about Earley parsing and other techniques.
> > Also, the trend lately is for scannerless parsers, or at least
> > parsers generators where the scanning and the parsing are
> > described by a unified format.
> CYK is conceptually simpler method. Maybe the reason why I like it so
> much is because in the essence CYK is calculating transitive closure.

Wow, I don't think I've come across that one before. I'll have to look it up.

> > For myself, I have high hopes for the wider adoption of parser
> > generators through the vehicle of backtracking parsers. They
> > have poor worst-case performance, but typical performance
> > is fine. In fact my understanding is that if you give an ll(1) grammar
> > to a backtracking parser generator it will generate a parser
> > as good as what you would get out of a parser generator
> > that *only* accepted ll(1). The advantage of the backtracking
> > parser is that it will accept a much wider range of grammars,
> > making it more accessible to everyman.
> It is not only the parser engine implementation. The BNF language
> should be improved as well. Two operations is good to advance the
> theory, but practice need a rich set of killer operations.


But can't this be done with just syntax? Every fancy construct in every variation of EBNF I've seen is expressible as some combination of conjunction and disjunction.

Certainly some things like default whitespace handling are desirable. Being able to switch parsing contexts (like for handling being inside comments differently) is unnecessary with scannerless parsers. Oh, and automatic AST building would be great! Having to write actions for every rule is an annoying drag, and could be fully automated.

Here's something: I want productions to be first class entities! That means that productions can composable. (And generated at runtime, although it is less clear whether this is useful.)

First class productions paves the way for parameterizable productions. This doesn't necessarily sound useful at first but once you start to use it it's awesome. There tends to be some self-similarity in grammars that is not capturable currently; BNF has no abstraction facilities.

A simple example:

one_or_more(sep, rule) : rule ( sep rule )* ;

Now the idea of "one or more with a separator" is a first-class construct.

one_or_more(",", expr) // one or more comma-separated expressions

one_or_more(";", stmt) // one or more semicolon separated statements


That "rich set of killer operations" can now be a "standard library" of parameterized productions, reducing the need to tweak the syntax too much, or to try to anticipate every grammars' future needs.

> > Another gripe I have is a common limitation with the handling
> > of different recursivity of rules. A top-down parser cannot
> > handle a left-recursive rule, a bottom-up parser can't handle
> > a right recursive rule. Left recursive rules rules naturally model
> > right associative operators; right recursive rules naturally
> > model left associative operators. Almost every language of
> > interest will have both left associative and right associative
> > operators. Yes, there are transformations you can do that
> > will turn one into the other, but those are, again, manual
> > and labor intensive, not to mention cognitively complex.
> I'm past the state where I evaluated off the shelf parsers. CYK
> algorithm is so simple, that even mediocre programmer can code it.
> Likewise, Earley. They are a bit slower, but history always proves
> that programmer time is more valuable than the machine.

I've always found recursive descent dead simple, and it's extremely easy to understand the relationship between the grammar and the implementation. Likewise backtracking recursive descent.

Have to read about CYK.

Marshall Received on Thu Mar 29 2007 - 20:55:45 CEST

Original text of this message