RM and abstract syntax trees

From: David BL <davidbl_at_iinet.net.au>
Date: Mon, 29 Oct 2007 20:06:44 -0700
Message-ID: <1193713604.283167.146850_at_e34g2000pro.googlegroups.com>



In the following I compare different techniques for representing an Abstract Syntax Tree (AST), concluding that RM is poorly suited.

In LISP, S-expressions (parenthesized lists) are well suited to representing ASTs because S-expressions naturally nest, allow for native representation of data types like 32 bit integers, and symbolic names allow for AST nodes to be tagged with their type.

S-expressions are implemented using pointers. The interesting thing is that the use of pointers is hidden from the programmer. For example, it is impossible to print the address of a cons cell.

In Prolog, the data type is called a term, and for the same reasons just mentioned, terms are well suited to representing ASTs.

In C or C++ we have a lower level means to represent ASTs using heap allocated structs or classes wired up using pointers. Various techniques can be used to avoid low level details such as memory management. As an example, a function can be written to test for value equivalence of two AST node variables, and this will check for isomorphism in the structure, but otherwise treat pointer values as insignificant.

The RM seems poorly suited to the representation of ASTs. The fundamental problem is that it is necessary for the user to assign every node of the AST a unique identifier. Let's call a spade a spade : this identifier represents a pointer value. So RM is forced to expose the equivalent of pointers directly in the representation. Furthermore, the RM has no mechanism for hiding these pointers or giving the user an interface that promotes the idea that a node logically represents a value.

This is rather paradoxical, because one of the benefits of RM over OO is meant to be the avoidance of pointers (and not their promotion)!

The inappropriateness of the RM for ASTs is highlighted when we interpret tuples as propositions or facts. Consider the following expression

(x + 1) * 3

In order to state the fact that this expression consists of the product of subexpressions '(x+1)' and '3', one must clearly assign names to the things about which we want to state facts. No wonder the RM forces us to go on a naming spree.

By contrast, in LISP we may directly represent the above expression using the following S-expression

(* (+ x 1) 3)

There is an important philosophical difference : In LISP, Prolog or C we *directly* represent the AST as an entity that is *part of the abstract computational machine*. By contrast, in the RM one is only permitted to state facts about the AST in order to represent it, almost as though the AST is some external entity that is not a part of the abstract computational machine.

Consider the following analogy: Imagine how inefficient the RM would become, if it wasn't even permitted to directly represent an integer value, but instead was forced to represent the number indirectly. For example, given the binary representation 0101 of an integer value such as 5, one could label the value with the name 'x' and represent its value indirectly by stating a fact for each '1' in the binary representation, using a separate relation for each bit position (ie power of 2). Clearly hideous, but an interesting illustration of how RM can be taken to a normal form where the only domain required allows for symbolic names.

It seems to me that the rule of thumb should be this : Do not use RM to state facts about entities that can be directly represented within the abstract computational machine.

I anticipate that this rule of thumb provides a useful insight on that rather vague notion of "semi-structured data". ie it explains exactly when and why there is data that is not suitable for direct representation in RM. Received on Tue Oct 30 2007 - 04:06:44 CET

Original text of this message