Skip to content

Unparser

Dávid Németi edited this page Sep 5, 2013 · 7 revisions

The beauty of a domain-bound grammar is that it not just can drive Irony's general parser, but also can drive Sarcasm general unparser.

What is an unparser?

An unparser is the opposite, the inverse of a parser. While a parser converts (parses) text to structure (AST), an unparser converts (unparses) structure to text.

What is an unparser good for?

For several things:

  • format/reformat your text by parsing and unparsing it
  • use advanced syntax highlight (decoration) on the unparsed text
  • provide multiple views for a domain, where the team members can view and edit the domain documents in any grammars the domain has (C# or VB), in any indent style (Allman or Kernighan&Ritchie), while keeping the source files in any fixed format

To see a demonstration of these features please see Mystique, which is a multi-view editor that provides multiple editable views to the user for the same document. Mystique is mainly a demo application to demonstrate what Irony and Sarcasm are together capable of.

How does it work?

The unparser gets an AST value (object) and a bnfterm as parameters. These are typically the root of the AST and the root bnfterm of the grammar, but they can designate any subtrees in the AST. Then the unparser examines the right side of the given bnfterm, chooses a bnfterm list of the alternative bnfterm lists, determines the AST value for each bnfterms, calls itself recursively on the "AST value - bnfterm" pairs, and concatenates the result. Actually, the unparser builds the full parse tree from the AST. The recursion ends when the unparser reaches the terminals (i.e. the leaves of the parse tree), or when it encounters a BnfiTermConversion which can yield utokens directly (i.e. it has a non-null UtokenizerForUnparse member, which can be set in the grammar; see below).

Note, that there is no need to write a separate unparsing code (except the inverse conversion for BnfiTermConversions), since the unparser is driven by the grammar.

The result of unparsing

The unparser converts structure to text, however, the result of unparsing is not a string containing the text, but rather a sequence of utokens (IEnumerable<Utoken>). These utokens are similar to the tokens in the parser, they are actually words and punctuations. They hold references to the object in the AST and the bnfterm they came from, which can be useful if you want to use the unparser in an editor. Decorations (syntax highlight) are decorated onto utokens as well.

Unparser unparser = new Unparser(grammar);
IEnumerable<Utoken> unparsedUtokens = unparser.Unparse(expression);

Of course, it you just want to get the text, it's easy:

Unparser unparser = new Unparser(grammar);
IEnumerable<Utoken> unparsedUtokens = unparser.Unparse(expression);
string unparsedText = unparsedUtokens.AsText(unparser);

or simply:

Unparser unparser = new Unparser(grammar);
string unparsedText = unparser.Unparse(expression).AsText(unparser);

Choosing between alternatives

If the right side of a grammar rule contains more than one bnfterm lists (separated by the "logical or" operator), then the unparser has to choose between them.

Automatically

The choice can be done automatically by the unparser, and most of the time it works great, so you probably never need to do it manually. The behaviour of the choice depends on the type of the bnfterm.

BnfiTermRecord

The unparser examines the AST object and tries to match the members of the object to their bound bnfterms. The best match wins.

var If = new BnfiTermRecord<D.If>();

If.Rule =
    IF
    + LEFT_PAREN
    + Expression.BindTo(If, t => t.Condition)
    + RIGHT_PAREN
    + THEN
    + Statement.BindTo(If, t => t.Body)
    |
    IF
    + LEFT_PAREN
    + Expression.BindTo(If, t => t.Condition)
    + RIGHT_PAREN
    + THEN
    + Statement.BindTo(If, t => t.Body)
    + ELSE
    + Statement.BindTo(If, t => t.ElseBody)
    ;

E.g. if we have an "if (true) then return 0;" as parsed AST value, then the unparser is going to choose the first bnfterm list (i.e. IF + LEFT_PAREN + Expression.BindTo(If, t => t.Condition) + RIGHT_PAREN + THEN + Statement.BindTo(If, t => t.Body)). If we have "if (true) then return 0; else return 1;" then it is going to choose the second bnfterm list.

BnfiTermChoice

If the type of the AST value is not equal to the domain type of the BnfiTermChoice, then that bnfterm is going to win at right side, which domain type is closer to the type of the AST value in the inheritance chain (otherwise the unparser will go deeper to make its decision):

var Expression = new BnfiTermChoice<DE.Expression>();
var BinaryExpression = new BnfiTermRecord<DE.BinaryExpression>();
var UnaryExpression = new BnfiTermRecord<DE.UnaryExpression>();
var NumberLiteral = new BnfiTermConversion<DE.NumberLiteral>();
var Expression = new BnfiTermChoice<DE.Expression>();

Expression.SetRuleOr(
    BinaryExpression,
    UnaryExpression,
    NumberLiteral,
    );

So if we have an AST value with type DE.BinaryExpression then the unparser will choose BinaryExpression bnfterm.

BnfiTermConversion

If the BnfiTermConversion has a specific value assigned to it, then the choice will be based on whether this value is equal to the AST value or not. If the BnfiTermConversion can yield utokens directly, then it will be always chosen. Otherwise the unparser will go deeper to make its choice.

Normal way

Here the domain types of the bnfterms (BinaryOperator, ADD_OP, SUB_OP, DIV_OP) are all DE.BinaryOperator, so the unparser cannot choose by BnfiTermChoice, so it goes down to the BnfiTermConversions at right side of the grammar rule:

BnfiTermConversion<DE.BinaryOperator> ADD_OP = TerminalFactoryS.CreateKeyTerm("+", DE.BinaryOperator.Add);
BnfiTermConversion<DE.BinaryOperator> SUB_OP = TerminalFactoryS.CreateKeyTerm("-", DE.BinaryOperator.Sub);
BnfiTermConversion<DE.BinaryOperator> MUL_OP = TerminalFactoryS.CreateKeyTerm("*", DE.BinaryOperator.Mul);
BnfiTermConversion<DE.BinaryOperator> DIV_OP = TerminalFactoryS.CreateKeyTerm("/", DE.BinaryOperator.Div);

var BinaryOperator = new BnfiTermChoice<DE.BinaryOperator>();

BinaryOperator.Rule = ADD_OP | SUB_OP | MUL_OP | DIV_OP;

So if we have a DE.BinaryOperator.Sub AST value, the unparser will choose the SUB_OP bnfterm.

Yield utokens directly

You can decide to yield utokens directly for a certain BnfiTermConversion. So instead of this (which yields utokens for each namespace part and each dot character):

NamespaceName.Rule =
    IDENTIFIER
    .PlusList(DOT)
    .ConvertValue(
        _identifiers => new NameRef(string.Join(DOT.Text, _identifiers)),
        _nameRef => _nameRef.Value.Split(new string[] { DOT.Text }, StringSplitOptions.None)
    );

you could write this (which will yield only one utoken for the whole namespace string):

NamespaceName.Rule =
    IDENTIFIER
    .PlusList(DOT)
    .ConvertValue(
        _identifiers => new NameRef(string.Join(DOT.Text, _identifiers)),
        NoUnparseByInverse<NameRef, IEnumerabe<string>>()
    );

NamespaceName.UtokenizerForUnparse =
    (formatProvider, reference, astValue) => new [] { UtokenValue.CreateText(nameRef.Value, reference) };

Manually

If, for any reason, the automatic way how the unparser chooses between the alternatives does not suit you, you can do it manually by overriding the automatic behaviour. To do this you have to use an UnparseHint by calling the method SetUnparsePriority from inside one or more alternatives in your grammar rule.

You can calculate an int? number by examining the astValue and the childAstValues if needed, and return with it. The choice will be the highest non-null priority (the first highest, if there are more equal highest); if there is not any, then the automatic behaviour will be applied by skipping those alternatives that have null priorities.

So if you return a non-null number then this number will be compared to other non-null numbers in other alternatives in your grammar rule, and the highest will win. If you return null then that alternative will not be chosen.

This feature is used by Sarcasm' JsonGrammar where the NUMBER is a typeless numberliteral, thus its domain type is object. The problem is that we only want to unparse an object as numberliteral if the object is an int or double. The other problematic situation is when we have the null value, in this case we want to unparse it as a null constant.

Object.Rule =
    NUMBER + SetUnparsePriority((astValue, childAstValues) => astValue is int || astValue is double ? (int?)1 : null)
    |
    STRING
    |
    BOOLEAN
    |
    Array
    |
    NULL + SetUnparsePriority((astValue, childAstValues) => astValue == null ? (int?)1 : null)
    |
    OBJECT_BEGIN
    + KeyValuePairs.ConvertValue(KeyValuePairsToObject, ObjectToKeyValuePairs)
    + OBJECT_END
    ;

Unparsing expressions

When an expression is in the "tree structure" the structure itself determines the operands of the operators, or in other words: the "scope" of the operators.

Example

Add(Mul(1, 2), 3) means that the Mul operator has 1 and 2 as operands, while the Add operator has the value of the Mul(1, 2) expression and 3 as operands.

Mul(Add(1, 2), 3) means that the Add operator has 1 and 2 as operands, while the Mul operator has the value of the Add(1, 2) expression and 3 as operands.

Now, when we are unparsing this tree structure into text, we have to choose a notation (the notation is determined by the grammar rules). The most prevalent notation is the infix notation (the example grammars determine infix notation), but the notation can be prefix, postfix or function-style as well (function-style is the notation that were used in the example above since it perfectly visualizes the tree structure).

You can find a complete, working Expression example in the MiniPL project. You can find there an Expr domain, and four grammars for it: GrammarInfix, GrammarPrefix, GrammarPostfix and GrammarFunc (function-style).

So, the problem with unparsing an expression in the infix notation is that you have to parenthesize the expressions to preserve and visualize the "scopes" of the operators that were encoded into the tree structure.

The unparsed text representation of the Add(Mul(1, 2), 3) expression is "1 * 2 + 3", so no parentheses are needed.

On the other hand, the unparsed text representation of the Mul(Add(1, 2), 3) expression is "(1 + 2) * 3", so parentheses are needed here, since "1 + 2 * 3" would mean Add(1, Mul(2, 3)), which is different than our original expression.

If we want to deal with (it means parsing or unparsing) the text representation of an expression in infix notation then we should assign precedences and associativities to the operators (note that if the expression is just represented in the tree structure, or in postfix, prefix or function-style notation in text format, then the precedences and associativities are not needed at all), and use parentheses to "override" the default precedences and associativities if needed.

Regarding the precedences and associativities there are two main ways of defining expressions in a grammar: the implicit and the explicit way.

Implicit precedence and associativity

The precedence and associativity of the operators are implicitly coded into the grammar by the "depth" and order of the bnfterms (domain-grammar bindings are omitted for brevity):

Expression.Rule =
    Expression + ADD_OP + Term
    |
    Expression + SUB_OP + Term
    |
    Term;

Term.Rule =
    Term + MUL_OP + Factor
    |
    Term + DIV_OP + Factor
    |
    Factor;

Factor.Rule =
    Exponent + POW_OP + Factor
    |
    Exponent;

Exponent.Rule =
    LEFT_PAREN + Expression + RIGHT_PAREN
    |
    NumberLiteral;

In this case the unparser (just as the parser) can do its job without further information, and the expression will be properly parenthesized.

Explicit precedence and associativity

The precedence and associativity of the operators are explicitly given as an extra information to the grammar (domain-grammar bindings are omitted for brevity):

Expression.Rule =
    Expression + ADD_OP + Expression
    |
    Expression + SUB_OP + Expression
    |
    Expression + MUL_OP + Expression
    |
    Expression + DIV_OP + Expression
    |
    Expression + POW_OP + Expression
    |
    LEFT_PAREN + Expression + RIGHT_PAREN
    |
    NumberLiteral;

RegisterOperators(10, Associativity.Left, ADD_OP, SUB_OP);
RegisterOperators(20, Associativity.Left, MUL_OP, DIV_OP);
RegisterOperators(30, Associativity.Right, POW_OP);

In this case the unparser (just as the parser) needs the extra information provided by RegisterOperators to do its job and properly parenthesize the expressions. However, the unparser needs another extra information to know whether what are the parentheses keyterms that belong to Expression bnfterm. This can be done by overriding the GetUnparseControl method in the grammar, and creating a ParenthesizedExpression to provide the information for the unparser:

protected override UnparseControl GetUnparseControl()
{
    return UnparseControl.Create(
        this,
        new Formatter(this),
        new ParenthesizedExpression(Expression, LEFT_PAREN, RIGHT_PAREN)
        );
}

If you are interested in how to format the unparser's output, continue with Formatters.