How to lex Python-style indentation using INDENT and DEDENT tokens #55

juliankrispel · 2017-07-15T10:35:59Z

I'm guessing moo isn't designed for this but I thought I'd ask anyway in case I'm missing something.

Basically, I'd like to implement whitespace-sensitive block scoping, as in Python, YAML, CoffeeScript, etc.

But I've no idea where to start.

deltaidea · 2017-07-15T12:30:14Z

The goal of a tokenizer is to give you a long, flat list of tokens. The words "block" and "scoping" belong to a higher level of abstraction, where you iterate over the tokens and assemble a nested structure known as a syntax tree. A simple solution comes to mind, though maybe not the best one:

1. Lex each run of consecutive spaces into a single token.
2. When you iterate and see a newline token followed by spaces, compare the length of those spaces with the remembered length from the previous line, and decide whether you need to create a new object in the tree.
3. Once you have a tree, you can walk over it to transform, lint, compile, whatever.
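A rough sketch of the comparison step in the list above. To stay self-contained it works on raw lines instead of moo tokens, and the names `buildTree`, `text`, `indent` and `children` are made up for illustration. It is also lenient: a dedent to a width that never appeared before is attached to the nearest shallower line rather than reported as an error.

```javascript
// Build a nested tree from indented lines by comparing each line's
// indentation with the indentation of the lines above it.
// Hypothetical helper, not part of moo.
function buildTree(source) {
  // The root sits at indent -1 so it is never popped off the stack.
  const root = {text: null, indent: -1, children: []}
  const stack = [root] // chain of nodes that are still "open"

  for (const line of source.split(/\r?\n/)) {
    if (line.trim() === '') continue // blank lines carry no structure

    const indent = line.match(/^[ \t]*/)[0].length
    const node = {text: line.trim(), indent, children: []}

    // Close every open node that is not shallower than this line.
    while (stack[stack.length - 1].indent >= indent) stack.pop()

    stack[stack.length - 1].children.push(node)
    stack.push(node)
  }
  return root.children
}

const tree = buildTree(`
if this
  if that
    another
else
  there
`)
console.log(JSON.stringify(tree, null, 2))
```

Once the tree exists, walking `children` recursively is enough to transform or lint it.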

juliankrispel · 2017-07-15T13:51:15Z

@deltaidea I know what a lexer is :/ no need to explain.

But lexers can also be context aware. For example, let's say I capture the whitespace at the beginning of each line as indent.

What I'd then want to do is have a token for INDENT_OUT and INDENT_IN. That'd make building a whitespace sensitive grammar a lot easier.

nathan · 2017-07-15T19:20:18Z

Here ya go.

Copied to a gist too.

const moo = require('moo')

const lexer = moo.compile({
  ws: /[ \t]+/,
  nl: { match: /(?:\r\n?|\n)+/, lineBreaks: true },
  id: /\w+/,
})

// example

const tokens = indented(lexer, `
if this
  if that
    another
else
  there
`)

for (const tok of tokens) console.log(tok)

// implementation

function* indented(lexer, source) {
  let iter = peekable(lexer.reset(source))
  let stack = []

  // absorb initial blank lines and indentation
  let indent = iter.nextIndent()

  for (let tok; tok = iter.next(); ) {
    if (tok.type === 'nl') {
      const newIndent = iter.nextIndent()
      if (newIndent == null) break // eof

      if (newIndent === indent) {
        yield {type: 'nl'}

      } else if (newIndent > indent) {
        stack.push(indent)
        indent = newIndent
        yield {type: 'indent'}

      } else {
        while (newIndent < indent) {
          indent = stack.pop()
          yield {type: 'dedent'}
        }
        if (newIndent !== indent) {
          throw new Error('inconsistent indentation')
        }
      }
      indent = newIndent

    // ignore whitespace within lines
    } else if (tok.type !== 'ws') {
      yield tok
    }
  }

  // dedent remaining blocks at eof
  for (let i = stack.length; i--;) {
    yield {type: 'dedent'}
  }
}

function peekable(lexer) {
  let here = lexer.next()
  return {
    next() {
      const old = here
      here = lexer.next()
      return old
    },
    peek() {
      return here
    },
    nextIndent() {
      for (let tok; tok = this.peek(); ) {
        if (tok.type === 'nl') {
          this.next()
          continue
        }
        if (tok.type === 'ws') {
          const indent = tok.value.length
          this.next()

          const next = this.peek()
          if (!next) return
          if (next.type === 'nl') {
            this.next()
            continue
          }
          return indent
        }
        return 0
      }
    },
  }
}

juliankrispel · 2017-07-15T19:27:26Z

Holy crap @nathan - thank you 😍

I wasn't expecting something so fleshed out. I hope it was a copy and paste job. Thank you!

For anyone interested, this is what I'm working on - https://github.com/juliankrispel/bishbosh

Still looking for feedback btw ✌️

tjvr · 2017-07-15T20:09:26Z

Leaving this open, because I'd like to save your code somewhere @nathan for my own use if nothing else! :-)

nathan · 2017-07-15T20:28:06Z

> thank you

np <3

@tjvr gist and public domain

tjvr · 2017-07-24T20:35:40Z

Copying this here because I'm not sure if gist comment notifications are a thing:

I’m curious, why did you choose to consume newline tokens when they’re before an indent/dedent?

Notably, Python doesn’t do this.

nathan · 2017-07-24T20:49:03Z

> I'm not sure if gist comment notifications are a thing

Apparently they are not.

> why did you choose to consume newline tokens when they’re before an indent/dedent?

If you want newline to act as a terminator, you probably shouldn't do this; if you want it to act as a separator (as I usually do), this is what you want. Especially in the case of indent, I find it more natural to parse:

'if' expr ':' indent stmt [nl stmt]* dedent

than

'if' expr ':' nl indent stmt nl [stmt nl]* dedent

(I usually write my parsers by hand. If you use a parser generator then this probably doesn't make much of a difference to you)
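To make the separator style concrete, here is a small hand-written parser sketch over a token stream shaped like the one the `indented` wrapper above produces (`id`, `nl`, `indent`, `dedent`). The statement shape (`words` plus `body`) and the name `parseStatements` are invented for the example, and there is no `':'` token because the lexer in this thread has none. Since that wrapper emits no `nl` after a `dedent`, the sketch treats `nl` as optional between statements.

```javascript
// Hypothetical recursive-descent parser for:
//   stmts := stmt ([nl] stmt)*
//   stmt  := id+ [indent stmts dedent]
function parseStatements(tokens) {
  let pos = 0
  const peekType = () => (pos < tokens.length ? tokens[pos].type : 'eof')

  function expect(type) {
    if (peekType() !== type) {
      throw new Error(`expected ${type} but found ${peekType()}`)
    }
    return tokens[pos++]
  }

  function stmt() {
    const words = [expect('id').value]
    while (peekType() === 'id') words.push(tokens[pos++].value)

    let body = []
    if (peekType() === 'indent') {
      pos++ // consume indent
      body = stmts()
      expect('dedent')
    }
    return {words, body}
  }

  function stmts() {
    const list = [stmt()]
    while (peekType() !== 'dedent' && peekType() !== 'eof') {
      if (peekType() === 'nl') pos++ // nl only separates statements
      list.push(stmt())
    }
    return list
  }

  const program = stmts()
  if (peekType() !== 'eof') {
    throw new Error(`unexpected ${peekType()}`)
  }
  return program
}

// Tokens for:  if this / (indented) one, two / else / (indented) there
const id = value => ({type: 'id', value})
const exampleTokens = [
  id('if'), id('this'), {type: 'indent'},
  id('one'), {type: 'nl'}, id('two'), {type: 'dedent'},
  id('else'), {type: 'indent'}, id('there'), {type: 'dedent'},
]
console.log(JSON.stringify(parseStatements(exampleTokens), null, 2))
```

With terminator-style newlines, `stmt` would instead have to consume an `nl` before looking for `indent`, which is the extra bookkeeping the comment above is trying to avoid.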

JoshuaGrams · 2017-09-16T21:12:15Z

I needed something to use with Nearley, so here's a quick-and-dirty implementation which wraps a lexer in another lexer: https://gist.github.com/JoshuaGrams/84acba3f58410f9cef2d496d85bfa173

It doesn't do anything on save/reset: maybe it should save and reset the indentation stack? I'm not clear how much state is supposed to be reset, since it looks like moo resets the line/char numbers but leaves the state stack alone...?

tjvr · 2017-09-17T12:45:02Z

@JoshuaGrams Nathan's implementation looks more comprehensive than yours, although I realise it doesn't conform to the Lexer API that Nearley expects.

Note that unless you're using Nearley's rewinding features (which I should really document), you don't need to implement save(): in particular, the info argument to reset(chunk, info) will always be undefined. (I should document this too!)

I hope that helps a little bit. Shout if that's unclear, or you run into any issues. :)

> it looks like moo resets the line/char numbers but leaves the state stack alone

That sounds like a bug! I opened #75 to track this.

Thanks for reporting this! :)

tjvr · 2017-09-17T12:45:44Z

> which wraps a lexer in another lexer

FWIW, we should probably come up with a nicer way for wrapping Moo Lexers. Perhaps just some kind of helper for constructing a subclass.

JoshuaGrams · 2017-09-18T09:19:27Z

Oh yeah, the actual indentation-handling part of mine is still half-baked: I just thought it might be nice to have some sort of example of keeping the moo interface intact. And even there...hmm. I should probably just have called Object.create on the lexer and replaced the next method? I'll update it as I get it working better.
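A minimal sketch of the `Object.create` idea, under several assumptions. `tinyLexer` is a stand-in written for this example, not moo: it only mimics the `reset()`/`next()`/`has()` shape, and its `nl` token swallows the indentation after the newline so the wrapper needs no lookahead. `withIndentation` is likewise a made-up name. Following the note above, `save()` is left out, and `reset()` clears the indentation stack, which is one possible answer to the question raised earlier rather than confirmed moo behaviour.

```javascript
// Stand-in for a moo lexer (NOT moo itself), so the example runs
// without dependencies.
function tinyLexer() {
  const rule = /((?:\r?\n[ \t]*)+)|([ \t]+)|(\w+)/y
  return {
    buffer: '',
    index: 0,
    reset(source) {
      this.buffer = source
      this.index = 0
      return this
    },
    next() {
      if (this.index >= this.buffer.length) return undefined
      rule.lastIndex = this.index
      const m = rule.exec(this.buffer)
      if (!m) throw new Error('invalid syntax at offset ' + this.index)
      this.index = rule.lastIndex
      const type = m[1] !== undefined ? 'nl' : m[2] !== undefined ? 'ws' : 'id'
      return {type, value: m[0]}
    },
    has(type) {
      return type === 'nl' || type === 'ws' || type === 'id'
    },
  }
}

// Hypothetical wrapper: same interface as the inner lexer, plus
// synthetic indent/dedent tokens.
function withIndentation(inner) {
  let stack = [0] // widths of the enclosing blocks
  let queue = []  // synthetic tokens waiting to be handed out
  const wrapped = Object.create(inner) // unreplaced methods are inherited

  wrapped.reset = function (source, info) {
    stack = [0]
    queue = []
    inner.reset(source, info)
    return this
  }

  wrapped.has = type =>
    type === 'indent' || type === 'dedent' || inner.has(type)

  wrapped.next = function () {
    for (;;) {
      if (queue.length > 0) return queue.shift()
      const tok = inner.next()
      if (tok === undefined) {
        // End of input: close one still-open block per call.
        if (stack.length > 1) {
          stack.pop()
          return {type: 'dedent'}
        }
        return undefined
      }
      if (tok.type === 'ws') continue // whitespace inside a line
      if (tok.type !== 'nl') return tok

      // Indentation is whatever follows the last newline in the token.
      const width = tok.value.length - tok.value.lastIndexOf('\n') - 1
      if (width > stack[stack.length - 1]) {
        stack.push(width)
        queue.push({type: 'indent'})
      } else if (width === stack[stack.length - 1]) {
        queue.push(tok)
      } else {
        while (width < stack[stack.length - 1]) {
          stack.pop()
          queue.push({type: 'dedent'})
        }
        if (width !== stack[stack.length - 1]) {
          throw new Error('inconsistent indentation')
        }
      }
    }
  }
  return wrapped
}

const lexer = withIndentation(tinyLexer())
lexer.reset('if this\n  if that\nelse\n  there\n')
for (let tok; (tok = lexer.next()); ) console.log(tok.type)
```

Unlike the generator earlier in the thread, this variant keeps `nl` tokens between statements at the same level, and it is pull-based, so it fits consumers that call `next()` one token at a time.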

danielo515 · 2018-07-08T20:37:01Z

I'm having a really bad time trying to make a grammar sensitive to whitespace.
This is what I got so far: https://github.com/danielo515/packages/tree/feature/whitespaceLexer/packages/fucc-script/src

I stole some pieces of code from @juliankrispel, but the grammar I've come up with so far is very ambiguous and generates too many results. Obviously I'm a total noob and I'm not sure how to properly specify things that can span one line or several lines.
My actual grammar is in grammar-mo.ne, so please ignore grammar.ne.
If anyone wants to help me in any way, I'd really appreciate it.

aliclark · 2020-08-19T20:47:49Z

I've created a full-featured module for this: https://github.com/aliclark/moo-indentation-lexer

juliankrispel closed this as completed Jul 15, 2017

tjvr reopened this Jul 15, 2017

tjvr mentioned this issue Jul 15, 2017: "How would I go about writing a whitespace sensitive grammar with nearley?" kach/nearley#251 (Closed)

tjvr added the question label Jul 20, 2017

tjvr changed the title from "Any tips on how to implement a tokenizer that is context-sensitive" to "How to lex Python-style indentation using INDENT and DEDENT tokens" Jul 20, 2017

tjvr mentioned this issue Jul 24, 2017: "Add moo/indent wrapper" #66 (Open)

tjvr mentioned this issue Sep 28, 2017: "Lexer.has does not find error token." #76 (Closed)

nathan mentioned this issue Aug 25, 2022: "Support for pop values higher than 1" #177 (Closed)

TekuConcept mentioned this issue Apr 2, 2024: "An example showing significant whitespace?" kach/nearley#643 (Open)
