
How to lex Python-style indentation using INDENT and DEDENT tokens #55

Open
juliankrispel opened this issue Jul 15, 2017 · 14 comments

juliankrispel commented Jul 15, 2017

I'm guessing moo isn't designed for this, but I thought I'd ask anyway in case I'm missing something.

I basically would like to implement whitespace-sensitive block scoping, like Python, YAML, CoffeeScript, etc.

But I have no idea where to start.

deltaidea (Contributor) commented

The goal of a tokenizer is to give you a long, flat list of tokens. The words "block" and "scoping" belong to a higher abstraction level, where you iterate over the tokens and assemble a nested structure known as a syntax tree. A simple solution comes to mind (a rough sketch follows the list), though maybe not the best:

  • Lex consecutive spaces into a single token.
  • When you iterate and see a newline token followed by spaces, compare their length with the remembered length of the spaces on the previous line and decide whether you need to create a new object in the tree.
  • Once you have a tree, you can walk over it to transform, lint, compile, whatever.
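
For what it's worth, here's a rough, untested sketch of that approach (the token names, buildTree, and everything else here are just illustrative, not anything moo provides): split the stream on nl tokens, remember each line's leading-whitespace length, and nest lines whose indentation is deeper than the previous one's.

const moo = require('moo')

const lexer = moo.compile({
  ws: /[ \t]+/,
  nl: { match: /\n+/, lineBreaks: true },
  word: /\w+/,
})

function buildTree(source) {
  const root = { indent: -1, children: [] }
  const stack = [root]               // enclosing blocks, innermost last
  let line = null
  let indent = 0

  lexer.reset(source)
  let tok
  while ((tok = lexer.next())) {
    if (tok.type === 'nl') {
      line = null                    // the next token starts a new line
      indent = 0
    } else if (tok.type === 'ws' && line === null) {
      indent = tok.value.length      // leading whitespace = indentation
    } else if (tok.type !== 'ws') {
      if (line === null) {
        // deeper indentation nests under the previous line;
        // equal or shallower indentation pops back out first
        while (stack[stack.length - 1].indent >= indent) stack.pop()
        line = { indent, words: [], children: [] }
        stack[stack.length - 1].children.push(line)
        stack.push(line)
      }
      line.words.push(tok.value)
    }
  }
  return root
}

For an input like "if this\n  if that\n    another\nelse\n  there\n" this gives two top-level nodes (if this and else), each with its indented lines nested beneath it.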

juliankrispel (Author) commented

@deltaidea I know what a lexer is :/ no need to explain.

But lexers can also be context-aware. For example, let's say I capture the whitespace at the beginning of each line as an indent token.

What I'd then want to do is have INDENT_OUT and INDENT_IN tokens. That'd make building a whitespace-sensitive grammar a lot easier.

nathan (Collaborator) commented Jul 15, 2017

Here ya go.

Copied to a gist too.

const moo = require('moo')

const lexer = moo.compile({
  ws: /[ \t]+/,
  nl: { match: /(?:\r\n?|\n)+/, lineBreaks: true },
  id: /\w+/,
})

// example

const tokens = indented(lexer, `
if this
  if that
    another
else
  there
`)

for (const tok of tokens) console.log(tok)
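
// For the sample above, the stream comes out (by type and value) roughly as:
//   id(if) id(this) indent id(if) id(that) indent id(another)
//   dedent dedent id(else) indent id(there) dedent
// -- no nl tokens here, since the indentation changes at every line break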

// implementation

// re-lex `source`, replacing leading whitespace with synthetic
// indent/dedent tokens ({type: 'nl'} is only emitted when the next
// line's indentation is unchanged)
function* indented(lexer, source) {
  let iter = peekable(lexer.reset(source))
  let stack = []

  // absorb initial blank lines and indentation
  let indent = iter.nextIndent()

  for (let tok; tok = iter.next(); ) {
    if (tok.type === 'nl') {
      const newIndent = iter.nextIndent()
      if (newIndent == null) break // eof

      if (newIndent === indent) {
        yield {type: 'nl'}

      } else if (newIndent > indent) {
        stack.push(indent)
        indent = newIndent
        yield {type: 'indent'}

      } else {
        while (newIndent < indent) {
          indent = stack.pop()
          yield {type: 'dedent'}
        }
        if (newIndent !== indent) {
          throw new Error('inconsistent indentation')
        }
      }
      indent = newIndent

    // ignore whitespace within lines
    } else if (tok.type !== 'ws') {
      yield tok
    }
  }

  // dedent remaining blocks at eof
  for (let i = stack.length; i--;) {
    yield {type: 'dedent'}
  }
}

// wrap the lexer with one-token lookahead; nextIndent() skips blank
// lines and returns the indentation (in characters) of the next
// non-blank line, or undefined at eof
function peekable(lexer) {
  let here = lexer.next()
  return {
    next() {
      const old = here
      here = lexer.next()
      return old
    },
    peek() {
      return here
    },
    nextIndent() {
      for (let tok; tok = this.peek(); ) {
        if (tok.type === 'nl') {
          this.next()
          continue
        }
        if (tok.type === 'ws') {
          const indent = tok.value.length
          this.next()

          const next = this.peek()
          if (!next) return
          if (next.type === 'nl') {
            this.next()
            continue
          }
          return indent
        }
        return 0
      }
    },
  }
}

juliankrispel (Author) commented

Holy crap @nathan - thank you 😍

I wasn't expecting something fleshed out. I hope it was a copy-and-paste job. Thank you!

For anyone interested, this is what I'm working on - https://github.com/juliankrispel/bishbosh

Still looking for feedback btw ✌️

tjvr (Collaborator) commented Jul 15, 2017

Leaving this open, because I'd like to save your code somewhere @nathan for my own use if nothing else! :-)

nathan (Collaborator) commented Jul 15, 2017

> thank you

np <3

@tjvr it's in a gist, and public domain

tjvr added the question label Jul 20, 2017
tjvr changed the title from "Any tips on how to implement a tokenizer that is context-sensitive" to "How to lex Python-style indentation using INDENT and DEDENT tokens" Jul 20, 2017
tjvr (Collaborator) commented Jul 24, 2017

Copying this here because I'm not sure if gist comment notifications are a thing:

> I’m curious, why did you choose to consume newline tokens when they’re before an indent/dedent?
>
> Notably, Python doesn’t do this.

nathan (Collaborator) commented Jul 24, 2017

> I'm not sure if gist comment notifications are a thing

Apparently they are not.

> why did you choose to consume newline tokens when they’re before an indent/dedent?

If you want newline to act as a terminator, you probably shouldn't do this; if you want it to act as a separator (as I usually do), this is what you want. Especially in the case of indent, I find it more natural to parse:

'if' expr ':' indent stmt [nl stmt]* dedent

than

'if' expr ':' nl indent stmt nl [stmt nl]* dedent

(I usually write my parsers by hand. If you use a parser generator then this probably doesn't make much of a difference to you)
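
For what it's worth, here's roughly what that separator-style rule can look like in a hand-written recursive-descent parser. This is a hypothetical sketch, not code from the gist: expect, parseExpr, and parseStmt are made-up helpers, the colon token isn't produced by the lexer above, and tokens is assumed to be a peekable stream like the peekable() wrapper in the earlier example.

// 'if' expr ':' indent stmt [nl stmt]* dedent
function parseIf(tokens) {
  expect(tokens, 'if')
  const cond = parseExpr(tokens)
  expect(tokens, 'colon')
  expect(tokens, 'indent')
  const body = [parseStmt(tokens)]
  while (tokens.peek() && tokens.peek().type === 'nl') {
    tokens.next()                    // nl separates the statements in a block
    body.push(parseStmt(tokens))
  }
  expect(tokens, 'dedent')
  return { type: 'if', cond, body }
}

function expect(tokens, type) {
  const tok = tokens.next()
  if (!tok || tok.type !== type) throw new Error('expected ' + type)
  return tok
}

With the terminator-style grammar you'd be expecting an nl at fixed spots around the indent and dedent as well, which is the extra bookkeeping the separator style avoids.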

JoshuaGrams (Contributor) commented

I needed something to use with Nearley, so here's a quick-and-dirty implementation which wraps a lexer in another lexer: https://gist.github.com/JoshuaGrams/84acba3f58410f9cef2d496d85bfa173

It doesn't do anything on save/reset: maybe it should save and reset the indentation stack? I'm not clear how much state is supposed to be reset, since it looks like moo resets the line/char numbers but leaves the state stack alone...?

tjvr (Collaborator) commented Sep 17, 2017

@JoshuaGrams Nathan's implementation looks more comprehensive than yours, although I realise it doesn't conform to the Lexer API that Nearley expects.

Note that unless you're using Nearley's rewinding features (which I should really document), you don't need to implement save(): in particular, the info argument to reset(chunk, info) will always be undefined. (I should document this too!)

I hope that helps a little bit. Shout if that's unclear, or you run into any issues. :)


> it looks like moo resets the line/char numbers but leaves the state stack alone

That sounds like a bug! I opened #75 to track this.

Thanks for reporting this! :)

tjvr (Collaborator) commented Sep 17, 2017

> which wraps a lexer in another lexer

FWIW, we should probably come up with a nicer way of wrapping Moo Lexers. Perhaps just some kind of helper for constructing a subclass.

JoshuaGrams (Contributor) commented

Oh yeah, the actual indentation-handling part of mine is still half-baked: I just thought it might be nice to have some sort of example of keeping the moo interface intact. And even there...hmm. I should probably just have called Object.create on the lexer and replaced the next method? I'll update it as I get it working better.
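
For what it's worth, here's a rough, untested sketch of that Object.create idea (none of this is from the gist): inherit save/formatError/has from the moo lexer and replace only next(), splicing in synthetic indent/dedent tokens from a small queue. It assumes the ws/nl token names from nathan's example, drops nl tokens entirely, and leaves out his blank-line and initial-indentation handling.

const moo = require('moo')

function indentAware(lexer) {
  const wrapped = Object.create(lexer)  // inherits save/formatError/has, etc.
  const queue = []                      // synthetic tokens waiting to go out
  const stack = []                      // widths of the enclosing blocks
  let indent = 0

  // clear the indentation state whenever the underlying lexer is reset
  wrapped.reset = function (chunk, info) {
    queue.length = 0
    stack.length = 0
    indent = 0
    lexer.reset(chunk, info)
    return wrapped
  }

  wrapped.next = function next() {
    if (queue.length) return queue.shift()

    const tok = lexer.next()
    if (!tok) {
      // close any blocks still open at eof
      while (stack.length) {
        indent = stack.pop()
        queue.push({ type: 'dedent' })
      }
      return queue.shift()              // undefined once everything is drained
    }
    // everything except newlines passes through unchanged
    if (tok.type !== 'nl') return tok

    // measure the indentation of the next line
    const after = lexer.next()
    const width = after && after.type === 'ws' ? after.value.length : 0
    if (width > indent) {
      stack.push(indent)
      indent = width
      queue.push({ type: 'indent' })
    } else {
      while (width < indent && stack.length) {
        indent = stack.pop()
        queue.push({ type: 'dedent' })
      }
      if (width !== indent) throw new Error('inconsistent indentation')
    }
    if (after && after.type !== 'ws') queue.push(after)
    return queue.length ? queue.shift() : next()
  }

  return wrapped
}

The reset() override is one possible answer to the save/reset question above: the queue and the indentation stack get cleared along with the underlying lexer, and save() would need similar treatment if you rely on nearley's rewinding.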

danielo515 commented

I'm having a really bad time trying to make a grammar sensitive to whitespace.
This is what I've got so far: https://github.com/danielo515/packages/tree/feature/whitespaceLexer/packages/fucc-script/src

I stole some pieces of code from @juliankrispel, but the grammar I've come up with so far is very ambiguous and generates too many results. Obviously I'm a total noob and I'm not sure how to properly specify things that can be one line or several lines.
My actual grammar is in grammar-mo.ne, so please ignore grammar.ne.
If anyone wants to help me in any way, I'd really appreciate it.

aliclark commented

I've created a full-featured module for this: https://github.com/aliclark/moo-indentation-lexer
