-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to lex Python-style indentation using INDENT and DEDENT tokens #55
Comments
The goal of a tokenizer is to give you a long flat list of tokens. Words "block" and "scoping" are on higher abstraction levels, where you iterate over the tokens and assemble a nested structure known as syntax tree. A simple solution comes to mind, but maybe not the best:
|
@deltaidea I know what a lexer is :/ no need to explain. But lexers can also be context aware. For example, let's say I capture the whitespace at the beginning of each line as indent. What I'd then want to do is have a token for INDENT_OUT and INDENT_IN. That'd make building a whitespace sensitive grammar a lot easier. |
Here ya go. const moo = require('moo')
const lexer = moo.compile({
ws: /[ \t]+/,
nl: { match: /(?:\r\n?|\n)+/, lineBreaks: true },
id: /\w+/,
})
// example
const tokens = indented(lexer, `
if this
if that
another
else
there
`)
for (const tok of tokens) console.log(tok)
// implementation
function* indented(lexer, source) {
let iter = peekable(lexer.reset(source))
let stack = []
// absorb initial blank lines and indentation
let indent = iter.nextIndent()
for (let tok; tok = iter.next(); ) {
if (tok.type === 'nl') {
const newIndent = iter.nextIndent()
if (newIndent == null) break // eof
if (newIndent === indent) {
yield {type: 'nl'}
} else if (newIndent > indent) {
stack.push(indent)
indent = newIndent
yield {type: 'indent'}
} else {
while (newIndent < indent) {
indent = stack.pop()
yield {type: 'dedent'}
}
if (newIndent !== indent) {
throw new Error('inconsistent indentation')
}
}
indent = newIndent
// ignore whitespace within lines
} else if (tok.type !== 'ws') {
yield tok
}
}
// dedent remaining blocks at eof
for (let i = stack.length; i--;) {
yield {type: 'dedent'}
}
}
function peekable(lexer) {
let here = lexer.next()
return {
next() {
const old = here
here = lexer.next()
return old
},
peek() {
return here
},
nextIndent() {
for (let tok; tok = this.peek(); ) {
if (tok.type === 'nl') {
this.next()
continue
}
if (tok.type === 'ws') {
const indent = tok.value.length
this.next()
const next = this.peek()
if (!next) return
if (next.type === 'nl') {
this.next()
continue
}
return indent
}
return 0
}
},
}
} |
Holy crap @nathan - thank you 😍 I wasn't ecpecting something fleshed out. I hope it was a copy and paste job. Thank you! For anyone interested, this is what I'm working on - https://github.com/juliankrispel/bishbosh Still looking for feedback btw ✌️ |
Leaving this open, because I'd like to save your code somewhere @nathan for my own use if nothing else! :-) |
Copying this here because I'm not sure if gist comment notifications are a thing:
|
Apparently they are not.
If you want newline to act as a terminator, you probably shouldn't do this; if you want it to act as a separator (as I usually do), this is what you want. Especially in the case of
than
(I usually write my parsers by hand. If you use a parser generator then this probably doesn't make much of a difference to you) |
I needed something to use with Nearley, so here's a quick-and-dirty implementation which wraps a lexer in another lexer: https://gist.github.com/JoshuaGrams/84acba3f58410f9cef2d496d85bfa173 It doesn't do anything on save/reset: maybe it should save and reset the indentation stack? I'm not clear how much state is supposed to be reset, since it looks like moo resets the line/char numbers but leaves the state stack alone...? |
@JoshuaGrams Nathan's implementation looks more comprehensive than yours, although I realise it doesn't conform to the Lexer API that Nearley expects. Note that unless you're using Nearley's rewinding features (which I should really document), you don't to implement I hope that helps a little bit. Shout if that's unclear, or you run into any issues. :)
That sounds like a bug! I opened #75 to track this. Thanks for reporting this! :) |
FWIW, we should probably come up with a nicer way for wrapping Moo Lexers. Perhaps just some kind of helper for constructing a subclass. |
Oh yeah, the actual indentation-handling part of mine is still half-baked: I just thought it might be nice to have some sort of example of keeping the moo interface intact. And even there...hmm. I should probably just have called Object.create on the lexer and replaced the |
I having a real bad time trying to make a grammar sensitive to whitespace. I stolen some pieces of code from @juliankrispel , but the grammar I come so far it's very ambiguos and it generates too much results. Obviously I'm a total noob and I'm not sure how to properly specify things that can be one line or several lines. |
I've created a full-featured module for this: https://github.com/aliclark/moo-indentation-lexer |
I'm guessing moo isn't designed for this but I thought I'd ask anyway in case I'm missing something.
I basically would like to implement whitespace sensitive block scoping, like python, yaml, coffeescript etc...
But I've no idea where to start
The text was updated successfully, but these errors were encountered: