Document how to parse indentation-based grammars #20
You cannot perform look-behind with Participle. If you're trying to parse YAML, you will almost certainly need a custom lexer that includes indentation as a first-class token. If you could describe and give examples of both situations where the
The YAML-like array is something I would like to be able to have. In the case of arrays, a look-behind would be nice, or, as you mentioned, a way to check the indentation to see if I am inside an array.
I more meant that if you show me an example of what you're trying to achieve I may be able to provide you with a grammar that will parse it.
Oh ok. I can do that later today.
So my current code is at https://github.com/Nordgedanken/Matrix-DSL. I tried to have a custom lexer based on Participle, but as you mentioned, look-behind failed (nil pointers^^). You can find the grammar I am using at https://github.com/Nordgedanken/Matrix-DSL/blob/master/cmd/lexer/ast.go. The full example I currently have is:
So let's break down how I wanted to have this parsed:

I hope that's detailed enough, or at least the right thing.
If I understand correctly, this will not be possible to achieve with Participle because it isn't a context-free grammar (CFG). It looks similar to YAML, which requires context across lines in order to track the current indentation level. I will update the README to clarify that Participle can only be used to parse CFGs. Sorry about that :(
Actually I'm going to correct myself here and go back to my original suggestion: this will require a custom lexer. Basically, each time the text is indented the lexer will need to emit an Indent token, and each time it is unindented an Unindent token. The grammar will then need to match against these tokens. eg. (this will not work but might give you a starting point)

```go
type Entry struct {
	Key   string `@Ident ":"`
	Value *Value `@@`
}

type Value struct {
	Array   []*Value `Indent @@ { @@ } Unindent`
	Boolean *bool    `| ( @"true" | "false")`
}
```

You might also need #3 to be solved.
@alecthomas I will try that. Thanks for the suggestion and for the help!
In case it's helpful, I've written lexers for a few indentation-sensitive languages and found a pattern that works well: I usually first write a "normal" lexer which (unusually) considers a newline as an explicit token and considers a sequence of consecutive other whitespace characters as an explicit token. This might give you a token stream like this for part of the example above:
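The original token-stream example didn't survive here, but a minimal hand-rolled sketch of such a pass-one scanner (illustrative only, not Participle's lexer; the `Token` type and `scan` function are invented names) might look like this, along with the stream it produces for a small fragment:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is a minimal token for the sketch; a real lexer would carry
// source positions as well.
type Token struct {
	Type  string // "Newline", "Whitespace", or "Word"
	Value string
}

// scan splits src into Newline, Whitespace and Word tokens, keeping
// whitespace instead of discarding it. ASCII-only for simplicity.
func scan(src string) []Token {
	var toks []Token
	for len(src) > 0 {
		switch r := rune(src[0]); {
		case r == '\n':
			toks = append(toks, Token{"Newline", "\n"})
			src = src[1:]
		case unicode.IsSpace(r):
			// Consume a run of non-newline whitespace.
			n := strings.IndexFunc(src, func(r rune) bool {
				return r == '\n' || !unicode.IsSpace(r)
			})
			if n < 0 {
				n = len(src)
			}
			toks = append(toks, Token{"Whitespace", src[:n]})
			src = src[n:]
		default:
			// Consume a run of non-whitespace characters.
			n := strings.IndexFunc(src, unicode.IsSpace)
			if n < 0 {
				n = len(src)
			}
			toks = append(toks, Token{"Word", src[:n]})
			src = src[n:]
		}
	}
	return toks
}

func main() {
	for _, t := range scan("key:\n  value\n") {
		fmt.Printf("%s %q\n", t.Type, t.Value)
	}
	// Output:
	// Word "key:"
	// Newline "\n"
	// Whitespace "  "
	// Word "value"
	// Newline "\n"
}
```

The point is simply that the whitespace survives into the token stream, so a second pass has something to measure.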
With that lexer working, you can then wrap it in another lexer that consumes tokens from the first while tracking a stack of currently-open indent levels, passing through verbatim any tokens that aren't newline or whitespace tokens. It's important to keep a stack of indent levels that are "open" so you can generate the right number of synthetic dedent tokens when a single line closes more than one level at once. I've not tried to do the above with Participle specifically. I'm sure some or all of the above will be familiar to some readers, but I thought I'd leave this here in case it's useful to someone who finds this issue in future. Doing the indent processing separately means that a more "normal" lexer (e.g. EBNF-based) can be used for most of the scanning, as long as it doesn't filter out the whitespace when generating tokens.
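The stack-tracking step described above can be sketched on its own. This is a simplified illustration (the `indentEvents` name is invented): it compares indent widths only, where a production lexer should compare the whitespace strings themselves so tabs and spaces aren't conflated:

```go
package main

import "fmt"

// indentEvents turns the leading-whitespace width of each line into a
// stream of Indent/Dedent/Line events. A stack of currently-open indent
// widths is kept so that a single line can close several levels at once.
func indentEvents(widths []int) []string {
	var events []string
	stack := []int{0} // column 0 is always "open"
	for _, w := range widths {
		// Close every level this line is no longer inside.
		for w < stack[len(stack)-1] {
			stack = stack[:len(stack)-1]
			events = append(events, "Dedent")
		}
		// Open a new level if this line is indented further.
		if w > stack[len(stack)-1] {
			stack = append(stack, w)
			events = append(events, "Indent")
		}
		events = append(events, "Line")
	}
	// Close anything still open at end of input.
	for len(stack) > 1 {
		stack = stack[:len(stack)-1]
		events = append(events, "Dedent")
	}
	return events
}

func main() {
	// A line at depth 4 followed by a line back at depth 0 closes
	// two levels at once, hence two consecutive Dedents.
	fmt.Println(indentEvents([]int{0, 2, 4, 0}))
	// → [Line Indent Line Indent Line Dedent Dedent Line]
}
```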
@apparentlymart Nice idea about composing the lexers 👍 It seems like it should be possible to provide a generic
That could be useful, indeed! Another detail I meant to mention: indent-aware grammars often have certain syntax constructs where indentation-awareness is temporarily suspended. For example, Python's lexer suspends generation of indent and dedent tokens while inside open brackets, so an expression can span multiple lines without affecting block structure.

This can also be handled in a wrapping lexer like I was describing, but it requires a little more state tracking (a stack of open brackets, or just a count if you're not concerned about handling incorrect nesting) and recognizing those open and close bracket tokens. The hypothetical generic wrapper lexer would need to take this into account as well.
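The count-based variant mentioned above can be sketched in a few lines (the `bracketTracker` type and its method names are illustrative, not any Participle API): a wrapping lexer would feed every token through `observe` and consult `suspended` before emitting indent or dedent tokens.

```go
package main

import "fmt"

// bracketTracker counts currently-open brackets. Using a count rather
// than a stack means incorrect nesting like "(]" goes undetected, which
// is the trade-off described above.
type bracketTracker struct {
	depth int
}

// observe updates the bracket depth for a single token.
func (b *bracketTracker) observe(tok string) {
	switch tok {
	case "(", "[", "{":
		b.depth++
	case ")", "]", "}":
		if b.depth > 0 {
			b.depth--
		}
	}
}

// suspended reports whether we are inside at least one open bracket,
// i.e. whether indent/dedent generation should currently be skipped.
func (b *bracketTracker) suspended() bool { return b.depth > 0 }

func main() {
	var b bracketTracker
	for _, tok := range []string{"(", "[", "]"} {
		b.observe(tok)
	}
	fmt.Println(b.suspended()) // still inside "(" → true
	b.observe(")")
	fmt.Println(b.suspended()) // all brackets closed → false
}
```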
@apparentlymart thanks. Took me a few tries to figure it out. If someone wants to know how a basic one is done, it's here: https://github.com/pyros2097/pine/blob/master/compiler/lexer.go. It's still a work in progress though and is not final.
I have a need for an indenting lexer in a side project I'm working on. It seems to work quite well so far. Once it's a bit more battle-tested I think I'll include it in Participle itself, but for now here it is:

```go
// Package indenter provides a lexer that inserts INDENT and DEDENT
// tokens based on indentation.
//
// It relies on the parent lexer to provide an "NL" token that matches
// newlines followed by whitespace.
package indenter

import (
	"io"
	"maps"
	"strings"

	"github.com/alecthomas/participle/v2/lexer"
)

func New(parent lexer.Definition) lexer.Definition {
	out := maps.Clone(parent.Symbols())
	if _, ok := out["NL"]; !ok {
		panic("parent lexer must have an NL symbol similar to \"[\\n\\r]+[\\s\\t]*\"")
	}
	if _, ok := out["Indent"]; ok {
		panic("parent lexer must not have an Indent symbol")
	}
	if _, ok := out["Dedent"]; ok {
		panic("parent lexer must not have a Dedent symbol")
	}
	// Allocate two token types below any the parent already uses.
	next := lexer.TokenType(-1)
	for _, t := range out {
		if t <= next {
			next = t - 1
		}
	}
	out["Indent"] = next
	out["Dedent"] = next - 1
	return &indentLexerDef{parent: parent, symbols: out}
}

type indentLexerDef struct {
	parent  lexer.Definition
	symbols map[string]lexer.TokenType
}

func (i *indentLexerDef) Symbols() map[string]lexer.TokenType {
	return i.symbols
}

func (i *indentLexerDef) Lex(filename string, r io.Reader) (lexer.Lexer, error) {
	lex, err := i.parent.Lex(filename, r)
	if err != nil {
		return nil, err
	}
	return &indentLexer{
		lexer:      lex,
		nlType:     i.symbols["NL"],
		indentType: i.symbols["Indent"],
		dedentType: i.symbols["Dedent"],
	}, nil
}

var _ lexer.Definition = (*indentLexerDef)(nil)

type indentLexer struct {
	nlType     lexer.TokenType
	indentType lexer.TokenType
	dedentType lexer.TokenType
	indents    []string
	buffered   []lexer.Token
	lexer      lexer.Lexer
}

func (i *indentLexer) Next() (lexer.Token, error) {
	if len(i.buffered) > 0 {
		t := i.buffered[0]
		i.buffered = i.buffered[1:]
		return t, nil
	}
	t, err := i.lexer.Next()
	if err != nil || t.Type != i.nlType {
		return t, err
	}
	pos := t.Pos
	i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.nlType, Value: "\n"})
	pos.Line++
	pos.Column = 1
	// At this point we have an NL symbol. Figure out if we need to
	// insert Indent tokens or Dedent tokens.
	indent := strings.TrimLeft(t.Value, "\n\r")
	for j, ind := range i.indents {
		if strings.HasPrefix(indent, ind) {
			indent = indent[len(ind):]
			pos.Column += len(ind)
			pos.Offset += len(ind)
		} else {
			k := len(i.indents) - j
			i.indents = i.indents[:j]
			for ; k > 0; k-- {
				i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.dedentType, Value: "⇤"})
			}
			break
		}
	}
	if indent != "" {
		i.indents = append(i.indents, indent)
		i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.indentType, Value: "⇥"})
	}
	t = i.buffered[0]
	i.buffered = i.buffered[1:]
	return t, nil
}

var _ lexer.Lexer = (*indentLexer)(nil)
```
Thanks already for that @alecthomas — and a question on this:
How viable do you feel this is for inclusion in Participle by now?
Hi, I want to check if the previous char is a `:`. The exact example code I have is this:

I want basically to be able to check, based on whether the previous char is a `:`, that the `!` should get a string, and later reuse the look-back logic to check if I am inside an array which is defined like in YAML, meaning to make sure the last 2 lines of the following example don't get into the array:

(Commands defines an array with 2 objects in it, which both have two key-value pairs in them. HomeServer and HomeRoom are outside of that array.)