Document how to parse indentation based grammars #20

MTRNord · 2018-04-25T15:35:58Z

Hi I want to check if the prevoius char is a :. The exact example code I have is this:

Prefix: !

I want basicly to be able to check based on if the previous char is a : the ! should get a string and later reuse the look back logic to check if I am inside an array which is defined like in yaml meaning to make sure the last 2 lines of the following example don't get into the array:

Commands:
    - Trigger: afk
      Response: "{{user}} just went afk"
    - Trigger: [join_event]
      Response: "Welcome to the Room {{last_joined_user}}"
Homeserver: matrix.org
HomeRoom: {{name}}

(Commands defines a Array with 2 objects in it which both have two key value pairs in them. HomeServer and HomeRoom are outside of that array)

The text was updated successfully, but these errors were encountered:

alecthomas · 2018-04-27T07:56:34Z

You cannot perform look-behind with Participle.

If you're trying to parse YAML, you will almost certainly need a custom lexer that includes indentation as a first-class token.

If you could describe and give examples of both situations where the ! is used that require the look-behind, I might be able to help further.

MTRNord · 2018-04-27T09:26:30Z

The ! is just any char possible. I wanted to stringify anything after the : before a new line. But thats something I have no problem with if that is impossible.

The yaml like array is something I woud like to be able to have. In that case of arrays a look behind would be nice or as you mentioned a way to check the identation to see if I am inside an array.

alecthomas · 2018-04-27T09:51:26Z

I more meant that if you show me an example of what you're trying to achieve I may be able to provide you with a grammar that will parse it.

MTRNord · 2018-04-27T10:13:07Z

Oh ok. I can do that later this day.

MTRNord · 2018-04-27T18:06:53Z

So my current code is at https://github.com/Nordgedanken/Matrix-DSL I tried to have a custom lexer based on participle but as you mentioned look-behind failed (nil pointers^^). You can find the gramar I am using at https://github.com/Nordgedanken/Matrix-DSL/blob/master/cmd/lexer/ast.go

The full example I currently have is:

[[BOT]]
Name: Test Bot
Description: Some Demo Bot
Prefix: !
Autojoin: true
Commands:
    - Trigger: afk
      Response: {{user}} just went afk
    - Trigger: [join_event]
      Response: Welcome to the Room {{last_joined_user}}
Homeserver: matrix.org
HomeRoom: {{name}}

So lets break up how I wanted to have this parsed:

[[BOT]] is a header for the type of config. This already works
From "Name" to "Autojoin" there are 4 key value pairs which I want to parse into a string. This currently fails as the default lexer requires me to have " around strings.
"Commands" starts a Array. This is exact the same as yaml defines it. But with the same type of strings as mentioned in 2. The first line of a object has a - as a prefix. The following key value pairs inside the object are on the same "level" (have the same amount of spaces in front)
[join_event] are special keys that are predefined.
Strings with {{}} around them are also special keys but reflect either predefined variables or variables from the Config. (these can be parsed after the work of participle just fine as they get replaced anyway)
"Homeserver" and "HomeRoom" are like the pairs in 1 but are after the array means I need some kind of tracking if we left the array.

MTRNord · 2018-04-27T18:07:09Z

I hope thats detailed enough or atleast the right thing

alecthomas · 2018-04-30T10:44:46Z

If I understand correctly, this will not be possible to achieve with Participle because it isn't a context-free grammar (CFG). It looks similar to YAML, which requires context across lines in order to track the current indentation level.

I will update the README to clarify that Participle can only be used to parse CFGs.

Sorry about that :(

alecthomas · 2018-04-30T10:54:16Z

Actually I'm going to correct myself here and go back to my original suggestion: this will require a custom lexer.

Basically, each time the text is indented the lexer will need to emit an Indent token and update its state to track that indentation level. When unindentation occurs the lexer will need to emit an Unindent token.

The grammar will then need to match against these tokens.

eg. (this will not work but might give you a starting point)

type Entry struct {
  Key string `@Ident ":"`
  Value *Value `@@`
}

type Value struct {
  Array []*Value `Indent @@ { @@ } Unindent`
  Boolean *bool `| ( @"true" | "false")`
}

You might also need #3 to be solved.

MTRNord · 2018-04-30T11:53:10Z

@alecthomas I will try that. Thanks for the suggestion and for the help!

apparentlymart · 2018-07-18T03:02:38Z

In case it's helpful, I've written lexers for a few indentation-sensitive languages, and found a pattern that I found helpful:

I usually first write a "normal" lexer which (unusually) considers a newline as an explicit token and considers a sequence of consecutive other whitespace characters as an explicit token. This might give you a token stream like this for part of the example above:

Ident Commands
Punct :
Newline \n
Spaces (four)
Punct -
Ident Trigger
Punct :
Spaces (one)
Ident afk
Newline \n
Spaces (six)
Ident Response
(etc)

With that lexer working, you can then wrap it in another lexer that consumes tokens from the first while tracking a stack of currently-open indent levels, passing through verbatim any tokens that aren't Spaces and synthesizing Indent and Dedent tokens as needed between Newline tokens and whatever follows them.

It's important to keep a stack of indent levels that are "open" so you can generate the right number of synthetic Dedent tokens if multiple indented blocks all close at once.

I've not tried to do the above with participle.Lexer yet, but since its interface is to just pull one token at a time implementing the above shouldn't be too tricky with the indent stack and a state flag to deal with the need to "look behind" and see if the previous token was a newline.

I'm sure some or all of the above will be familiar to some readers but I thought I'd leave this here in case it's useful to someone who finds this issue in future. Doing the indent processing separately means that a more "normal" lexer (e.g. EBNF-based) can be used for most of the scanning, as long as it doesn't filter out the whitespace when generating tokens.

alecthomas · 2018-07-18T04:36:03Z

@apparentlymart Nice idea about composing the lexers 👍

It seems like it should be possible to provide a generic IndentAwareLexer using an approach like that, assuming the underlying lexer conforms to emitting Newline and Spaces tokens, as you describe.

apparentlymart · 2018-07-18T16:18:10Z

That could be useful, indeed! Another detail I meant to mention: indent-aware grammars often have certain syntax constructs where indentation-awareness is temporarily suspended.

For example, Python's lexer suspends generation of Indent and Dedent when there is a nonzero number of bracketing delimiters open.

This can also be handled in a wrapping lexer like I was describing, but it requires a little more state tracking (a stack of open brackets, or just a count if you're not concerned about handling incorrect nesting) and recognizing those open and close bracket tokens. The hypothetical IndentAwareLexer could potentially handle this too by taking in its configuration a set of token pairs that are to be treated as suspending the synthetic Indent and Dedent token emission.

pyrossh · 2020-03-07T18:47:49Z

@apparentlymart thanks. Took me a few tries to figure it out. If someone wants to know how a basic one is done. Its here https://github.com/pyros2097/pine/blob/master/compiler/lexer.go. Its still work in progress though and is not final.

alecthomas · 2023-08-14T01:56:27Z

I have a need for an indenting lexer in a side project I'm working on. It seems to work quite well so far. Once it's a bit more battle tested I think I'll include it in Participle itself, but for now here it is:

// Package indenter provides a lexer that inserts INDENT and DEDENT
// tokens based on indentation.
//
// It relies on the parent lexer to provide an "NL" token that matches
// newlines followed by whitespace.
package indenter

import (
	"io"
	"maps"
	"strings"

	"github.com/alecthomas/participle/v2/lexer"
)

func New(parent lexer.Definition) lexer.Definition {
	out := maps.Clone(parent.Symbols())
	if _, ok := out["NL"]; !ok {
		panic("parent lexer must have an NL symbol similar to \"[\\n\\r]+[\\s\\t]*\"")
	}
	if _, ok := out["Indent"]; ok {
		panic("parent lexer must not have an Indent symbol")
	}
	if _, ok := out["Dedent"]; ok {
		panic("parent lexer must not have a Dedent symbol")
	}
	next := lexer.TokenType(-1)
	for _, t := range out {
		if t <= next {
			next = t - 1
		}
	}
	out["Indent"] = next
	out["Dedent"] = next - 1
	return &indentLexerDef{parent: parent, symbols: out}
}

type indentLexerDef struct {
	parent  lexer.Definition
	symbols map[string]lexer.TokenType
}

func (i *indentLexerDef) Symbols() map[string]lexer.TokenType {
	return i.symbols
}

func (i *indentLexerDef) Lex(filename string, r io.Reader) (lexer.Lexer, error) {
	lex, err := i.parent.Lex(filename, r)
	if err != nil {
		return nil, err
	}
	return &indentLexer{
		lexer:      lex,
		nlType:     i.symbols["NL"],
		indentType: i.symbols["Indent"],
		dedentType: i.symbols["Dedent"],
	}, nil
}

var _ lexer.Definition = (*indentLexerDef)(nil)

type indentLexer struct {
	nlType     lexer.TokenType
	indentType lexer.TokenType
	dedentType lexer.TokenType
	indents    []string
	buffered   []lexer.Token
	lexer      lexer.Lexer
}

func (i *indentLexer) Next() (lexer.Token, error) {
	if len(i.buffered) > 0 {
		t := i.buffered[0]
		i.buffered = i.buffered[1:]
		return t, nil
	}

	t, err := i.lexer.Next()
	if err != nil || t.Type != i.nlType {
		return t, err
	}

	pos := t.Pos
	i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.nlType, Value: "\n"})
	pos.Line++
	pos.Column = 1

	// At this point we have an NL symbol. Figure out if we need to
	// insert Indent tokens or Dedent tokens.
	indent := strings.TrimLeft(t.Value, "\n\r")
	for j, ind := range i.indents {
		if strings.HasPrefix(indent, ind) {
			indent = indent[len(ind):]
			pos.Column += len(ind)
			pos.Offset += len(ind)
		} else {
			k := len(i.indents) - j
			i.indents = i.indents[:j]
			for ; k > 0; k-- {
				i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.dedentType, Value: "⇤"})
			}
			break
		}
	}
	if indent != "" {
		i.indents = append(i.indents, indent)
		i.buffered = append(i.buffered, lexer.Token{Pos: pos, Type: i.indentType, Value: "⇥"})
	}
	t = i.buffered[0]
	i.buffered = i.buffered[1:]
	return t, nil
}

var _ lexer.Lexer = (*indentLexer)(nil)

metaleap · 2024-07-12T17:13:59Z

Thanks already for that @alecthomas — and a question on this:

Once it's a bit more battle tested I think I'll include it in Participle itself

How viable for inclusion in Participle do you feel this is by now?

alecthomas closed this as completed Apr 30, 2018

alecthomas reopened this Apr 30, 2018

alecthomas closed this as completed May 7, 2018

alecthomas changed the title ~~How do I check if previous char is a :~~ Document how to parse indentation based grammars May 7, 2018

alecthomas reopened this May 7, 2018

This was referenced Sep 15, 2021

Clarify README: Does participle support custom lexers or not? #180

Closed

clarify README.md #181

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document how to parse indentation based grammars #20

Document how to parse indentation based grammars #20

MTRNord commented Apr 25, 2018 •

edited

Loading

alecthomas commented Apr 27, 2018

MTRNord commented Apr 27, 2018

alecthomas commented Apr 27, 2018

MTRNord commented Apr 27, 2018

MTRNord commented Apr 27, 2018

MTRNord commented Apr 27, 2018

alecthomas commented Apr 30, 2018

alecthomas commented Apr 30, 2018

MTRNord commented Apr 30, 2018 •

edited

Loading

apparentlymart commented Jul 18, 2018 •

edited

Loading

alecthomas commented Jul 18, 2018 •

edited

Loading

apparentlymart commented Jul 18, 2018

pyrossh commented Mar 7, 2020

alecthomas commented Aug 14, 2023

metaleap commented Jul 12, 2024

Document how to parse indentation based grammars #20

Document how to parse indentation based grammars #20

Comments

MTRNord commented Apr 25, 2018 • edited Loading

alecthomas commented Apr 27, 2018

MTRNord commented Apr 27, 2018

alecthomas commented Apr 27, 2018

MTRNord commented Apr 27, 2018

MTRNord commented Apr 27, 2018

MTRNord commented Apr 27, 2018

alecthomas commented Apr 30, 2018

alecthomas commented Apr 30, 2018

MTRNord commented Apr 30, 2018 • edited Loading

apparentlymart commented Jul 18, 2018 • edited Loading

alecthomas commented Jul 18, 2018 • edited Loading

apparentlymart commented Jul 18, 2018

pyrossh commented Mar 7, 2020

alecthomas commented Aug 14, 2023

metaleap commented Jul 12, 2024

MTRNord commented Apr 25, 2018 •

edited

Loading

MTRNord commented Apr 30, 2018 •

edited

Loading

apparentlymart commented Jul 18, 2018 •

edited

Loading

alecthomas commented Jul 18, 2018 •

edited

Loading