feat: allow completing buffer words with unicode (#392)
Issue
=====
When using the "buffer" provider in a buffer with the following
contents:
```
työmaa
| <- cursor
```
Typing "ty" does not match the word "työmaa", even though it's part of
the buffer and should be shown as a suggestion.
> Trivia: työmaa is Finnish and can be translated as "construction site"
> in English. 🙂 The point is to have a word with non-ASCII characters.
Solution
========
Change the regex pattern to match any word character, including unicode
characters.
The change allows matching the word "työmaa" when typing "ty" in a
buffer.
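To illustrate the difference, here is a minimal sketch using Python's
`re` module rather than the Rust regex crate the provider relies on. The
ASCII-only pattern and the `complete` helper are hypothetical stand-ins,
not the provider's actual code; the point is only that an ASCII character
class splits "työmaa" at the "ö", while `\w` keeps the word whole.

```python
import re

# "työmaa" with a precomposed ö (U+00F6), written as an escape so the
# example does not depend on how this file is encoded.
BUFFER = "ty\u00f6maa\n"

# Hypothetical ASCII-only word pattern (an assumption for illustration).
ASCII_WORD = re.compile(r"[A-Za-z0-9_-]+")
# Unicode-aware pattern: in Python 3, \w matches Unicode letters in str input.
UNICODE_WORD = re.compile(r"[\w-]+")


def complete(prefix, text, pattern):
    """Return words found by pattern that start with prefix and extend it."""
    words = pattern.findall(text)
    return [w for w in words if w.startswith(prefix) and len(w) > len(prefix)]


ascii_matches = complete("ty", BUFFER, ASCII_WORD)
unicode_matches = complete("ty", BUFFER, UNICODE_WORD)
print(ascii_matches)    # the ASCII pattern only sees "ty" and "maa"
print(unicode_matches)  # the Unicode pattern sees the whole word
```

With the ASCII class the buffer yields the fragments "ty" and "maa", so
there is nothing to suggest for the prefix "ty". With `\w` the full word
is extracted and offered as a completion.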
Considerations
==============
The Unicode section of the Rust regex crate documentation
https://docs.rs/regex/latest/regex/#unicode
recommends sticking to ASCII characters when possible for maximum
performance. However, I'm using the following regex in
blink-ripgrep.nvim to fetch completions for all the files in the
project:
```lua
-- completions are fetched using rg (ripgrep), which is written in rust
-- and I suppose also uses the regex crate.
return {
"rg",
"--no-config",
"--json",
"--context=" .. (opts.context_size or 5),
"--word-regexp",
"--max-filesize=" .. (opts.max_filesize or "1M"),
"--ignore-case",
"--",
prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w
vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()),
}
```
https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72
The pattern is evaluated against all lines in every file in the current
project, and I have found performance to be very good.
I also ran the proposed pattern against a large codebase I work in: a
project with 4154 files, totalling 815283 lines of code (calculated
with the `tokei` CLI application, v. 12.1.2).
`rg` is able to search for this regex pattern with the following
results:
```sh
$ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null'
Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null
Time (mean ± σ): 194.2 ms ± 7.7 ms [User: 120.8 ms, System: 316.0 ms]
Range (min … max): 185.6 ms … 214.7 ms 14 runs
```
My guess is that since the buffer provider only searches the current
buffer, the performance impact should be minimal.