feat: allow completing buffer words with unicode (#392)

mikavilpas · web-flow · commit e1d3e9d4a644 · 2024-11-29T01:29:47.000-05:00
Issue ===== When using the "buffer" provider in a buffer with the following contents: ``` työmaa | <- cursor ``` When typing "ty" the word "työmaa" is not matched, even though it's part of the buffer and should be shown as a suggestion. > Trivia: työmaa is Finnish and can be translated as "construction site" > in English. 🙂 The point is to have a word with non-ASCII characters. Solution ======== Change the regex pattern to match any word character, including unicode characters. The change allows matching the word "työmaa" when typing "ty" in a buffer. Considerations ============== Looking at the Unicode section on rust regular expressions https://docs.rs/regex/latest/regex/#unicode For maximum performance, it's recommended to stick to ASCII characters when possible. However, I'm using the following regex in blink-ripgrep.nvim to fetch completions for all the files in the project: ```lua -- completions are fetched using rg (ripgrep), which is written in rust -- and I suppose also uses the regex crate. return { "rg", "--no-config", "--json", "--context=" .. (opts.context_size or 5), "--word-regexp", "--max-filesize=" .. (opts.max_filesize or "1M"), "--ignore-case", "--", prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()), } ``` https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72 The pattern is evaluated against all lines in every file in the current project, and I have found performance to be very good. I also ran the proposed pattern in a large codebase I work in. I ran this in a project with 4154 files, totalling 815283 lines of code (calculated with the `tokei` cli application, v. 12.1.2). `rg` is able to search for this regex pattern with the following results: ```sh $ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null' Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null Time (mean ± σ): 194.2 ms ± 7.7 ms [User: 120.8 ms, System: 316.0 ms] Range (min … max): 185.6 ms … 214.7 ms 14 runs ``` My guess is that since the buffer provider is only used in the current buffer, the performance impact should be minimal.
diff --git a/lua/blink/cmp/fuzzy/lib.rs b/lua/blink/cmp/fuzzy/lib.rs
@@ -12,7 +12,7 @@ mod fuzzy;
 mod lsp_item;
 
 lazy_static! {
-    static ref REGEX: Regex = Regex::new(r"[A-Za-z][A-Za-z0-9_\\-]{2,32}").unwrap();
+    static ref REGEX: Regex = Regex::new(r"\w[\w0-9_\\-]{2,32}").unwrap();
     static ref FRECENCY: RwLock<Option<FrecencyTracker>> = RwLock::new(None);
 }
 

Original file line number	Diff line number	Diff line change
`@@ -12,7 +12,7 @@ mod fuzzy;`
`12`	`12`	`mod lsp_item;`
`13`	`13`
`14`	`14`	`lazy_static! {`
`15`		`- static ref REGEX: Regex = Regex::new(r"[A-Za-z][A-Za-z0-9_\\-]{2,32}").unwrap();`
	`15`	`+ static ref REGEX: Regex = Regex::new(r"\w[\w0-9_\\-]{2,32}").unwrap();`
`16`	`16`	`static ref FRECENCY: RwLock<Option<FrecencyTracker>> = RwLock::new(None);`
`17`	`17`	`}`
`18`	`18`