-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Fix String.split()
implementations
#3528
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
String.split()
implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job. I've just add a NIT-pick suggestion.
Also, is it possible to add a unit test?
Co-authored-by: Manuel Saelices <[email protected]> Signed-off-by: martinvuyk <[email protected]>
Hi, thanks for the review. Any type of test in mind that split tescases don't cover ? |
Signed-off-by: martinvuyk <[email protected]>
I thought this check would be broken in ❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
assert_equal(res6[2], "долор")
assert_equal(res6[3], "сит")
assert_equal(res6[4], "амет")
+ var res7 = in6.split("м")
+ assert_equal(res7[0], "Лоре")
+ assert_equal(res7[1], " ипсу")
+ assert_equal(res7[2], " долор сит а")
+ assert_equal(res7[3], "ет") BTW, I still think it's a good test to add. |
I'm not understanding, so the lines from |
Signed-off-by: martinvuyk <[email protected]>
It's just a diff if you want to complete it with more test. LGTM anyways so don't worry. Thanks for that contribution 🥇 |
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
…o into fix-split-implementations
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@JoeLoser can we merge this? 5 months of keeping this on top of the tree and with so many changes to string stuff is laborious. @ConnorGray I'm un-deprecating My plan is to then on top of this, similar to #3858, build an iterator taking this code. Then just adding a layer which assembles it into a list. |
[External] [CI] Fix benchmarks, clean up environment variables This is a series of small changes: 1. I replaced all of the import path environment variables being passed with `MODULAR_MOJO_MAX_IMPORT_PATH`. The others don't actually resolve to anything in Mojo, so they were just noise. I deduplicated them where applicable. I left the ones in the examples directory alone, for some reason they broke things. 2. I removed the env var in the top level pixi file, shouldn't be needed anymore 3. When looking for the precompiled package root, unconditionally look in the `.magic` folder. This is one of the key changes that I think fixes it @martinvuyk - I tested this on your branch (#3528) and got this: ``` ******************** TEST 'Mojo Standard Library Benchmarks :: collections/bench_string.mojo' FAILED ******************** Exit Code: 1 Command Output (stderr): -- RUN: at line 13: mojo /home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo -t + mojo /home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo -t /home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo:117:30: error: cannot implicitly convert 'List[StringSlice[(muttoimm items)]]' value to 'List[StringSlice[items]]' res = items.split(sequence.value()) ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~ /home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo:119:30: error: cannot implicitly convert 'List[StringSlice[(muttoimm items)]]' value to 'List[StringSlice[items]]' res = items.split() ~~~~~~~~~~~^~ mojo: error: failed to parse the provided Mojo source module -- ``` but I think this is due to your change, so you just need to update the benchmarking script. Co-authored-by: Alex Trotta <[email protected]> Closes #4149 MODULAR_ORIG_COMMIT_REV_ID: 1ee251ef31f8b011c9eb820a7843b10c87964d40
Main issue
Fix
String.split()
implementations to use a generic implementation and without assuming that indexing is by byte offset. Added all methods toStringLiteral
andStringSlice
. Some important optimizations were added by parametrizing and avoiding slicing with numeric tricks.Changes in behavior
This PR changes
split("")
behavior to be non-raising and return the separated unicode characters analogous to when the whole string has the separator at start, end, and in between every character. Closes #3635String
,StringLiteral
, andStringSlice
.split()
now return aList[StringSlice]
.Benchmark results:
CPU: Intel® Core™ i7-7700HQ
improvement metric: markdown percentage improvement (
(old_value - new_value) / old_value
)Average improvement for split with a sequence: 91.2486% . In orders of magnitude, this is a 11x improvement
Average improvement for split on any whitespace: 99.9975% . In orders of magnitude, this is a 40k x improvement
bench_string_split[1000000]
bench_string_split_none[1000000]