Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Fix String.split() implementations #3528

Open
wants to merge 151 commits into
base: main
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Sep 22, 2024

Main issue

Fix String.split() implementations to use a generic implementation and without assuming that indexing is by byte offset. Added all methods to StringLiteral and StringSlice. Some important optimizations were added by parametrizing and avoiding slicing with numeric tricks.

Changes in behavior

This PR changes split("") behavior to be non-raising and return the separated unicode characters analogous to when the whole string has the separator at start, end, and in between every character. Closes #3635

String, StringLiteral, and StringSlice .split() now return a List[StringSlice].

Benchmark results:

CPU: Intel® Core™ i7-7700HQ

improvement metric: markdown percentage improvement ((old_value - new_value) / old_value)

Average improvement for split with a sequence: 91.2486% . In orders of magnitude, this is a 11x improvement
Average improvement for split on any whitespace: 99.9975% . In orders of magnitude, this is a 40k x improvement

Name old_value (ms) new_value (ms) improvement
bench_string_split[1000000] 5.6970877871796 0.503624904955618 0.912486298354641
bench_string_split_none[1000000] 147157.6808882 3.78480079182398 0.999975287456096

@martinvuyk martinvuyk requested a review from a team as a code owner September 22, 2024 23:17
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk changed the title [stdlib] Fix split implementations [stdlib] Fix String.split() implementations Sep 23, 2024
Copy link
Contributor

@msaelices msaelices left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job. I've just add a NIT-pick suggestion.

Also, is it possible to add a unit test?

Co-authored-by: Manuel Saelices <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

Great job. I've just add a NIT-pick suggestion.

Also, is it possible to add a unit test?

Hi, thanks for the review. Any type of test in mind that split tescases don't cover ?

@msaelices
Copy link
Contributor

msaelices commented Sep 23, 2024

Great job. I've just add a NIT-pick suggestion.
Also, is it possible to add a unit test?

Hi, thanks for the review. Any type of test in mind that split tescases don't cover ?

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

@martinvuyk
Copy link
Contributor Author

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

I'm not understanding, so the lines from var res7 ... onwards didn't get merged in another PR and you'd like me to add them here?

@msaelices
Copy link
Contributor

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

I'm not understanding, so the lines from var res7 ... onwards didn't get merged in another PR and you'd like me to add them here?

It's just a diff if you want to complete it with more test. LGTM anyways so don't worry. Thanks for that contribution 🥇

Signed-off-by: martinvuyk <[email protected]>
@skongum02 skongum02 deleted the branch modular:main January 29, 2025 18:58
@skongum02 skongum02 closed this Jan 29, 2025
@skongum02 skongum02 reopened this Jan 29, 2025
@skongum02 skongum02 changed the base branch from nightly to main January 29, 2025 20:54
@martinvuyk martinvuyk marked this pull request as ready for review February 1, 2025 19:51
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

martinvuyk commented Feb 22, 2025

@JoeLoser can we merge this? 5 months of keeping this on top of the tree and with so many changes to string stuff is laborious.

@ConnorGray I'm un-deprecating __iter__ here because it is needed and is simply an iterator over codepoint slices. Here you can also see many examples where the performance of building a Codepoint (encoding from utf8 to utf32) would be unnecessarily expensive.

My plan is to then on top of this, similar to #3858, build an iterator taking this code. Then just adding a layer which assembles it into a list.

modularbot pushed a commit that referenced this pull request Mar 19, 2025
[External] [CI] Fix benchmarks, clean up environment variables

This is a series of small changes:
1. I replaced all of the import path environment variables being passed
with `MODULAR_MOJO_MAX_IMPORT_PATH`. The others don't actually resolve
to anything in Mojo, so they were just noise. I deduplicated them where
applicable. I left the ones in the examples directory alone, for some
reason they broke things.
2. I removed the env var in the top level pixi file, shouldn't be needed
anymore
3. When looking for the precompiled package root, unconditionally look
in the `.magic` folder. This is one of the key changes that I think
fixes it

@martinvuyk - I tested this on your branch
(#3528) and got this:
```
******************** TEST 'Mojo Standard Library Benchmarks :: collections/bench_string.mojo' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 13: mojo /home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo -t
+ mojo /home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo -t
/home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo:117:30: error: cannot implicitly convert 'List[StringSlice[(muttoimm items)]]' value to 'List[StringSlice[items]]'
            res = items.split(sequence.value())
                  ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
/home/alextrotta/Documents/martin/mojo/mojo/stdlib/benchmarks/collections/bench_string.mojo:119:30: error: cannot implicitly convert 'List[StringSlice[(muttoimm items)]]' value to 'List[StringSlice[items]]'
            res = items.split()
                  ~~~~~~~~~~~^~
mojo: error: failed to parse the provided Mojo source module

--
```
but I think this is due to your change, so you just need to update the
benchmarking script.

Co-authored-by: Alex Trotta <[email protected]>
Closes #4149
MODULAR_ORIG_COMMIT_REV_ID: 1ee251ef31f8b011c9eb820a7843b10c87964d40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Make String.split with sep non-raising
7 participants