Include a `split` function (202X feature) #241

certik · 2020-10-27T19:14:26Z

Fortran 202X will include a new intrinsic split function:

It was approved, and then the API was changed after approving it. We should implement the latest approved version in stdlib, and play with it and ensure that the API looks good. And if we discover some improvements, we should propose them at the February 2021 Fortran Standards meeting.

Then once split becomes part of the next Fortran standard, we can have a section in stdlib called "backwards compatibility", where we can have a "reference implementation" of such new features, so that people can use them right away even if some compiler might not support them yet.

milancurcic · 2020-11-01T01:44:15Z

I put an implementation here: https://github.com/milancurcic/fortran202x_split.

This is a "naive" implementation--I went for what seemed to me as the simplest solution. The split_first_last specific subroutine, which is the second form listed in 20-139, does much of the grunt work. I wrote it in a mostly functional style, so it does some unnecessary copies that we may want to refactor in an imperative style.

For tests so far I only used the three examples from 16.9.194 in 20-007. They seem okay. More tests will be needed at the time of PR for stdlib.

At this time, I'd like to get feedback on this before I prepare a PR for stdlib.

jvdp1 · 2020-11-06T21:15:11Z

Thank you for the implementation. I played a bit with it and it looks good to me. The API seemed a bit strange at first.
I wonder whether the version with pos overlaps somewhat with the intrinsic function index (note: I understand that this is an implementation of the proposed standard).

certik · 2020-11-06T21:48:55Z

@jvdp1 thanks for the feedback. That is precisely why I suggested we do this, to get more experience about the API and perhaps propose some changes to it before it gets standardized. @jvdp1 how would you change the API to be more natural?

milancurcic · 2020-11-07T01:01:33Z

Here's my impression of the API.

Input dummy argument set is a scalar character string that contains all separators to be used for delimiting tokens. For example, if you pass ";, " as a value of set, then any of these characters will be used for delimiting. In this current definition, I don't think it's possible to delimit using a multiple-character string, although I can't think of a use case for this. It seems to me that a more intuitive API would be for set to be an array of characters, so:

character, intent(in) :: set(:)

instead of

character(*), intent(in) :: set

Then you'd pass it as [";", ",", " "] instead of ";, ". But this is I think more a matter of style than anything else.

I found the API to be idiomatic Fortran--subroutines with intent(in out) arguments for output--which is not in itself a bad thing and helps minimize unnecessary copies. I only wish there was also a convenience function for when a user only wants the tokens as a result and nothing else, and doesn't mind an extra copy. The interface would look like this:

pure function split(string, set) result(tokens)
  character(*), intent(in) :: string
  character(*), intent(in) :: set
  character(:), allocatable :: tokens(:)
end function split

and then you call it like this:

tokens = split(string, set)

which would allow a more functional style by passing split(string, set) as an argument to other functions.

So, to that end, I'd propose that in the stdlib we also include this 4th specific procedure, even if it ends up being non-standard, or an extension. So we'd have a total of 4 forms of split:

subroutine split(string, set, tokens, separator)
subroutine split(string, set, first, last)
subroutine split(string, set, pos, back)
function split(string, set)

ivan-pi · 2020-11-07T11:45:10Z

Unfortunately, in Fortran you cannot combine functions and subroutines under the same generic interface, so the last split has to be named differently.

milancurcic · 2020-11-07T11:52:39Z

Yes, I remembered this rule later. If people desire this as a function, maybe string_tokens, or strtok to mirror the C analog.

jvdp1 · 2020-11-08T16:25:39Z

> @jvdp1 thanks for the feedback. That is precisely why I suggested we do this, to get more experience about the API and perhaps propose some changes to it before it gets standardized. @jvdp1 how would you change the API to be more natural?

First, I expected a function, similar to scan, which has a similar interface (scan(string, set [, back [, kind]])). I was thus a bit surprised that it was a subroutine. However, it does make sense given the different outputs.

Second, I don't see the value of the return value pos, if you can get separator, or first. (Maybe efficiency?) I must be missing something there (note that I didn't read all the discussions about this proposal). Furthermore, this version of split is quite similar to the intrinsic function index.

Finally, I was wondering why last is not optional. Isn't it the case that last(1:ntokens-1) = first(2:ntokens) - 1 (or something like that)? Am I missing something?

Anyway, this version would be already a great addition in stdlib and later in the Standard.

milancurcic · 2020-11-09T06:06:36Z

Yes, pos is useful for efficiency if you're searching for a delimiter at a specific position in the string. Both other forms parse the whole string.
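To make the incremental nature of the pos form concrete, here is a minimal sketch. It does not call the proposed intrinsic; next_delimiter is a hypothetical helper written for this example that mimics the forward search with the existing scan intrinsic, assuming that pos is advanced to the next delimiter after the current position, or to len(string)+1 when none is left.

```fortran
program demo_pos
  implicit none
  character(*), parameter :: string = "one, two"
  character(*), parameter :: set = ", "
  integer :: pos, start

  pos = 0
  do while (pos <= len(string))
    start = pos + 1
    call next_delimiter(string, set, pos)
    ! Print only non-empty tokens; the token spans start .. pos-1.
    if (pos > start) print '(a)', string(start:pos-1)
  end do

contains

  ! Hypothetical helper (not the standard or stdlib routine):
  ! moves pos to the next character of string that is in set.
  subroutine next_delimiter(string, set, pos)
    character(*), intent(in) :: string, set
    integer, intent(inout) :: pos
    integer :: offset

    ! scan returns 0 when no character of set occurs in the substring.
    offset = scan(string(pos+1:), set)
    if (offset == 0) then
      pos = len(string) + 1
    else
      pos = pos + offset
    end if
  end subroutine next_delimiter

end program demo_pos
```

Each call looks only at the part of the string after pos, so a caller that needs just the first token or two never scans the rest.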

You can't predict the values of last based on values of first because if token(n) is an empty string, last(n) is first(n)-1.
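A small sketch of the first/last bookkeeping may help here. The first_last subroutine below is a hypothetical stand-in written for this example with the scan intrinsic; it is not the code from the repository, and it only assumes that every delimiter ends one token and starts the next.

```fortran
program demo_first_last
  implicit none
  character(*), parameter :: string = "a,,b"
  character(*), parameter :: set = ","
  integer, allocatable :: first(:), last(:)
  integer :: i

  call first_last(string, set, first, last)
  do i = 1, size(first)
    ! An empty token shows up as last(i) == first(i) - 1.
    print '(a,i0,a,i0,a,i0,a)', "token ", i, ": first=", first(i), &
      " last=", last(i), ' "' // string(first(i):last(i)) // '"'
  end do

contains

  ! Hypothetical helper, not the stdlib or standard routine.
  subroutine first_last(string, set, first, last)
    character(*), intent(in) :: string, set
    integer, allocatable, intent(out) :: first(:), last(:)
    integer :: pos, start

    allocate(first(0), last(0))
    start = 1
    do
      ! scan returns 0 when no character of set occurs in the substring.
      pos = scan(string(start:), set)
      if (pos == 0) exit
      first = [first, start]
      last = [last, start + pos - 2]
      start = start + pos
    end do
    ! The final token runs to the end of the string.
    first = [first, start]
    last = [last, len(string)]
  end subroutine first_last

end program demo_first_last
```

For the input "a,,b" the second token is empty, so this sketch yields first = [1, 3, 4] and last = [1, 2, 4].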

certik · 2020-11-09T06:32:34Z

We should also do comparisons with Python and other languages, as we usually do for stdlib.


milancurcic · 2020-11-09T19:48:47Z

Similar capability in other languages:

milancurcic · 2020-11-09T19:59:19Z

The key differences seem to be:

Behavior of the separator. Fortran split works like that of C and C++ strtok (multiple characters in a string are possible separators), whereas in Python, Julia, MATLAB, and Go, the whole delimiter string is used as a single delimiter. Rust's delimiter allows specifying a pattern. For a few of these implementations, delimiter defaults to an empty space if omitted;
Result value. Fortran split updates the in out argument in-place. strtok returns a pointer to the beginning of the tokens sequence. Others simply return a list or array of strings as function result.
Unlike others, Fortran split allows returning first and last indices instead of tokens. If this functionality exists in other languages, it's defined in some other function. So, Fortran here packs multiple functionalities under one name, a design that I'm not fond of, but I can get used to it.
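The difference between a set of separator characters and a single multi-character delimiter can be seen with two existing intrinsics. This is only an illustration of the two matching rules using scan and index; it does not call split itself.

```fortran
program demo_set_vs_delimiter
  implicit none
  character(*), parameter :: string = "a;b, c"

  ! scan treats its second argument as a set:
  ! any one of the characters counts as a match.
  print '(i0)', scan(string, ",;")    ! 2, the position of ';'

  ! index treats its second argument as one substring
  ! that must match as a whole.
  print '(i0)', index(string, ", ")   ! 4, where ", " starts
  print '(i0)', index(string, ",;")   ! 0, since ",;" never occurs as a unit
end program demo_set_vs_delimiter
```

The set argument of split follows the first rule, while the delimiter in Python, Julia, MATLAB, and Go follows the second.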

certik · 2020-11-10T00:10:49Z

Thanks @milancurcic, very helpful, exactly what I was looking for. The first example:

https://github.com/milancurcic/fortran202x_split/blob/a49ccf5b0775732cc4d1087c71fd513d4e921a6a/test/main.f90#L13

I think is natural. That corresponds to the function split_tokens:

https://github.com/milancurcic/fortran202x_split/blob/a49ccf5b0775732cc4d1087c71fd513d4e921a6a/src/fortran202x_split.f90#L14

So I think I like the API of split_tokens.

I am not fond of bundling the other functionality into the same split overload.
In terms of usability, I think the split_tokens will be the most useful and helpful. The other split_first_last and split_pos might not be as useful, it's hard to tell. I would feel much better if they could be used in stdlib in the experimental section first and see.

jvdp1 · 2020-11-10T13:45:07Z

I am also not fond of the current API, but even like that, it would be a nice addition.
I would suggest to add the current version in the experimental namespace of stdlib, such that people can test it.

milancurcic · 2020-11-11T00:20:02Z

I added the string_tokens function which is a thin wrapper around split_tokens.

This allows the user to do:

tokens = string_tokens(string, set)

There is also a simple benchmark program in app/main.f90 to compare the run-time between string_tokens function and the equivalent split_tokens subroutine, given a decent size input string (data included in the repo). On my laptop:

$ fpm run
 split subroutine, elapsed  0.482727677     seconds
 string_tokens function, elapsed  0.565418363     seconds

jvdp1 · 2020-11-11T06:21:43Z

Thank you @milancurcic for the function and the test. Can the difference be explained by calling the subroutine inside the function? Or was the subroutine inlined in the function?

milancurcic · 2020-11-11T15:49:39Z

The subroutine is not inlined:

  pure function string_tokens(string, set) result(tokens)
    !! Splits a string into tokens using characters in set as token delimiters.
    character(*), intent(in) :: string
    character(*), intent(in) :: set
    character(:), allocatable :: tokens(:)
    call split_tokens(string, set, tokens)
  end function string_tokens

but it allocates the function result before returning it to the caller. It's this extra allocation that makes the difference.

There is probably some minimal overhead with calling the subroutine, but it should be negligible.

jvdp1 · 2020-11-18T06:39:33Z

A long discussion on the split function was held during the Fortran Monthly call (November 2020) (from 5:39).

milancurcic · 2020-12-09T16:46:01Z

Comment from Youtube user ES on https://youtube.com/watch?v=HI-Yhn7Q8Ko:

Here's how I might choose to implement split in C++ (but you should be able to do it similarly in Fortran (which I'm still learning)):

The function or subroutine would have a string input, a set of delimiters input, and it would return/modify a "split_string" object whose structure would look like:

members: the string itself, an array of index locations of the delimiters in the string

member functions: A "get_nth_token" function to return the nth token of the string. Or even overload the operator[].

This way I keep allocations small and to a minimum, and scan through the string only once to construct the index array.

In my opinion, the best solution would have to involve a derived type because of the ragged array nature of the collection of tokens. Otherwise, the only solution that makes sense to me in the discussion is the split_first_last and the "find" version.

certik · 2020-12-09T20:55:26Z

@esterjo thank you and welcome! I personally agree and would like new additions to Fortran to first go into a library, get the API ironed out, get some usage, and only then propose it for the Fortran standard itself if there is interest. But other people at the committee have a different opinion on this and voted to have this in the language itself right away. So the second best we can do is to implement this into stdlib and identify potential issues with the API, and submit proposals for the committee to fix some of the issues that we found, and that is what @milancurcic plans to do. :)

esterjo · 2020-12-09T21:30:25Z

@certik Got it. Thank you for clarifying the situation for me, I wasn't aware.

esterjo · 2020-12-09T21:48:40Z

Apologies if this isn't the right place, but as a broad comment, if I could make one wish to the ISO gods it would be to make it easier for a user to build libraries. This would make it easier for the language to grow with the community.

certik · 2020-12-09T22:05:51Z

@esterjo, you can propose that at https://github.com/j3-fortran/fortran_proposals, just open a new issue and try to describe the features you have in mind in more detail that would make it easier to build libraries. We can discuss it there.

ivan-pi · 2020-12-10T10:20:22Z

Hi @esterjo, I think your idea with a "split string" type is good. First we would need to agree upon some encapsulated string type (see #69). There are several disjoint community efforts in this direction already. With the new deferred-length character strings, it is quite simple to use something like:

type :: string
  character(len=:), allocatable :: s
end type

A generic interface can be used to overload the split function to return an array of type(string), allocatable :: string_tokens(:) which are of the correct length each. Internally, this function could call split_first_last or split_pos to minimize internal memory shuffling. For Fortran users too lazy to bother with derived types, it will not clash with the version that returns characters with trailing whitespace.
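As a rough sketch of that idea, the function below turns first/last index arrays into an array of type(string) whose elements each have exactly the right length. The names string_tokens_demo and tokens_from_bounds are made up for this example, and the index arrays are written by hand here instead of coming from split_first_last.

```fortran
module string_tokens_demo
  implicit none

  type :: string
    character(len=:), allocatable :: s
  end type string

contains

  ! Hypothetical helper: copies each token str(first(i):last(i))
  ! into its own deferred-length component, so there is no padding.
  function tokens_from_bounds(str, first, last) result(tokens)
    character(*), intent(in) :: str
    integer, intent(in) :: first(:), last(:)
    type(string), allocatable :: tokens(:)
    integer :: i

    allocate(tokens(size(first)))
    do i = 1, size(first)
      tokens(i)%s = str(first(i):last(i))
    end do
  end function tokens_from_bounds

end module string_tokens_demo

program demo_string_tokens
  use string_tokens_demo
  implicit none
  type(string), allocatable :: tokens(:)
  integer :: i

  ! Bounds for "one" and "three" in "one,three", written out by hand.
  tokens = tokens_from_bounds("one,three", [1, 5], [3, 9])
  do i = 1, size(tokens)
    print '(a,i0)', tokens(i)%s // " has length ", len(tokens(i)%s)
  end do
end program demo_string_tokens
```

The trade-off is one allocation per token, in exchange for tokens that carry no trailing whitespace.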

With the great strides fpm has been making, it is already becoming easier to use other community packages. In fact you can already test the version @milancurcic prepared at: https://github.com/milancurcic/fortran202x_split

esterjo · 2020-12-17T03:40:46Z

How about this:

There should be a string_array type. This type:

Would hold all the "string" elements of the array by storing them side by side in one contiguous character string, called "data".
It would also store an index array (possibly empty) whose nth element marks the end of the nth string in the "data".
A string type would be an instance of this where this index array is empty
Making use of this index array would allow for a function to return the nth string in the "data"

Inheriting from this string_array would be a split_string type:

It would be no different, but its index array would simply be the locations of delimiters in the contiguous character string
By making use of this index array this child-type would have a function to return the nth token (perhaps overloading the one used by the parent type), which would just be the nth string in the "data" without the first character
Perhaps it holds an array of delimiters

Some things I like about this kind of implementation:

it reduces memory fragmentation because you do not allocate multiple chunks of heap for each string in the array, while also avoiding padding short strings like a simple character array of fixed size elements.
splitting a string type does not reallocate its character vector, but simply modifies the index array
extra capacity can be added to the end of the "data" array to allow for fast addition of small string_arrays to the end.
if the index array has 2 columns, then the string elements can be accessed as if they are laid out in a matrix
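A simplified sketch of the contiguous storage idea, under several assumptions: the component is called buffer here instead of "data", the index array always holds one end position per stored string, and the split_string child type, spare capacity, and two-column index are left out. The names string_array_demo, get, and append are chosen for this example only.

```fortran
module string_array_demo
  implicit none

  type :: string_array
    character(len=:), allocatable :: buffer  ! all strings side by side
    integer, allocatable :: ends(:)          ! ends(n): last character of the nth string
  contains
    procedure :: get
    procedure :: append
  end type string_array

contains

  ! Returns the nth string as a slice of the contiguous buffer.
  function get(self, n) result(s)
    class(string_array), intent(in) :: self
    integer, intent(in) :: n
    character(len=:), allocatable :: s
    integer :: start

    start = 1
    if (n > 1) start = self%ends(n-1) + 1
    s = self%buffer(start:self%ends(n))
  end function get

  ! Appends one string to the end of the buffer and records where it ends.
  subroutine append(self, s)
    class(string_array), intent(inout) :: self
    character(*), intent(in) :: s

    if (.not. allocated(self%buffer)) then
      self%buffer = s
      self%ends = [len(s)]
    else
      self%buffer = self%buffer // s
      self%ends = [self%ends, len(self%buffer)]
    end if
  end subroutine append

end module string_array_demo

program demo_string_array
  use string_array_demo
  implicit none
  type(string_array) :: arr
  integer :: i

  call arr%append("ab")
  call arr%append("cde")
  call arr%append("")
  do i = 1, size(arr%ends)
    print '(i0,a)', i, ': "' // arr%get(i) // '"'
  end do
end program demo_string_array
```

After the three appends the buffer holds "abcde" and ends is [2, 5, 5], so the empty third string costs only one integer. This version reallocates the buffer on every append; the spare capacity mentioned above would avoid that.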

certik mentioned this issue Oct 27, 2020

SPLIT intrinsic j3-fortran/fortran_proposals#187

Open

milancurcic self-assigned this Nov 1, 2020

awvwgk mentioned this issue Dec 13, 2020

Document relation of this project with stdlib/j3-fortran milancurcic/fortran202x_split#9

Closed

ivan-pi mentioned this issue Dec 18, 2020

Proposal for lists of strings #268

Open

This was referenced Feb 14, 2021

Extend stdlib_ascii logical functions to character strings #321

Open

List of strings (implementation ideas) #322

Open

ivan-pi mentioned this issue Mar 30, 2021

Routines to handle (allocatable) character arrays #315

Open

awvwgk mentioned this issue Apr 11, 2021

Strip and chomp #385

Open

ivan-pi mentioned this issue Apr 30, 2021

Procedures from iso_fortran_strings #406

Open

awvwgk added the topic: standard Features from upcoming standards label Sep 18, 2021
