Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace substring in character/string_type #366

Open
ivan-pi opened this issue Mar 27, 2021 · 23 comments
Open

Replace substring in character/string_type #366

ivan-pi opened this issue Mar 27, 2021 · 23 comments
Assignees
Labels
API Discussion on a specific API for a proposal (exploratory) easy Difficulty level is easy and good for starting into this project topic: strings String processing

Comments

@ivan-pi
Copy link
Member

ivan-pi commented Mar 27, 2021

I also propose to add a few other functions that can be commonly used.

  • Replace function, this can be implemented in several ways:

    • First implementation: string.replace(pos , len, new_string)
      • pos = starting position of the sub-string to be replaced.
      • len = length of the substring
      • new_string = the string that will replace the substring
    • Second implementation: string.replace(pos, old_string, new_string)
      • old_string = substring to be replaced.
      • pos = position to start the search of old_string
      • new_string = string that will replace old_substring
      • This function will replace only the first occurrence of the old_string.
    • Third implementation: string.replaceall(pos, old_string, new_string)
      • This meaning of variables is similar to the second implementation.
      • The difference is it will find and replace all the occurrences of old_string after pos and not just one.
  • We already have an operator to Concat a string with another but I feel we should also implement a function for the same.

The community can have a discussion over the need for such functions and if needed also suggest some other implementation of Replace function.

Originally posted by @ChetanKarwa in #364 (comment)

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 27, 2021

This has also been requested a few times before.

@zbeekman wrote in #69 (comment):

sub: from Ruby, replaces the first occurance of a substring with a substitution, optional argument to replace the last string
gsub: Replaces all occurrences with a substring

@jacobwilliams suggested it also in j3-fortran/fortran_proposals#96 (comment):

Back to original topic. Here's one I use all the time:

1. `string_replace`

2. Replace all occurrences of one string with another

3. Yes, Python has this:
>>> 'aaaa'.replace('a','bb')
'bbbbbbbb'
1. Something along these lines:
character(len=:),allocatable :: str
str = 'aaaa'
call string_replace(str,'a','bb')
write(*,*) str ! writes bbbbbbbb

@awvwgk
Copy link
Member

awvwgk commented Mar 27, 2021

There is also a proposal from the string thread:

Gsub

  • Global substitution. Same as @jacobwilliams' string_replace as far as I can tell. Eventually it would be nice to support regular expressions like Ruby, but, for now, we should start smaller and just make it behave like:
    sed 's/<pattern>/<replacement>/g'
  • Kinda Ruby, Python and sed. Behave like Python without the optional count argument
  • Fortran:
gsub("hello", "l", "L") ! "heLLo"
gsub("the quick brown fox jumped over the lazy dog", "the", "a")
    ! "a quick brown fox jumped over a lazy dog"
!! Or OO approach
s = "hello"
s%sub("l","L")              ! "heLLo"

Sub

  • Replace first (or last) instance of one substring with another
  • Again, similarities with Python and Ruby, but not an exact match
  • Fortran:
sub("hello", "l", "L") ! "heLlo"
sub("hello", "l", "L", back=.true.) ! "helLo"
sub("the quick brown fox jumped over the lazy dog", "the", "a")
    ! "a quick brown fox jumped over the lazy dog"
sub("the quick brown fox jumped over the lazy dog", "the", "a", back=.true.)
    ! "the quick brown fox jumped over a lazy dog"
!! Or OO approach
s = "hello"
s%sub("l","L")              ! "heLlo"
s%sub("l","L", back=.true.) ! "helLo"

Originally proposed by @zbeekman in #69 (comment)

@awvwgk awvwgk added easy Difficulty level is easy and good for starting into this project topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... labels Mar 27, 2021
@awvwgk
Copy link
Member

awvwgk commented Mar 27, 2021

We have to be a bit careful with the string related patches, because there are only a few modules and many independent features. This will likely result in many conflicting patches and will require a lot of rebasing along the way. The module to put those functions in would be stdlib_strings, which is not yet created on the default branch.

@awvwgk awvwgk added the API Discussion on a specific API for a proposal (exploratory) label Mar 27, 2021
@aman-godara
Copy link
Member

aman-godara commented Mar 28, 2021

  • Third implementation: string.replaceall(pos, old_string, new_string)

    • This meaning of variables is similar to the second implementation.
    • The difference is it will find and replace all the occurrences of old_string after pos and not just one.

What would we expect the function to return if the input is "qwqwqw" and we want to replace all "qwq" with let's say "abc"?
Option 1: "abcwqw"
Option 2: "abcabcw" (here length of the output is greater than the length of the input)
[EDIT]: Option 3: abcw

Option 2 is inspired from the fact that the second q is part of two occurrences of qwq. I cannot think of a particular use case where a user might want option 2 to be the output but it occurred to me so I thought it would be better to put it on the table.

@urbanjost
Copy link

features I would like would be to

  • specify the Nth occurrence starting from the left or right end of the string,
  • ignore case in the old string when doing a match,
  • and to be a subroutine for efficiency but a function for convenience.
    For example:
    replace. Starting at the other end should probably be implemented as with INDEX() and so on. That particular routine has had feature creep over the years and could be written to be more efficient but has the basic functionality I have found useful.

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 28, 2021

  • Python

    str.replace(old, new[, count])
    Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

  • Javascript

    const newStr = str.replace(regexp|substr, newSubstr|function)
    The replace() method returns a new string with some or all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match. If pattern is a string, only the first occurrence will be replaced.

    The original string is left unchanged.

    const newStr = str.replaceAll(regexp|substr, newSubstr|function)
    The replaceAll() method returns a new string with all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match.

    The original string is left unchanged.

  • Julia

    replace(s::AbstractString, pat=>r; [count::Integer])
    Search for the given pattern pat in s, and replace each occurrence with r. If count is provided, replace at most count occurrences. pat may be a single character, a vector or a set of characters, a string, or a regular expression. If r is a function, each occurrence is replaced with r(s) where s is the matched substring (when pat is a AbstractPattern or AbstractString) or character (when pat is an AbstractChar or a collection of AbstractChar). If pat is a regular expression and r is a SubstitutionString, then capture group references in r are replaced with the corresponding matched text. To remove instances of pat from string, set r to the empty String ("").

  • Perl

    The Perl substring is interesting in that it can be used both on the right or left side of an assignment. So you can use it to return a new string (substr on the right side), or replace a substring in place (substr on the left side). In Fortran this would require a function and subroutine, as suggested by @urbanjost.

I also support the options suggested by @urbanjost:

  • count or repeat (replace only first count appearances)
  • ignore_case
  • occurence (if present, start changing at the Nth occurrence of the OLD string)

but I would prefer to split them into two functions.

@ChetanKarwa
Copy link

Function introduced by @urbanjost can work as an all in one pack for any kind of "replace" operation that a user wants to execute.
I support such an all in one function.

But I have a doubt, in deciding the Nth occurrence.
For eg: if the parent string is "abababababa" and the old string is "aba"

  • What will be the 3rd occurrence of old in this parent
    • abab aba baba
    • abababab aba

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 28, 2021

What will be the 3rd occurrence of old in this parent

  • abab aba baba
  • abababab aba

Python uses a non-overlapping count:

>>> "abababababa".replace("aba","zzz")
'zzzbzzzbzzz'
>>> "abababababa".replace("aba","zzz",1)
'zzzbabababa'
>>> "abababababa".replace("aba","zzz",2)
'zzzbzzzbaba'

which corresponds to your second bullet point.

Edit: The StringiFor library also provides a replace method matching the Python one.

@ChetanKarwa
Copy link

ChetanKarwa commented Mar 28, 2021

If everyone is fine with the idea.
I wish to start working on this issue. I can start off by implementing the function proposed by @urbanjost.

@awvwgk
Copy link
Member

awvwgk commented Mar 30, 2021

Feel free to start working on a patch, @ChetanKarwa . You can either start from the default branch and we will figure out rebasing the patch later, or you could base your feature branch on the patch in #343, which already creates the stdlib_strings module.

Let me know if you need any help with this.

@ChetanKarwa
Copy link

I wanted to ask these questions:

  • Shall I implement the KMP algorithm that searches in O(N) time and O(N) space complexity or a Naive algorithm that takes O(N^2) time complexity and no extra space.
  • Do we need the argument 'cmd' as given in the post by @urbanjost or just keeping old and new is fine enough.
  • How should the function be accessed.
    • replace(parent_str, old_str, new_str)
    • parent_str.replace(old_str,new_str)

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 30, 2021

  • Do we need the argument 'cmd' as given in the post by @urbanjost or just keeping old and new is fine enough.

I suggest you omit cmd for now.

How should the function be accessed.

  • replace(parent_str, old_str, new_str)
  • parent_str.replace(old_str,new_str)

The first one. (For the second version you would need first a string class (#333) which is not available yet. Moreover, the second version would likely just call the first version.) More importantly I think replace will be a function and not a subroutine.

If KMP is not too challenging I think that would be a great start. I guess in practice there will be some trade-offs (e.g. depending on the string sizes, cache size, etc.).

@urbanjost
Copy link

urbanjost commented Mar 30, 2021

I meant to mention not to consider the "cmd" option. I consider the other functionality desirable (ignore case of the old string, replace right to left as well as left to right, some type of selection of which occurrence(s) to replace).

The replace() function mentioned has a lot of history, and was originally part of a line editor so "cmd" preceded having an old
and new string.Although still used in my case I would not recommend it in a new implementation. I used it as an example of time-tested desirable features, not as an implementation model. For the cases it was intended for the volume of data being changed did not require much efficiency, but I can see use cases where efficiency would be desirable,

A KMP implementation would potentially be reusable for other functions ( I would be curious to time it against INDEX(3f) in different compilers too) as well, and seems worth implementing. I would be curious if anyone has a known use case for processing large amounts of data with replace(3f) or for searching for words efficiently in general in other proposed stdlib procedures.

I would implement a request to search from right to left with a logical called BACK to be consistent with intrinsics such as INDEX(3f), VERIFY(3f), and SCAN(3f) for consistency instead of using the sign of the occurrence.

@arjenmarkus
Copy link
Member

An implementation of "replace all" should be careful with overlapping strings. Consider the following example with @aman-godara 's option 2:
The string qwqwq and replacing qwq by abc to give abcabcw.

If the replacing string is qwqwq you would get an infinitely long string.

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A"
string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I find it very convenient in my Tcl programs and I could probably use it in Fortran as well.

@ChetanKarwa
Copy link

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A"
string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

@arjenmarkus
Copy link
Member

arjenmarkus commented Mar 31, 2021 via email

@ivan-pi
Copy link
Member Author

ivan-pi commented Mar 31, 2021

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:
set a "A B C A"
string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

You can find a complete explanation in the Tcler wiki: https://wiki.tcl-lang.org/page/string+map

Translated to Fortran it might look like

a = "A B C A"
b = replace(a,[map("A","AA"),map("B","BB")])

It is a more powerful version of replace which allows to map multiple substrings in one pass. The rule to prevent overlap issues is

Replacement is done in an ordered manner, so the key appearing first in the list will be checked first, and so on. string is only iterated over once, so earlier key replacements will have no effect for later key matches.

But this should go to a future PR.

@urbanjost
Copy link

although TCL calls them keys and values think of the map list as just a set of old1,new1 old2,new2, old3,new3 so that would be
equivalent to multiple calls to REPLACE(3f) assuming REPLACE(3f) is not elemental, except that it is all done in one pass with respect to the original string I think. Actually not sure which way TCL did that, and thought TCL would retain the spaces so it has been too long since I used TCL but either way it raises the question of allowing multiple pairs or old and new values in one call, and if you did would the replacements be cumulative or not?
,,
So if you started with the string "AB" and said you wanted to replace "A" with "B" and "B" with "XXX" if you did it one pair at a time and used the output of previous changes as input to new changes (like you called REPLACE(3f) twice) you would get
AB=>BB by the first replacement, and then BB=>XXXXXX; but if you did all replacements relative to the original input string you would get "BXXX". So if you allow multiple OLD,NEW pairs you have at least those two possible outcomes to consider. Again, I have not used TCL in a long time, but irregardless if you allow multiple pairs you have to decide if the result should be identical to multiple calls to REPLACE(3f) or if all the substitutions would be in reference to the original string, more like variable name substitution in the bash(1) shell, for example.

@aman-godara
Copy link
Member

aman-godara commented May 31, 2021

I wish to work on this issue. Please assign this to me.

Also, we have intrinsic index function in fortran and some of the above mentioned features like replace_all and getting the index of the 3rd occurrence of the substring in the input string when implemented efficiently require manipulating the index function internally. So we will have to implement index function from scratch.
Also, @awvwgk pointed out that we cannot name this new function as index. I think may be because its features conflicts with the intrinsic index function.
Please also note: We currently have index function in our stdlib_string_type module as well.

@awvwgk
Copy link
Member

awvwgk commented May 31, 2021

Maybe there is a better name like find_substring or just find instead of index? If the interface turns out similar enough to the intrinsic index function we can think about overloading the index function with this functionality and eventually propose it as addition to the Fortran standard.

@aman-godara
Copy link
Member

Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib. I decided to implement replace_all.
I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos argument at this moment because the same operation can be done using a combination of slice (already added to stdlib) and find (under review).
But I added replace_overlapping feature to replace_all because this cannot be achieved in any other way.

But tcl's replace_map functionality is still not added in this replace_all.
I wonder what would the output be for Tcl's replace_map function if the input string is "wababw" and the input map is {"abab" -> "pqrs", "ba" -> "ds"}
will it be
Option 1). "wpqrsdsw"
Option 2). "wpqrsw"?

Taking example from @urbanjost's comment.
I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.

@arjenmarkus
Copy link
Member

arjenmarkus commented Jun 16, 2021 via email

@aman-godara
Copy link
Member

I tried an online compiler for Tcl (by tutorials point) to understand more about the behaviour of the function.

Here is the code:

set a "wababw"
set a [string map {"abab" "ppp" "abw" "tt"} $a]
puts $a

output is: wpppw.

It confirmed that overlapping substitutions are avoided.

Also, one thing to note is when I tried

set a "a b c a"
set a [string map {"a" "aa" "b" "bb"} $a]
puts $a

Output retained the spaces: "aa bb c aa" for this online compiler (it says Tcl v8.6.6) and tried few other similar operations to conclude that the length of the output may not be equal to that of input.

Conclusion:
The "hacky way" discussed in here will work everytime to deliver the behaviour of Tcl's replace_map functionality, only hurdle is to very carefully come up with dummy character(s).

But IMO there can be other ways to deliver replace_map functionality. Inspiration Aho_Corasick_Algorithm but not exactly this.

@awvwgk awvwgk added topic: strings String processing and removed topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... labels Sep 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Discussion on a specific API for a proposal (exploratory) easy Difficulty level is easy and good for starting into this project topic: strings String processing
Projects
None yet
Development

No branches or pull requests

6 participants