Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function to pad string with zeros #394

Closed
ivan-pi opened this issue Apr 23, 2021 · 15 comments · Fixed by #689
Closed

Function to pad string with zeros #394

ivan-pi opened this issue Apr 23, 2021 · 15 comments · Fixed by #689
Labels
easy Difficulty level is easy and good for starting into this project good first issue Good for newcomers idea Proposition of an idea and opening an issue to discuss it topic: strings String processing

Comments

@ivan-pi
Copy link
Member

ivan-pi commented Apr 23, 2021

Description

A function to pad a string with zeros or another symbol to a given width could be helpful in many cases, especially together with the to_string() function for integer to string conversion.

Currently available methods to achieve this are using concatenation

width = 32
str = "hello"
padded_str = repeat('0',width-len(str)')//str

and perhaps also internal file I/O

write(buffer,'(A32)', blank='zero') "hello"   ! doesn't work for some reason

I propose to add the following functions

! left - padding
function padl(str,width,symbol) result(padded_str)
  character(len=*), intent(in) :: str
  integer, intent(in) :: width
  character(len=1), intent(in), optional :: symbol
  character(len=max(len(str),width) :: padded_str
end function
! right - padding
function padr(str,width,symbol) result(padded_str)
  character(len=*), intent(in) :: str
  integer, intent(in) :: width
  character(len=1), intent(in), optional :: symbol
  character(len=max(len(str),width) :: padded_str
end function

The original string is returned if width < len(str). The default padding symbol could be either the blank symbol ' ' or '0'. Special treatment of a leading sign symbol '+'/'-' might be also desirable.

Prior art

  • Python: zfill
    (Another method to achieve padding in Python is using the string class method .format())
  • Julia: lpad, rpad
  • MATLAB: pad
@milancurcic
Copy link
Member

milancurcic commented Apr 23, 2021

I use this often to left-pad integers with zeros when enumerating filenames, but using the edit descriptors in format.

Do you think a variant of padl and padr that accepts integers would be useful in addition to the ones you proposed?

res = padl(23, 4) ! returns "0023"

Compare it to:

res = padl(to_string(23), 4) ! returns "0023"

(Edit: fixed the snippet above)

I like the proposed names. Rarely there's an opportunity to introduce such short but clear names.

@ivan-pi
Copy link
Member Author

ivan-pi commented Apr 23, 2021

I use this often to left-pad integers with zeros when enumerating filenames, but using the edit descriptors in format.

That's my use case too. Currently I resort to the pattern:

integer :: istep,unit
character(len=32) :: step
write(step,'(I0.8)') istep
open(newunit=unit,file="output"//trim(step)//".txt")

Note your second call should be res = padl(to_string(23),4).

Concerning the name, I thought that padl/padr is more in line with the adjustl/adjustr Fortran intrinsic functions than the prefixed versions like Julia has.

I'm not yet sure if making zero the default padding symbol is the right thing to do. But I would certainly support introducing an integer to zero-padded string function. I use this in practically every code. Maybe borrowing the Python zfill name specifically for this purpose?

function zfill(x,w) result(res)
  integer, intent(in) :: x
  integer, intent(in) :: width
  character(len=:), allocatable :: res
end function

integer :: unit
open(newunit=unit,"output"//zfill(23,4)//".txt)    ! opens "output0023.txt"

@milancurcic
Copy link
Member

Indeed, with the second snippet corrected, the difference is less drastic.

I usually like a separate function with a specific purpose (e.g. zfill), although the same thing can be accomplished with padl with not that much more verbosity. So now I'm split between the two approaches.

@ivan-pi
Copy link
Member Author

ivan-pi commented Apr 23, 2021

Maybe instead of leaving the concatenation to the consumer, we can make an even more specific function like

function filename(name,num,extension) result(res)
character(len=*), intent(in) :: name
integer, intent(in) :: num, width
character(len=*), intent(in) :: extension
character(len=:), allocatable :: res
end function

where we would allow some primitive type of integer formatting (e.g. using something like {:08d} in Python). But at this point, I wonder why not just go back to the string buffer

character(len=14) :: filename
write(filename,'(A,I0.8,A)') "myfile", 23, ".txt"

So I think a general zero-padding function will give greater flexibility.

Irrespective of the default padding symbol, if we overload pad to work for integers, you could use padl(23,4,'0') which is pretty short.

My preferences:

  1. If the default padding character is '0', we also overload it to work for integers, meaning padl(23,4) is equivalent to padl(23,4,'0').
  2. If the default padding character is ' ', we introduce a separate function (called zfill,zpad,padz, enumz,zstr or something similar) specific to the file enumeration scenario.

So now we need some arguments to support/reject the blank symbol.

cc @LKedward @awvwgk

@awvwgk
Copy link
Member

awvwgk commented Apr 23, 2021

Maybe instead of leaving the concatenation to the consumer, we can make an even more specific function like

I'm using something similar for unit testing round-tripping in an IO library of mine where I need some temporary file names rather than scratch units in this particular context:
https://github.com/grimme-lab/mctc-lib/blob/5603d588ac12fd20b6643527d7cf68fe8c05e070/test/test_read.f90#L420-L429
https://github.com/grimme-lab/mctc-lib/blob/5603d588ac12fd20b6643527d7cf68fe8c05e070/test/test_write.f90#L218-L227

Writing a small helper is especially easy in this scenario, because you are padding the integer printout to a fixed length and therefore know exactly the required size of the buffer.


As a first start, we could wrap internal file IO as function similar to to_string but allow the user to specify the format, which gives at least access to all intrinsic formatters:

character(len=:), allocatable :: str
str = to_string(42)  ! str == "42"
str = to_string(42, '(i0.4)')  ! str == "0042"
str = to_string(42, '(z0.4)')  ! str == "002A"

Using to_string(42, '(i0)') will be slower than to_string(42) because of the internal file IO in such an approach. We might have to parse the formatter as well to avoid overflows for to_string(42, '(i0.1000)') or similar.

@urbanjost
Copy link

lenset, stretch, and atleast in M_strings, which duplicate some functionality for historical purposes are used for slightly different use cases, as shown in the examples in the man-pages in the link above.

There are a couple of other routines like uniq(3f) in M_io at the same github site that are related to adding a numeric suffix to a name, although I usually just use an internal WRITE() to add a numeric suffix in a loop when I want to overwrite like

 
CHARACTER(LEN=256) :: FILENAME
INTEGER :: I
I=12
WRITE(FILENAME,'("PREFIX.",i0.3)')i
WRITE(*,*)'OPEN FILE ',trim(FILENAME)
I=1000
WRITE(FILENAME,'("PREFIX.",i0.3)')i
WRITE(*,*)'OPEN FILE ',trim(FILENAME)
end

shows, with newer features like "i0" lets you get leading zeros but still get a good name if you go over the expected range of numbers.

Just some Fortran-based prior art with some variants from what was discussed that were/are needed at various times (put them all together with the intent of doing something like stdlib, but hoping efforts like this will do the work for me!).
where the internal write with a format gives nice leading zeros

@ivan-pi ivan-pi added easy Difficulty level is easy and good for starting into this project good first issue Good for newcomers idea Proposition of an idea and opening an issue to discuss it labels Apr 26, 2021
@ivan-pi
Copy link
Member Author

ivan-pi commented Apr 26, 2021

I made a small demonstration of padding without the optional padding symbol.

module pad_mod
  implicit none
contains
  function padl(str,width) result(res)
    character(len=*), intent(in) :: str
    integer, intent(in) :: width
    character(len=max(len_trim(str),width)) :: res
    res = str
    res = adjustr(res)
  end function
  function padr(str,width) result(res)
    character(len=*), intent(in) :: str
    integer, intent(in) :: width
    character(len=max(len_trim(str),width)) :: res
    res = str
  end function
end module

program test_pad
  use pad_mod
  implicit none
  print *, "Hello"
  print *, padl("Hello",10)
  print *, padr("Hello",12), len(padr("Hello",12))
end program

Output of the program:

$ gfortran pad_mod.f90
$ ./a.out
 Hello
      Hello
 Hello                 12

I'm not really sure what is the purpose of padr with blanks. It is effectively the opposite of trim.

@ivan-pi
Copy link
Member Author

ivan-pi commented Apr 26, 2021

Depending on what our goals are, we could simply introduce pad (left-pad) as the opposite of trim. Just throwing it out on the table.

Right padding would be achieved with adjustl(pad(str,width)) (symmetric with respect to how left-trimming can be currently performed using trim(adjustl(str))).

@ivan-pi
Copy link
Member Author

ivan-pi commented Apr 26, 2021

I also found one previous Fortran implementation of zfill here (GPL Licensed).

@aman-godara
Copy link
Member

aman-godara commented Jun 19, 2021

This is what was proposed in the proposal for GSoC. This was designed with an aim to provide as much functionalities as possible to a user. It might vary from current state of agreement but let me still put it here.

Signature: pad(string, width, pad_with(Optional) = “ ”)
Output: a new string_type object

Function has 2 versions: padl and padr (like adjustl and adjustr)
If the width required after the execution of the function is greater than
len(string) then a new string_type object with one side (either left or right)
padded with pad_with character sequence is returned, otherwise a deep
copy of input string is returned. Note that the length of pad_with must be
greater than 0 and can be greater than 1.
If pad_with has length greater than 1 then there are 2 possibilities of
padding: padding from left to right or padding from right to left. Take a look
at the example to understand this better:

padl(“pad this”, width = 15, pad_with = “!@#”):

1). “!@#!@#!pad this” (padding from left to right)
2). “#!@#!@#pad this” (padding from right to left)
Function will provide both the options to users. Note that pad is different
from adjustr and adjustl also because adjust* provide an inherent trimming in them
whereas pad function doesn’t.

I think it is better to keep size of pad_with to 1 at this moment.
If we agree to truncate the output (if needed) so that it is strictly of the asked length, then IMO it shouldn't be named as pad.

Can we also divide trim into 2: triml and trimr?: but then it would become inconsistent with trim intrinsic ultimately we will have to rename the function.
I think adjustl and adjustr make use of something like padr(triml( string ), len(string)) and padl(trimr( string ), len(string)) respectively where optional argument pad_with is ' ' (1 space).

@ivan-pi
Copy link
Member Author

ivan-pi commented Jun 21, 2021

I think it is better to keep size of pad_with to 1 at this moment.

I agree. If we can reach agreement on the behavior of padding strings where len(pad_with) > 1 is true, we can always add them later in a backward compatible way.

If we agree to truncate the output (if needed) so that it is strictly of the asked length, then IMO it shouldn't be named as pad.

For fixed length strings truncation will happen upon assignment, while allocatable strings will be re-allocated to the result length. We cannot change these rules. But I don't see any issue with the case when len(pad_with) == 1 is true.

I think adjustl and adjustr make use of something like padr(triml( string ), len(string)) and padl(trimr( string ), len(string)) respectively where optional argument pad_with is ' ' (1 space).

I guess these are equal in practice, but at the compiler level the implementation for the intrinsic functions might work slightly differently. The big difference is of course that adjustl and adjustr are functions used when working with fixed-length (or already allocated) strings, and the result length is determined implicitly from the actual string size. They don't offer the possibility to modify the length of the resulting string (used as an expression or before assignment occurs).

Can we also divide trim into 2: triml and trimr?: but then it would become inconsistent with trim intrinsic ultimately we will have to rename the function.

Regarding trim, you can already achieve trimming on both sides:

  • trim(string): right trim
  • trim(adjustl(string)): left trim

One way to condense the interfaces would be to extend trim with an optional argument:

function trim(string,left) result(res)
  character(len=*), intent(in) :: string
  logical, intent(in), optional :: left ! .false. by default
  integer, parameter :: reslen = len_trim(adjustl(string))
  character(len=reslen) :: res

Then we would would have:

  • trim(string): right trim
  • trim(string,left=.true.): left trim

which actually takes three more characters than the currently available method and also does not read sequentially compared to what really happens. If you leave out the keyword like trim(string,.true.) to make it shorter it is just confusing for the person reading the code. Separate functions (i.e. triml/trimr) are a better solution then. For the time being I would just avoid trim until someone opens a dedicated issue about it.

@aman-godara
Copy link
Member

aman-godara commented Jun 21, 2021

I wasn't able to explain myself in the best possible manner, excuse me for that.

By truncate I was referring to the case when a user gives something like this:

padl("    This string is of length 34   ", width = 20)

Then the output " This string is of length 34 " (length 34) is better than the output " This string is o" (truncated output of asked length 20) for pad function. If one decides to return latter then it shouldn't be named as pad IMO.

If we run the padl function as implemented in this comment on this same input, output will be " This string is of length 34" (length 30) which is also performing a truncation but without any loss of information.

@ivan-pi
Copy link
Member Author

ivan-pi commented Jun 21, 2021

Yes, I believe we should not do a "hard truncation", but rather follow the behavior in other languages:

  • Julia:

    Stringify s and pad the resulting string on the left with p to make it n characters (code points) long. If s is already n characters long, an equal string is returned. Pad with spaces by default."

  • MATLAB

    pad(str,numberOfCharacters) adds space characters so the strings in newStr have the length specified by numberOfCharacters. If any strings in str have more characters than numberOfCharacters, then pad does not modify them.

@ivan-pi
Copy link
Member Author

ivan-pi commented Jul 23, 2021

@milancurcic, now that #441 added padl should we still pursue a specific function for zero-filling integers or should we settle on a general format function like #444 that requires parsing a format string?

Using the currently available tools a string like "000123" could be created using:

character(len=6) :: zstr
zstr = padl(to_string(123),6,'0')

compared with the intrinsic internal I/O

write(zstr,'(I0.6)') 123

Of course the major difference is the stdlib functions can be inlined, e.g. unit = open("my_file"//padl(...)), freeing the programmer from the need to declare a character buffer.

Personally, I'd still be interested in having a zfill function like Python does.

@awvwgk awvwgk added the topic: strings String processing label Sep 18, 2021
@thaila027

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
easy Difficulty level is easy and good for starting into this project good first issue Good for newcomers idea Proposition of an idea and opening an issue to discuss it topic: strings String processing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants