Faster to_lower/to_upper implementations #703

jalvesz · 2023-03-24T21:17:41Z

After looking at the implementation for converting strings between lower and upper case I found a method that might be faster than the one currently implemented in stdlib and that should return the same output:

I published the method here https://stackoverflow.com/questions/10759375/how-can-i-write-a-to-upper-or-to-lower-function-in-f90/75838039#75838039 (look at the last answer to_lower_2 and to_upper_2)

thinking about doing a PR to contribute with these implementations, any opinions?

urbanjost · 2023-03-25T08:30:27Z

An FYI that I compared some case conversions and got unexpected speed
variations between compilers and sound some TRANSFER() functions really
slow, disappointing speeds using actual CHARACTER expressions instead of
reverting to ADEs, and all kinds of surprises but your method was the most
consistent. Yours is the same as UPPER3. UPPER-1 is just for a baseline
and assigns output to input. All are functions, although subroutines
can potentially be faster. So yours is the fastest with -O3 with ifort,
but several methods are faster with other compilers, as the gfortran
shows. But if you use INT8 integers (upper7) it ties with the fastest
ones I had with gfortran and still comes in a strong second with ifort. I do
not have the results from the other compilers and do not have access to them
at the moment, but I found upper7 either best or near best so at least on
some platforms using 1-byte integers was a significant gain; albeit it was
slower on others.

The scores for each compiler are relative to the assign. The times are in seconds.
Surprising the difference in speed for just the plain assign between gfortran and
ifort, isn't it?

 $ fpm run .. -flag -O3 --compiler gfortran
 Hello, time_case!
 upper-1    : score: 1000 time: 0.47156000000000003E-1      AbCdEfGhIjKlMnOpQrStUvWxYz
 upper5     : score: 0655 time: 0.71913000000000338E-1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper8     : score: 0655 time: 0.71990999999999694E-1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper7     : score: 0650 time: 0.72481000000000684E-1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper3     : score: 0499 time: 0.94342000000000148E-1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper4     : score: 0499 time: 0.94362000000000279E-1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper2     : score: 0210 time: 0.22387399999999946         ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper1     : score: 0186 time: 0.25304299999999991         ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper6     : score: 0175 time: 0.26923099999999955         ABCDEFGHIJKLMNOPQRSTUVWXYZ
 upper0     : score: 0009 time: 4.7819500000000001          ABCDEFGHIJKLMNOPQRSTUVWXYZ
 $ fpm run .. -flag -O3 --compiler ifort
 [100%] Project compiled successfully.
  Hello, time_case!
  upper-1    : score: 1000 time: .6184600000000001E-01       AbCdEfGhIjKlMnOpQrStUvWxYz
  upper3     : score: 0779 time: .7930900000000030E-01       ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper7     : score: 0702 time: .8806700000000056E-01       ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper2     : score: 0321 time: .1925020000000002           ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper4     : score: 0262 time: .2352109999999996           ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper6     : score: 0236 time: .2616720000000008           ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper1     : score: 0027 time: 2.277227000000000           ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper5     : score: 0027 time: 2.247931000000000           ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper8     : score: 0027 time: 2.246852000000001           ABCDEFGHIJKLMNOPQRSTUVWXYZ
  upper0     : score: 0014 time: 4.220628000000000           ABCDEFGHIJKLMNOPQRSTUVWXYZ

So this is your method except with some compilers I found wp=>int8 sped things up

pure function upper7(str) Result(string)
use,intrinsic :: iso_fortran_env, only : wp=>int8
Character(*), Intent(In) :: str
Character(LEN(str))      :: string

integer(kind=wp), parameter :: ADE_A = iachar('a'), ADE_Z = iachar('z')
integer(kind=wp), parameter :: CASE_DIFF = iachar('a')-iachar('A')
Integer(kind=wp)            :: ADE_char
Integer                       :: i
   do i = 1, len(str)
      ADE_char = iachar(str(i:i),wp) ! ASCII Decimal Equivalent 
      if (ADE_char >= ADE_A .and. ADE_char <= ADE_Z) ADE_char = ADE_char - CASE_DIFF
      string(i:i) = achar(ADE_char)
   enddo
end function upper7

The biggest surprise to me with variants of yours was that initially just doing string=str
and then only changing the lower-case characters was much slower than just doing
it for all characters.

Another surprise was Number 8, which uses a select, was a top performer with gfortran, and pretty dismal with ifort.

It would be interesting to see timing differences with crayfort, nagfor, nvfortran, ...

jalvesz · 2023-03-25T10:05:08Z

This is very interesting! When I was working on fast string to double conversion I also found that using 1-byte integers gave me a boost in performance. Would you mind sharing your benchmark fpm project? I can test on windows with ifort19, ifort23, ifx23 and on Linux with ifort19 and gfortran. I'm trying to get nvfortran, not there yet.

urbanjost · 2023-03-25T12:57:00Z

nothing fancy. Most interesting thing to me is difference in performance between compilers and compiler options for do concurrent, select, a basic assign, ADE integers versus CHARACTER, .... some compilers that are often the fastest for numeric calculations to very poorly with I/O, is one observation. So what might seem like a great algorithm developed with one compiler can be very bad with another. For whatever reasons the one you used does not seem to be horrible with anyone I tested with, and is either at the top or near it.

wget http://urbanjost.altervista.org/REMOVE/time_case.tgz

PS: not all the compilers actually do the DO CONCURRENT in parallel at all, or require special compiler options to do so.

jalvesz · 2023-03-26T11:09:19Z

Here a full comparison using your benchmark with the compilers I have at hand (increased the loop to 10000000, otherwise some would come out with elapsed time = 0)
https://docs.google.com/spreadsheets/d/1C4HUHNXtSj-qZ2JnQw2B_pG6JE9TCYdIN5fhuXQlTsU/edit?usp=sharing
It was also interesting to see that with ifort upper7 came on top, but there is quite a difference between ifort19 on windows and ifort19 on linux(wsl) (being the same machine).
Upper10 (same as stdlib) came on top only with ifx23.

urbanjost · 2023-03-26T18:27:43Z

Just looking at reasonable contenders using ifort, ifx, and gfortran with optimization on Linux upper7 (your case with int8 kind) was a clear winner for consistency and performance. Although several others outperformed it on compiler or in some cases (very long strings, string needing no change, string needing all characters changed) it is a clear winner with the ones tested.

The only one substantially beating it is if the DO CONCURRENT one is run in parallel with strings much longer than typical text lines. Since upper7 is no worse than 60 percent of the benchmark speed ( a simple assign), and is the most consistent assuming something is desired that is portable across compilers it looks good.

The rather interesting information is that what is a top performer on one compiler is many times slower with another with very short simple procedures, and that using ADE representations is usually much faster even all these years after CHARACTER variables were introduced, and some bugs encountered with arrays of function pointers across compilers that ended up making it just easier to do brute-force tests (a more sophisticated test exists but it encounters a lot of unrelated compiler issues).

It would be of particular interest to compare the stdlib version (upper10) and your proposed replacement (upper7) and see how it goes with other compilers as where the stdlib version performed very significantly worse was with ifort, which is being supplanted by ifx.

jalvesz · 2023-03-26T19:18:02Z

Got my hands on nvfortran23 in wsl following the install from https://developer.nvidia.com/hpc-sdk-downloads and reran the benchmark. Didn't check for any special compiler directive, just used -03 as for the others. The results are in the same google spreadsheet.

Have to check what will happen if I offload the computation to the GPU... And I think the do concurrent implementation can be modified.

Results change a bit from one run to another but indeed upper7 comes as quite a robust approach. With nvfortran added upper7 with do concurrent, results are slight better but close (I'm betting more on some randomness than an actual improvement) not sure this actually gives any advantage for such short test.

urbanjost · 2023-03-27T01:08:09Z

The upper6-style tests and code itself indicate it is not particularly sensitive to input being
all uppercase or all lowercase, being unusually long, or input strings
being all "a" versus all "z" characters so it does appear to be robust.

One of the oddest things found in exercising the variants is finding how
SELECT is so much slower than other conditionals with one compiler. The
three very similiar routines in the attached sample program produce almost
identical timing results with several compilers, but vary by a factor of
50x with one.

For instance using "gfortran-11 -O3 xx.f90" generates nearly identical speeds

upper1 0.12178899999999999   GDSTBHWGEJXIFQZCHODHJHGPPULKHYIUOZKSQKTDFIEXGULFKISDYOGLDYECVFFVIEBXIQ
upper1 0.11874299999999999   EMRWHTBCIFFGFIIFCYFFIHXETAHVSHJWWBDDXGYMIQUZNRYZZLHGLTIHOIXWOKOUIEITXS
upper1 0.11874199999999999   FIITOWKRARTRPZRBHKKBXJUYNETBFVAEFAGGEAUWMULVISLYGBOWIDBQCADEWVVXKTXLKY
upper2 0.11873000000000000   JKTFUTMWELFECCATZWYVMBWQNIMYIUJZTPAYBDZXWRERVRFLQQPMEUJUINBCXIWUZIQKEF
upper2 0.11889000000000000   NEDVFZXVQOHILVNWTDWOYIKYCWABLWVOFQVXUOBNYERZLFEPMYKRCVNCCMUBEAJRJFVJVW
upper2 0.11880399999999991   HPWHTMEYFJCHMUDCTUEHWDLSXFRDUFIMVDUKRZFTJEWQSCSVRSDVVIEDPNNNCRWSECAXNB
upper3 0.11881100000000000   FZVYQAKJANYFWUSKOMRRBVBMGODZYXRCEMVTFEWHUHVPSCKSTUMXOFPHASOQSXHTBMEAEQ
upper3 0.11873899999999993   UGOHQOXEDONSRNTGSHWASEFNDSIUAQCHQWCBPYVVWFHRVMEUKKSJXWHNSLUBODDCUBXTDP
upper3 0.11875000000000013   DCQXXWFCEBJNLVWJGBDMOZEUTXEWTHDIDQRXIRKYVOTLSMSSQNOSDINMCQRWVPXRTCQOTU

But with one compiler testing with more complex cases shows a continuing
lack of optimization in SELECT CASE I so far have not seen with any
other compiler. Curious if that is the case with any other compilers.

upper1 .1277940000000000   IBOYRMGCBPRTKMNIHWYAWQFJVBBDTZKFHYMIYTOYHQMOWLICNJRSXFGKFKNPTISXKIKXFM
upper1 .1221970000000000   FDGRGVREJPKUKOJTFTNLFTHFQITLCRYUEFLWUXBGMIKOWWSHNOMQPQJOTRALYCSICVKBXI
upper1 .1221530000000000   YKFXWOKTDYEKMJVGYNSQBEYFVGPVORGQWXCVVMROURMUUGDSOJJFGWGFGJSIRKLJVTRWVZ
upper2 4.660762000000000   JQISSDTHYABHYPTLNNRKJRICWQVXCLXCLJDLGSOZQFSQUUUXOMWSBUHIEIEUDBDYNRMXQN
upper2 4.387915000000000   DMJORPNRNPGOZRUYMOVJNNIVKJIRVSSYBJSVPGDFHWUKYLZPRDYJWVJCCXGBBBCLNOCQOT
upper2 4.302734000000001   IIMFYEFONONPFYHCGRUFHZRMNRPCKBWHOUYEVNBJJUHRXYXUDKEXJVTXJKTTDMRLYCIGTH
upper3 .5937310000000000   NROPNTNQAPRATPGPBTBALEGKQORSLAGFTDIRWMOLFDADTIUNUIWDHGHMIBUJBJSBQCMDNU
upper3 .4668369999999999   RIHVWCUQKFPYVNWJSPSXDJZYTHXXSEMPZWPEWIWYEPWEQMPLWTHYAFCJCMPTFSVKUHMYEZ
upper3 .5521539999999998   EYNGKUDPMBPHHBIITVWQTUSYEPIWQVBVDTMLTJBRIMKNDLVKDOEMNHXVHCDJBSFOIBQNBT

[details="Example reproducer code"]

module M_upper
   use, intrinsic :: iso_fortran_env, only: stdout => output_unit, stderr => error_unit, stdin => input_unit
   use, intrinsic :: iso_fortran_env, only: int8, int16, int32, int64, real32, real64, real128
   use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
   implicit none
   private
   public :: say_hello, random_string, upper1, upper2, upper3
contains

   subroutine say_hello
      print *, "Hello, time_case!"
      print '(4a)', &
         'This file was compiled by ', &
         compiler_version(), &
         ' using the options ', &
         compiler_options()
   end subroutine say_hello

   pure function upper1(str) result(string)
      character(*), intent(in)      :: str
      character(len(str))           :: string
      integer                       :: i
      integer(kind=int8), parameter :: ade_a = iachar('a'), ade_z = iachar('z')
      integer(kind=int8), parameter :: diff = iachar('A', kind=int8) - iachar('a', kind=int8)
      integer(kind=int8)            :: ade_char
      do concurrent(i=1:len(str))
         ade_char = iachar(str(i:i), int8)
         if (ade_char >= ade_a .and. ade_char <= ade_z) ade_char = ade_char + diff
         string(i:i) = achar(ade_char)
      end do
      if (len(str) .eq. 0) string = str
   end function upper1

   pure function upper2(str) result(string)
      character(*), intent(in)      :: str
      character(len(str))           :: string
      integer                       :: i
      integer(kind=int8), parameter :: diff = iachar('A', kind=int8) - iachar('a', kind=int8)
      do concurrent(i=1:len(str))
         select case (str(i:i))
         case ('a':'z'); string(i:i) = achar(iachar(str(i:i), kind=int8) + diff)
         case default; string(i:i) = str(i:i)
         end select
      end do
      if (len(str) .eq. 0) string = str
   end function upper2

   pure function upper3(str) result(string)
      character(*), intent(in)      :: str
      character(len(str))           :: string
      integer                       :: i
      integer(kind=int8)            :: ch
      integer(kind=int8), parameter :: diff = iachar('A', kind=int8) - iachar('a', kind=int8)
      integer(kind=int8), parameter :: ade_a = iachar('a'), ade_z = iachar('z')
      do concurrent(i=1:len(str))
         ch = iachar(str(i:i), kind=int8)
         select case (ch)
         case (ade_a:ade_z); string(i:i) = achar(ch + diff)
         case default; string(i:i) = str(i:i)
         end select
      end do
      if (len(str) .eq. 0) string = str
   end function upper3

   function random_string(chars, length) result(out)
      character(len=*), intent(in)     :: chars
      integer, intent(in)              :: length
      character(len=:), allocatable    :: out
      real                         :: x
      integer                      :: ilen
      integer                      :: which
      integer                      :: i
      ilen = len(chars)
      out = ''
      if (ilen .gt. 0) then
         do i = 1, length
            call random_number(x)
            which = nint(real(ilen - 1)*x) + 1
            out = out//chars(which:which)
         end do
      end if
   end function random_string

end module M_upper
program main
   use, intrinsic :: iso_fortran_env, only: real64
   use M_upper, only: upper1, upper2, upper3, random_string,say_hello
   implicit none

    !! define an abstract template defining the procedures
   abstract interface
      function func (str)
         character(len=*), intent (in) :: str
         character(len=len(str))       :: func
      end function func
   end interface

   !! define a pointer of the abstract type
   procedure (func), pointer :: f_ptr => null()

   call say_hello()

      f_ptr => upper1; call timeit('upper1')
      f_ptr => upper2; call timeit('upper2')
      f_ptr => upper3; call timeit('upper3')

contains

subroutine timeit(name)
character(len=*),intent(in) :: name
character(len=:),allocatable :: str, in
integer,parameter            :: calls=1000000
integer                      :: i, j
real(kind=real64)            :: time_start, time_finish
character(len=*),parameter   :: gen='(*(g0,1x))'
   ! use some random calls to help prevent compiler from optimzing away loops
   do j=1,3
      call setin(j,in)
      call cpu_time(time_start)
      do i = 1, calls
         str = f_ptr(in)
         if(i.eq.calls/2) call setin(j,in)
      end do
      call cpu_time(time_finish)
      print gen, name, time_finish - time_start, ' ', str
   enddo
end subroutine timeit

subroutine setin(j,dat)
character(len=:),intent(out),allocatable :: dat
integer,intent(in) :: j
   select case(j) 
      case(1)     ;dat = random_string('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', 70)
      case(2)     ;dat = random_string('abcdefghijklmnopqrstuvwxyz', 70)
      case(3)     ;dat = random_string('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 70)
      case default;dat = random_string('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', 70)
   end select
end subroutine setin

end program main

[/details]

jalvesz · 2023-03-27T06:40:57Z

Yes, this is not the first time I have been disappointed with select case. I do like it for high level branching, in some cases the code is easier to read. but for low level computations I avoid it at all costs. With a colleague we tested it once to replace a long list of IFs statements in an intensive computation and the results were just disappointing. So yes, I just use it for non critical code.

I took your last test and tried out:

	gfortran9.4.0 (linux)	gfortran12.2.0 (linux)	ifort19 (linux)	nvfortran23 (linux)	ifort19 (windows)	ifort23 (windows)	ifx23 (windows)
upper1	2.556E-02	2.544E-02	2.851E-02	1.006E-01	3.125E-02	3.125E-02	1.563E-02
upper1	2.499E-02	2.547E-02	2.662E-02	9.864E-02	3.125E-02	3.125E-02	1.563E-02
upper1	2.477E-02	2.598E-02	2.814E-02	1.006E-01	3.125E-02	3.125E-02	1.563E-02
upper2	2.592E-02	2.620E-02	9.370E-01	1.091E-01	1.000E+00	9.219E-01	8.438E-01
upper2	2.466E-02	2.685E-02	7.957E-01	1.304E-01	9.063E-01	7.813E-01	7.813E-01
upper2	2.469E-02	2.598E-02	7.466E-01	9.055E-02	8.438E-01	7.344E-01	7.344E-01
upper3	2.472E-02	2.580E-02	1.064E-01	1.017E-01	1.094E-01	1.094E-01	7.813E-02
upper3	2.488E-02	2.585E-02	8.743E-02	9.393E-02	9.375E-02	7.813E-02	6.250E-02
upper3	2.497E-02	2.574E-02	1.023E-01	8.955E-02	1.094E-01	1.094E-01	6.250E-02

urbanjost · 2023-03-27T13:04:25Z

so nvfortran and gfortran give consistent speeds but the ifort/ifx can be a factor of 50x. That is bad enough for me to look at the generated code and bring it up on the Intel Fortran forum. I saw that in the past when SELECT CASE was a new feature for a lot of compilers but especially for such a simple select that is horrible. Thanks for the tests. The upper1 being the one you proposed it holds up once again as a good general choice. If anyone can the stdlib procedure versus the one proposed here it would be very informative for nagfor, crayftn, ... for stdlib in general it would be nice to have a cd/ci interface to macos, windows, openbsd, linux, ... and various compilers so if you just set up and pushed something set up to do a "fpm test --compiler $NAME --profile release it would run on a lot of platforms. I have a .github directory on https://github.com/urbanjost/easy that was my attempt at that that works with ifort on linux, gfortran on macos, gfortran on linux, and gfortran on several MSWindows configurations that can easily run different gfortran versions as well but I am not a CD/CI wiz on github; that should be able to do nvfortran too.

It seems like there could be an empty github repository that if you pushed a mini fpm project at it it could run a bunch of environments just using "fpm test" so the scripts would not have to change. A simple case like this shows how huge the variation can be in the performance of the same code

jalvesz mentioned this issue Aug 28, 2023

faster to_lower and to_upper #733

Merged

jvdp1 closed this as completed Dec 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster to_lower/to_upper implementations #703

Faster to_lower/to_upper implementations #703

jalvesz commented Mar 24, 2023

urbanjost commented Mar 25, 2023

jalvesz commented Mar 25, 2023

urbanjost commented Mar 25, 2023 •

edited

Loading

jalvesz commented Mar 26, 2023 •

edited

Loading

urbanjost commented Mar 26, 2023

jalvesz commented Mar 26, 2023 •

edited

Loading

urbanjost commented Mar 27, 2023

jalvesz commented Mar 27, 2023 •

edited

Loading

urbanjost commented Mar 27, 2023

Faster to_lower/to_upper implementations #703

Faster to_lower/to_upper implementations #703

Comments

jalvesz commented Mar 24, 2023

urbanjost commented Mar 25, 2023

jalvesz commented Mar 25, 2023

urbanjost commented Mar 25, 2023 • edited Loading

jalvesz commented Mar 26, 2023 • edited Loading

urbanjost commented Mar 26, 2023

jalvesz commented Mar 26, 2023 • edited Loading

urbanjost commented Mar 27, 2023

jalvesz commented Mar 27, 2023 • edited Loading

urbanjost commented Mar 27, 2023

urbanjost commented Mar 25, 2023 •

edited

Loading

jalvesz commented Mar 26, 2023 •

edited

Loading

jalvesz commented Mar 26, 2023 •

edited

Loading

jalvesz commented Mar 27, 2023 •

edited

Loading