Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add module for validating ASCII characters and upper/lower conversion #32

Merged
merged 4 commits into from
Dec 24, 2019
Merged

Conversation

ivan-pi
Copy link
Member

@ivan-pi ivan-pi commented Dec 22, 2019

This PR addresses #11 (there are a few open question left there).

Tested with both the gfortran and Intel Fortran compilers for the 'default' (ascii) character set.

The tests are essentially a port of those at https://github.com/dlang/phobos/blob/master/std/ascii.d (hopefully not a licensing issue?).

@ivan-pi ivan-pi changed the title Added module for validating ASCII characters and upper/lower conversion Add module for validating ASCII characters and upper/lower conversion Dec 22, 2019
@certik
Copy link
Member

certik commented Dec 22, 2019

Would you mind rebasing on top of the latest master to pick up the CI tests?

program test_ascii

use stdlib_experimental_error, only: assert
use stdlib_experimental_ascii
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explicitly import the symbols that are being used? I think we should follow that approach of "explicit imports", as in Python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The reason I noticed this is that I came here to see what the public API actually is. As it is a bit hard to immediately see from the module itself, because there is not a single public line, but rather quite a few symbols are decorated with public.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have private declared at the top of all modules, and below one or more lists of public :: ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zbeekman, I agree, it makes it very easy to see what the public symbols are.

@certik
Copy link
Member

certik commented Dec 22, 2019

So I think this looks really nice. The public API seems nicely done, the naming convention I think is good.

Will the is_* functions work with utf8 strings?

The to_upper and to_lower will only work on the ascii parts of an utf8 string I assume.
Python calls these just upper and lower methods (they work with Unicode). In C, they are called toupper and tolower. Finally in C++, Boost has to_upper and to_lower. There is also a nice table here: #11 (comment)

I think using to_upper is readable.

@certik
Copy link
Member

certik commented Dec 22, 2019

Python also has an isupper method. Should we also implement is_upper and is_lower?

@certik
Copy link
Member

certik commented Dec 22, 2019

General question: should we maintain stdlib_ascii and stdlib_unicode, or should we simply have stdlib_string, and all methods would automatically work with utf8?

I think getting the unicode working (as in Python) will be some work, so we can start with stdlib_ascii, then in another PR we can see how many of these extend to unicode and create stdlib_unicode or merge them. Since this is all in experimental, we can do that.

Things like to_upper just need to be extended to support Unicode (utf8), see the Python documentation. Here is an example in Python:

>>> name = "Ondřej Čertík"
>>> name
'Ondřej Čertík'
>>> name.upper()
'ONDŘEJ ČERTÍK'
>>> name.lower()
'ondřej čertík'

You can see the upper and lower works with Unicode characters (utf8).

@ivan-pi
Copy link
Member Author

ivan-pi commented Dec 22, 2019

I think we should have separate modules for ASCII and Unicode characters. In fact using only the intrinsic Fortran character functions (achar and iachar) it is not possible to find say uppercase Slavic letters č, š, ž... I think it will be necessary to interface with C to achieve Unicode support. A second issue is that some preprocessing will be necessary, as not all Fortran compilers support the extended Unicode character set.

The current ascii module already contains is_upper and is_lower characters. At the moment all functions in the ascii module are limited to work on single characters. They are meant as support functions for a separate string module.

I will rebase and import explicitly the public functions for the test driver asap.

@certik
Copy link
Member

certik commented Dec 22, 2019

@ivan-pi yes, if it is not possible to merge unicode with ascii, then we need two modules.

@certik
Copy link
Member

certik commented Dec 22, 2019

The other thing is --- since you used https://github.com/dlang/phobos/blob/434429f273d0359744b6d3ba9db36d3bef1c7593/std/ascii.d as the original, we have to cite their license.

Overall this looks good to me. It would be nice to get some more reviews on this before we merge. @marshallward, @jacobwilliams, @milancurcic do you have any feedback on the API here?

@milancurcic milancurcic self-requested a review December 24, 2019 01:32
Copy link
Member

@milancurcic milancurcic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @ivan-pi!

Interesting format for comments (some documentation generator?) but fine with me.

Good to merge IMO.

@certik certik merged commit 8ef2f81 into fortran-lang:master Dec 24, 2019
@certik
Copy link
Member

certik commented Dec 24, 2019

@milancurcic thanks for the review.

@ivan-pi would you mind updating whitechar in stdlib_experimental_io to use your new functionality please? That should save some code.

@certik certik mentioned this pull request Dec 30, 2019
@dev-zero
Copy link

since I've been pointed here, this project might be interesting: https://github.com/lemire/fastvalidate-utf-8 althought I don't see how that could be implemented in Fortran given missing inline assembly.

@ivan-pi ivan-pi mentioned this pull request Apr 24, 2020
@certik
Copy link
Member

certik commented Apr 24, 2020

@dev-zero thanks! I made a comment in #11 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants