-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash functions #554
Hash functions #554
Conversation
I have translated a number of hash functions from C or C++ to Fortan90, and have incorporated them into the current stdlib disstribution. [ticket: X]
Updated doc/specs/index.md file to include a reference to the hash functions documentation. [ticket: X]
The hash functions require integer values expressed as hexadecimals that map to negatve intgers under normal unsigned integer to twos complement signed integer conversions. This is undefined under the Fortran Standard but, is the default for gfortran 10 and 11 and all recent versions of ifort, but requires the compiler flag, "-fno-range-check" for gfortran 9 and earlier. [ticket: X]
Corrected spelling Corrected spelling.
I fixed mistakes in the stdlib/Makefile.Manual and stdlib/src/Makefile.Manual. [ticket: X]
Changed filenames in stdlib/src/tests/hash_functions/Makefile.manual [ticket: X]
Thank you @wclodius2 for this PR. I will try to review and test it next week. |
Great! |
Changed the comments to: 1. Note that the algorithm was version 2. 2. Correct the comment on the underlying algorithm for NMHASH32X_9to255 gfortran under strict checking didn't like specifying the arguments to NMH_READLE32 and NMH_READLE16 as p(:) and preferred P(1:4) and p(1:2) respectively. NMHASH32X_9to255 was not as endian independent as I liked, so I changed some of its code. [ticket: X]
!! of the following processors are known to implement the required Fortran | ||
!! 2008 extensions and default to runtime wrap around overflow: FLANG, | ||
!! gfortran, ifort, and NAG Fortran. Older versions of gfortran will require | ||
!! the compiler flag, `-fno-range-check`, to ensure wrap around semantics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which versions of gfortran require the flag? I think stdlib might only support gfortran 9+ -- if those versions do not need it, then I would lean toward removing it.
While perhaps not a big deal, it could lead to complications (e.g. if one compiles stdlib with user-specified flags, this could be missed, so might require documentation in the main README or similar).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Version 9 and earlier require the flag.
@@ -21,6 +21,7 @@ if(CMAKE_Fortran_COMPILER_ID STREQUAL GNU) | |||
endif() | |||
add_compile_options(-fimplicit-none) | |||
add_compile_options(-ffree-line-length-132) | |||
add_compile_options(-fno-range-check) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment in the documentation about this -- if it isn't needed for gfortran 9+ it might be better not to have this -- otherwise if it is needed, we should consider whether it needs documenting in the README (as it is easy to compile stdlib with user-specified flags, which could miss this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is required for version 9.
@@ -1,7 +1,7 @@ | |||
# Fortran stdlib Makefile | |||
|
|||
FC ?= gfortran | |||
FFLAGS ?= -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all | |||
FFLAGS ?= -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all -fno-range-check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per previous comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my previous comments.
integer(int64) :: i, j, r | ||
|
||
! base mixer: [f0d9649b 5 -13 29a7935d -9 11 55d35831 -20 -10 ] = | ||
! 0.93495901789135362 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is from the original code. I think it implies that the code is equivalent to multiplying by the floating point number and rounding (down?).
integer(int32), intent(in) :: seed | ||
integer(int32) :: hash | ||
|
||
! [bdab1ea9 18 a7896a1b 12 83796a2d 16] = 0.092922873297662509 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see it's related to the operations below. Perhaps just a brief initial sentence here and elsewhere, to aid those of us without experience in hashing.
|
||
select case(remainder) | ||
case(15) | ||
go to 115 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this situation, is there any natural alternative to the goto's? I do not wish to be dogmatic about it -- but if it can be avoided without much complication, that is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corresponding C code was implemented with switches with no breaks. I could implement it with a computed go to but that is not encouraged in the current language. I could also implement it with a sequence of if statements, but that introduces some overhead for unnecessary tests, unless the compiler is very smart about optimization.
@@ -0,0 +1,190 @@ | |||
program test_32_bit_hash_performance | |||
!! Program to compare the relative performance of different 32 bit hash | |||
!! functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there also tests of correctness? We would want the tests to fail if a bug is introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tests for correctness that compare with the output of the corresponding C codes, but the tests are outside the standard library as they require invoking C and C++ compilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I am not an expert on hashing, but have a couple of questions. However my general impression is that the documentation is impressive, as is the functionality.
I am not sure though whether the testing is sufficient (i.e. if we introduced some bugs, would failures be detected?). This might just reflect my limited knowledge of hashing.
I can add the validation test codes to the distribution, but they will not be compiled for the Standard Library as:
|
doc/specs/stdlib_hash_functions.md
Outdated
The `stdlib_32_bit_hash_functions` module provides procedures to | ||
compute 32 bit integer hash codes and a scalar hash. | ||
The hash codes are useful for tables of up to `2**15` entries, and | ||
for keys with a few hundred elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this. I think the first part means useful for indexing hash tables with up to 2**15 entries
? But I'm not sure what keys
are -- is this the object for which we compute a hash (so an array of integers or a character string)? Likely some slight rewording will make that clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be changed. Keys are what is hashed. The true identifier of the data. I have been testing my hash tables with these codes and they seem to be working well of 2**16
random elements. Let me update the documentation after a little more feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the discussion to try to clarify the issues.
Sure. IMO it's important to have some tests that will fail when bugs are introduced. But I'm not sure what the best approach is. A few possibilities:
|
The current validation tests are computationally intensive, and may raise some objections if they are run for each installation. I am open to allowing C/C++ in the repository, but then, for some of the library codes, maybe they should be replaced by wrappers to original C, C++, etc. procedures. Regression tests normally verify whether changes in the code results in changes from the original, and not whether the original is correct. I guess we could generate binary files from the original C/C++ codes and compare against them. However I have been using a random number generator to generate my current test keys to ensure variability in the keys. I suppose we could also have a binary file of keys to test against, but we might need separate files for little endian versus big endian processors. The third option would be easier on me. |
Needed to deal with the changes in the specification of the version of fpm to be used in processing the branch.
Gareth found some of the discussion ambiguous so I tried to make it clearer. [ticket: X]
Added the directory, validation, that contains a makefile and source code for three applications to be run in sequence to test the Fortran versions of the hash functions in libstdlib.a.against the original C/C++ versions. [ticket: X]
Fixed the suffix of the README file, and slightly modified the contents. [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
[ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
what happened [ticket: X]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are my first questions/comments/suggestions.
First this is a impressive job, @wclodius2 .
I didn't interpret the standard on the bit model like you. I asked a few questions for more details.
I have managed to mess-up the conversion to Personal Access Tokens. I have validation codes for the hash functions that I am unable to push at the moment. Once I am back to being able to push I will address Jeremie's response. |
I am having problems with the repository. I have had trouble with the conversion to the Personal Access Token, and have upgraded to a MacBook with an M1 processor. The upgrade has forced me to incorporate extensive changes made in the library (by Sebastian Ehlert) to deal with the lack of quad precision for Fortran on the M1 processor. In order to incorporate the revisions and try to work around the problems that started with the PAT change, I decided to clone the current version of the repository and manually incorporate the changes I made to implement hash functions. So I entered the following:
I then compiled the current standard library's code with the M1 capable version of Fortran, and all went well. I then proceeded to incorporate my hash function code (using add and commit with no reported problems) and it compiled properly. The time then came to push the changes, so I did the following:
and got the unexpected response
A cursory search on the internet did not indicate to me how to proceed. DO you have any ideas or know of anyone that might know how to proceed to to start a PR on the hash_functions2 branch? FWIW
|
@wclodius2 The url for the remote "william" is wrong. To modify it, inside the git remote set-url william [email protected]:wclodius2/stdlib.git then you should be able to push your branch to git push william hash_functions2 I hope that this helps. |
It did thanks. I have updated the code to reflect my response to the comments of Jeremie (@jvdp1), created a new branch hash_functions2, and generated a new pull request. Sorry for the new branch and pull request. I will be closing this PR in a couple of days after people have a chance to respond to my responses for my requests for changes, and the generation of the new PR. |
Thank you @wclodius2 for your answers. I left a few other comments. @gareth-nx could you check @wclodius2 's answers to your comments, please? If they satisfy you, let us know, please. Then I will close this PR in favour of #573 |
Hi @jvdp1 and @wclodius2 I'm happy with the answers to my comments. To my understanding the biggest outstanding issue is what should be in the test-suite (although I know @wclodius2 has made some additional changes related to that, which I haven't checked). So it's fine with me if we close this. I actually started doing a review in the other PR, and added a comment just now linking to this discussion, so we can find it easily. |
Hi @gareth-nx, for the validation code check out
Note: The makefile, |
Thank you all. I will close this PR in favour of #573 |
I have created a number of hash functions for incorporation in the Fortran standard library by translating from C or C++ a number of algorithms recommended by Reini Urban who maintains a well respected code,
SMHasher
, for testing the properties and performance of hash functions. The codes can hash default character strings and integer vectors. The hash functions are:fibonacci hash
: a scalar hash used to map 32 or 64 bit integers to power of two sized hash tables.FNV-1
andFNV-1a
: simple hash functions with excellent performance for small keys, whose poor statistical properties appear to have only a minor impact on the performance of small hash tables.nmhash32
andnmhash32x
: 32 bit hashes with excellent statistical properties, but whose performance, so far, seems to be poor.water_hash
: a 32 bit hash function with excellent statistical performance and very good performance on large keys.pengy hash
: a 64 bit hash function with excellent statistical performance and good performance on large keys.SpookyV2 hash
: a 128 bit hash with excellent statistical performance and excellent performance on large keys.I have other hash functions that can potentially be incorporated in the library. Of these the most interesting is probably
WY hash
a 64 bit hash function with excellent statistical properties and performance, but a reputation for many bad seeds that I haven't figured out how to avoid in its seed generator. I also have Larsen and Sedgewick hashes as possible replacements/alternatives to the FNV hashes if any one can show that they are public domain. Lookup3, murmur2, and murmur3 are high performance on large keys, but SMHasher claims they have poor statistics and large numbers of bad seeds. Mir hash has poor performance and bad seeds. MX3 has poor performance.