Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash functions #554

Closed
wants to merge 37 commits into from
Closed

Conversation

wclodius2
Copy link
Contributor

@wclodius2 wclodius2 commented Oct 19, 2021

I have created a number of hash functions for incorporation in the Fortran standard library by translating from C or C++ a number of algorithms recommended by Reini Urban who maintains a well respected code, SMHasher, for testing the properties and performance of hash functions. The codes can hash default character strings and integer vectors. The hash functions are:

  • fibonacci hash: a scalar hash used to map 32 or 64 bit integers to power of two sized hash tables.
  • 32 and 64 bit versions of the FNV-1 and FNV-1a: simple hash functions with excellent performance for small keys, whose poor statistical properties appear to have only a minor impact on the performance of small hash tables.
  • nmhash32 and nmhash32x: 32 bit hashes with excellent statistical properties, but whose performance, so far, seems to be poor.
  • water_hash: a 32 bit hash function with excellent statistical performance and very good performance on large keys.
  • pengy hash: a 64 bit hash function with excellent statistical performance and good performance on large keys.
  • SpookyV2 hash: a 128 bit hash with excellent statistical performance and excellent performance on large keys.
  • A number of random seed generators for the above hashes.

I have other hash functions that can potentially be incorporated in the library. Of these the most interesting is probably WY hash a 64 bit hash function with excellent statistical properties and performance, but a reputation for many bad seeds that I haven't figured out how to avoid in its seed generator. I also have Larsen and Sedgewick hashes as possible replacements/alternatives to the FNV hashes if any one can show that they are public domain. Lookup3, murmur2, and murmur3 are high performance on large keys, but SMHasher claims they have poor statistics and large numbers of bad seeds. Mir hash has poor performance and bad seeds. MX3 has poor performance.

I have translated a number of hash functions from C or C++ to Fortan90, and
have incorporated them into the current stdlib disstribution.

[ticket: X]
Updated	doc/specs/index.md file to include a reference to the hash functions
documentation.

[ticket: X]
The hash functions require integer values expressed as hexadecimals that map
to negatve intgers under normal unsigned integer to twos complement signed
integer conversions. This is undefined under the Fortran Standard but, is the
default for gfortran 10 and 11 and all recent versions of ifort, but requires
the compiler flag, "-fno-range-check" for gfortran 9 and earlier.

[ticket: X]
Corrected spelling

Corrected spelling.
I fixed mistakes in the stdlib/Makefile.Manual and stdlib/src/Makefile.Manual.

[ticket: X]
Changed filenames in stdlib/src/tests/hash_functions/Makefile.manual

[ticket: X]
@jvdp1 jvdp1 added reviewers needed This patch requires extra eyes topic: algorithms searching and sorting, merging, ... labels Oct 20, 2021
@jvdp1
Copy link
Member

jvdp1 commented Oct 20, 2021

Thank you @wclodius2 for this PR. I will try to review and test it next week.

@wclodius2
Copy link
Contributor Author

Great!

Changed the comments to:
1. Note that the algorithm was version 2.
2. Correct the comment on the underlying algorithm for NMHASH32X_9to255

gfortran under strict checking didn't like specifying the arguments to
NMH_READLE32 and NMH_READLE16 as p(:) and preferred P(1:4) and p(1:2)
respectively.

NMHASH32X_9to255 was not as endian independent as I liked, so I changed some
of its code.

[ticket: X]
!! of the following processors are known to implement the required Fortran
!! 2008 extensions and default to runtime wrap around overflow: FLANG,
!! gfortran, ifort, and NAG Fortran. Older versions of gfortran will require
!! the compiler flag, `-fno-range-check`, to ensure wrap around semantics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which versions of gfortran require the flag? I think stdlib might only support gfortran 9+ -- if those versions do not need it, then I would lean toward removing it.

While perhaps not a big deal, it could lead to complications (e.g. if one compiles stdlib with user-specified flags, this could be missed, so might require documentation in the main README or similar).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version 9 and earlier require the flag.

@@ -21,6 +21,7 @@ if(CMAKE_Fortran_COMPILER_ID STREQUAL GNU)
endif()
add_compile_options(-fimplicit-none)
add_compile_options(-ffree-line-length-132)
add_compile_options(-fno-range-check)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment in the documentation about this -- if it isn't needed for gfortran 9+ it might be better not to have this -- otherwise if it is needed, we should consider whether it needs documenting in the README (as it is easy to compile stdlib with user-specified flags, which could miss this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is required for version 9.

@@ -1,7 +1,7 @@
# Fortran stdlib Makefile

FC ?= gfortran
FFLAGS ?= -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all
FFLAGS ?= -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all -fno-range-check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per previous comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous comments.

integer(int64) :: i, j, r

! base mixer: [f0d9649b 5 -13 29a7935d -9 11 55d35831 -20 -10 ] =
! 0.93495901789135362
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is from the original code. I think it implies that the code is equivalent to multiplying by the floating point number and rounding (down?).

integer(int32), intent(in) :: seed
integer(int32) :: hash

! [bdab1ea9 18 a7896a1b 12 83796a2d 16] = 0.092922873297662509
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see it's related to the operations below. Perhaps just a brief initial sentence here and elsewhere, to aid those of us without experience in hashing.


select case(remainder)
case(15)
go to 115
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this situation, is there any natural alternative to the goto's? I do not wish to be dogmatic about it -- but if it can be avoided without much complication, that is good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corresponding C code was implemented with switches with no breaks. I could implement it with a computed go to but that is not encouraged in the current language. I could also implement it with a sequence of if statements, but that introduces some overhead for unnecessary tests, unless the compiler is very smart about optimization.

@@ -0,0 +1,190 @@
program test_32_bit_hash_performance
!! Program to compare the relative performance of different 32 bit hash
!! functions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there also tests of correctness? We would want the tests to fail if a bug is introduced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tests for correctness that compare with the output of the corresponding C codes, but the tests are outside the standard library as they require invoking C and C++ compilers.

Copy link
Contributor

@gareth-nx gareth-nx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I am not an expert on hashing, but have a couple of questions. However my general impression is that the documentation is impressive, as is the functionality.

I am not sure though whether the testing is sufficient (i.e. if we introduced some bugs, would failures be detected?). This might just reflect my limited knowledge of hashing.

@wclodius2
Copy link
Contributor Author

I can add the validation test codes to the distribution, but they will not be compiled for the Standard Library as:

  1. They invoke a C and C++ compiler, unlike the rest of the library.
  2. They rely on a manual makefile and not CMake
  3. The makefile was developed organically and is a bit of a mess.

The `stdlib_32_bit_hash_functions` module provides procedures to
compute 32 bit integer hash codes and a scalar hash.
The hash codes are useful for tables of up to `2**15` entries, and
for keys with a few hundred elements.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this. I think the first part means useful for indexing hash tables with up to 2**15 entries? But I'm not sure what keys are -- is this the object for which we compute a hash (so an array of integers or a character string)? Likely some slight rewording will make that clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be changed. Keys are what is hashed. The true identifier of the data. I have been testing my hash tables with these codes and they seem to be working well of 2**16 random elements. Let me update the documentation after a little more feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the discussion to try to clarify the issues.

@gareth-nx
Copy link
Contributor

I can add the validation test codes to the distribution, but they will not be compiled for the Standard Library as:

1. They invoke a C and C++ compiler, unlike the rest of the library.

2. They rely on a manual makefile and not CMake

3. The makefile was developed organically and is a bit of a mess.

Sure. IMO it's important to have some tests that will fail when bugs are introduced. But I'm not sure what the best approach is. A few possibilities:

  • Perhaps now is a good time to consider allowing C/C++ dependencies in stdlib. This sounds like the cleanest and most rigorous option.
  • Is there a simple alternative to the C-based validation tests (such as a handful of regression tests)?
  • If stdlib integration is not possible for the 'full' tests, then even if we add more limited regression tests, I think it would be good if the full test suite could be released separately - so users can periodically check that it's still doing what it should.

@wclodius2
Copy link
Contributor Author

The current validation tests are computationally intensive, and may raise some objections if they are run for each installation.

I am open to allowing C/C++ in the repository, but then, for some of the library codes, maybe they should be replaced by wrappers to original C, C++, etc. procedures.

Regression tests normally verify whether changes in the code results in changes from the original, and not whether the original is correct. I guess we could generate binary files from the original C/C++ codes and compare against them. However I have been using a random number generator to generate my current test keys to ensure variability in the keys. I suppose we could also have a binary file of keys to test against, but we might need separate files for little endian versus big endian processors.

The third option would be easier on me.

Needed to deal with the changes in the specification of the version of fpm
to be used in processing the branch.
Gareth found some of the discussion ambiguous so I tried to make it clearer.

[ticket: X]
Added the directory, validation, that contains a makefile and source code
for three applications to be run in sequence to test the Fortran versions
of the hash functions in libstdlib.a.against the original C/C++ versions.

[ticket: X]
Fixed the suffix of the README file, and slightly modified the contents.

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
what happened

[ticket: X]
Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are my first questions/comments/suggestions.

First this is a impressive job, @wclodius2 .

I didn't interpret the standard on the bit model like you. I asked a few questions for more details.

@wclodius2
Copy link
Contributor Author

I have managed to mess-up the conversion to Personal Access Tokens. I have validation codes for the hash functions that I am unable to push at the moment. Once I am back to being able to push I will address Jeremie's response.

@wclodius2
Copy link
Contributor Author

I am having problems with the repository. I have had trouble with the conversion to the Personal Access Token, and have upgraded to a MacBook with an M1 processor. The upgrade has forced me to incorporate extensive changes made in the library (by Sebastian Ehlert) to deal with the lack of quad precision for Fortran on the M1 processor. In order to incorporate the revisions and try to work around the problems that started with the PAT change, I decided to clone the current version of the repository and manually incorporate the changes I made to implement hash functions. So I entered the following:

git clone https://github.com/fortran-lang/stdlib
cd stdlib
git remote add william [email protected]:wclodius2/hash_functions2.git
git checkout -b hash_functions2

I then compiled the current standard library's code with the M1 capable version of Fortran, and all went well. I then proceeded to incorporate my hash function code (using add and commit with no reported problems) and it compiled properly. The time then came to push the changes, so I did the following:

git push william

and got the unexpected response

ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

A cursory search on the internet did not indicate to me how to proceed. DO you have any ideas or know of anyone that might know how to proceed to to start a PR on the hash_functions2 branch? FWIW more .git/config yields

[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
        ignorecase = true
        precomposeunicode = true
[remote "origin"]
        url = https://github.com/fortran-lang/stdlib
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master
[remote "william"]
        url = [email protected]:wclodius2/hash_functions2.git
        fetch = +refs/heads/*:refs/remotes/william/*

@jvdp1
Copy link
Member

jvdp1 commented Nov 22, 2021

I am having problems with the repository. I have had trouble with the conversion to the Personal Access Token, and have upgraded to a MacBook with an M1 processor. The upgrade has forced me to incorporate extensive changes made in the library (by Sebastian Ehlert) to deal with the lack of quad precision for Fortran on the M1 processor. In order to incorporate the revisions and try to work around the problems that started with the PAT change, I decided to clone the current version of the repository and manually incorporate the changes I made to implement hash functions. So I entered the following:

git clone https://github.com/fortran-lang/stdlib
cd stdlib
git remote add william [email protected]:wclodius2/hash_functions2.git
git checkout -b hash_functions2

I then compiled the current standard library's code with the M1 capable version of Fortran, and all went well. I then proceeded to incorporate my hash function code (using add and commit with no reported problems) and it compiled properly. The time then came to push the changes, so I did the following:

git push william

and got the unexpected response

ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

A cursory search on the internet did not indicate to me how to proceed. DO you have any ideas or know of anyone that might know how to proceed to to start a PR on the hash_functions2 branch? FWIW more .git/config yields

[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
        ignorecase = true
        precomposeunicode = true
[remote "origin"]
        url = https://github.com/fortran-lang/stdlib
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master
[remote "william"]
        url = [email protected]:wclodius2/hash_functions2.git
        fetch = +refs/heads/*:refs/remotes/william/*

@wclodius2 The url for the remote "william" is wrong. To modify it, inside the stdlib directory

git remote set-url william [email protected]:wclodius2/stdlib.git

then you should be able to push your branch to william as:

git push william hash_functions2

I hope that this helps.

@wclodius2
Copy link
Contributor Author

It did thanks. I have updated the code to reflect my response to the comments of Jeremie (@jvdp1), created a new branch hash_functions2, and generated a new pull request. Sorry for the new branch and pull request. I will be closing this PR in a couple of days after people have a chance to respond to my responses for my requests for changes, and the generation of the new PR.

@jvdp1
Copy link
Member

jvdp1 commented Nov 24, 2021

Thank you @wclodius2 for your answers. I left a few other comments.

@gareth-nx could you check @wclodius2 's answers to your comments, please? If they satisfy you, let us know, please. Then I will close this PR in favour of #573

@gareth-nx
Copy link
Contributor

Hi @jvdp1 and @wclodius2

I'm happy with the answers to my comments. To my understanding the biggest outstanding issue is what should be in the test-suite (although I know @wclodius2 has made some additional changes related to that, which I haven't checked).

So it's fine with me if we close this. I actually started doing a review in the other PR, and added a comment just now linking to this discussion, so we can find it easily.

@wclodius2
Copy link
Contributor Author

Hi @gareth-nx, for the validation code check out .../stdlib/src/tests/hash_functions/validation. The validation code can be thought of as consisting of three executables:

  1. generate_key_array, which generates a file, key_array.bin, containing a sequence of 2048 random INT8 integers;
  2. generate_hash_arrays, which reads in the file, key_array.bin, and generates a file for each complicated hashing method containing the sequence of hash values generated by the C/C++ versions of the hash code, to the first 0, 1, 2, ..., 2048 eight bit integers in key_array.bin; and
  3. hash_validity_test, which reads in the file, key_array.bin, and then, for each complicated hashing method, compares the contents of the corresponding file generated by generate_hash_arrays, with the hash value generated by using the Fortran hash function on the corresponding sequence of eight bit integers.If the bits of the corresponding hash values are not identical it stops with an error message.

Note: The makefile, Makefile.validation, generates the three executables from the source code in the file. Currently the makefile is specialized for the GCC compiler suite, as the Intel C compiler cannot compile one of the source files, that is designed to be compiled by either gcc or MSVC.

@jvdp1
Copy link
Member

jvdp1 commented Nov 27, 2021

Thank you all. I will close this PR in favour of #573

@jvdp1 jvdp1 closed this Nov 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reviewers needed This patch requires extra eyes topic: algorithms searching and sorting, merging, ...
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants