Hash functions #554

wclodius2 · 2021-10-19T21:30:18Z

I have created a number of hash functions for incorporation in the Fortran Standard Library by translating from C or C++ several algorithms recommended by Reini Urban, who maintains SMHasher, a well-respected code for testing the properties and performance of hash functions. The codes can hash default character strings and integer vectors. The hash functions are:

Fibonacci hash: a scalar hash used to map 32 or 64 bit integers to power of two sized hash tables.
32 and 64 bit versions of FNV-1 and FNV-1a: simple hash functions with excellent performance for small keys, whose poor statistical properties appear to have only a minor impact on the performance of small hash tables.
nmhash32 and nmhash32x: 32 bit hashes with excellent statistical properties, but whose performance, so far, seems to be poor.
water_hash: a 32 bit hash function with excellent statistical performance and very good performance on large keys.
pengy hash: a 64 bit hash function with excellent statistical performance and good performance on large keys.
SpookyV2 hash: a 128 bit hash with excellent statistical performance and excellent performance on large keys.
A number of random seed generators for the above hashes.
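
For readers new to these algorithms, the two simplest entries in the list above can be sketched in a few lines of C, the language most of the originals were written in. This is a minimal illustration using the published FNV-1a constants and the usual golden-ratio multiplier; the function names `fnv1a_32` and `fibonacci_hash_32` are made up here and are not the names or interfaces used in this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* 32 bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
   The multiplication wraps modulo 2**32 because uint32_t is unsigned. */
uint32_t fnv1a_32(const unsigned char *key, size_t len)
{
    uint32_t hash = 2166136261u;      /* FNV offset basis, 0x811C9DC5 */
    for (size_t i = 0; i < len; ++i) {
        hash ^= key[i];
        hash *= 16777619u;            /* FNV prime, 0x01000193 */
    }
    return hash;
}

/* Fibonacci hash: multiply by 2**32 divided by the golden ratio (0x9E3779B9)
   and keep the top `bits` bits. Assumes 1 <= bits <= 32. */
uint32_t fibonacci_hash_32(uint32_t key, unsigned bits)
{
    return (uint32_t)(key * 2654435769u) >> (32 - bits);
}
```

For example, `fnv1a_32` applied to the single byte `a` gives `0xE40C292C`, and `fibonacci_hash_32(key, 16)` yields an index into a table of `2**16` slots.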

I have other hash functions that can potentially be incorporated in the library. Of these the most interesting is probably WY hash, a 64 bit hash function with excellent statistical properties and performance, but a reputation for many bad seeds that I haven't figured out how to avoid in its seed generator. I also have Larsen and Sedgewick hashes as possible replacements/alternatives to the FNV hashes if anyone can show that they are public domain. Lookup3, murmur2, and murmur3 perform well on large keys, but SMHasher claims they have poor statistics and large numbers of bad seeds. Mir hash has poor performance and bad seeds. MX3 has poor performance.

I have translated a number of hash functions from C or C++ to Fortran 90, and have incorporated them into the current stdlib distribution. [ticket: X]

Updated doc/specs/index.md file to include a reference to the hash functions documentation. [ticket: X]

The hash functions require integer values expressed as hexadecimals that map to negative integers under the normal unsigned integer to two's complement signed integer conversion. This conversion is undefined under the Fortran Standard, but it is the default behavior for gfortran 10 and 11 and all recent versions of ifort; gfortran 9 and earlier require the compiler flag `-fno-range-check`. [ticket: X]
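
The mapping described here can be seen directly in C. The sketch below reinterprets the bits of an unsigned 32 bit constant as a two's complement signed integer; `as_signed_32` is a hypothetical helper written for this illustration and is not part of the PR.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32 bits of an unsigned value as a signed two's complement
   integer. memcpy copies the bit pattern unchanged, which avoids relying on
   implementation-defined unsigned-to-signed conversion. */
int32_t as_signed_32(uint32_t bits)
{
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The FNV offset basis `0x811C9DC5` is `2166136261` as an unsigned value, which exceeds the largest 32 bit signed integer; reinterpreted as signed it becomes `-2128831035`. Constants of this kind are what the range check objects to.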

Corrected spelling.

I fixed mistakes in the stdlib/Makefile.Manual and stdlib/src/Makefile.Manual. [ticket: X]

Changed filenames in stdlib/src/tests/hash_functions/Makefile.manual [ticket: X]

jvdp1 · 2021-10-20T17:28:08Z

Thank you @wclodius2 for this PR. I will try to review and test it next week.

wclodius2 · 2021-10-20T17:46:57Z

Great!

Changed the comments to: 1. Note that the algorithm was version 2. 2. Correct the comment on the underlying algorithm for NMHASH32X_9to255. gfortran under strict checking didn't like specifying the arguments to NMH_READLE32 and NMH_READLE16 as p(:) and preferred p(1:4) and p(1:2) respectively. NMHASH32X_9to255 was not as endian independent as I liked, so I changed some of its code. [ticket: X]
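
Endian independence in this kind of code usually comes down to assembling multi-byte words from individual bytes rather than copying memory. A minimal C sketch of that idea follows; `read_le32` and `read_le16` are illustrative names and are not the `NMH_READLE32` and `NMH_READLE16` routines from the PR.

```c
#include <stdint.h>

/* Assemble a 32 bit little-endian value from four bytes. The result is the
   same on little- and big-endian hosts because only byte values are used. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Same idea for a 16 bit value stored in two bytes. */
uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

The fixed byte counts (four and two) mirror the explicit `p(1:4)` and `p(1:2)` slices mentioned above.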

gareth-nx · 2021-11-05T10:18:02Z

src/stdlib_32_bit_water_hashes.fypp

+!! of the following processors are known to implement the required Fortran
+!! 2008 extensions and default to runtime wrap around overflow: FLANG,
+!! gfortran, ifort, and NAG Fortran. Older versions of gfortran will require
+!! the compiler flag, `-fno-range-check`, to ensure wrap around semantics


Which versions of gfortran require the flag? I think stdlib might only support gfortran 9+ -- if those versions do not need it, then I would lean toward removing it.

While perhaps not a big deal, it could lead to complications (e.g. if one compiles stdlib with user-specified flags, this could be missed, so might require documentation in the main README or similar).

wclodius2: Version 9 and earlier require the flag.

gareth-nx · 2021-11-05T10:19:22Z

CMakeLists.txt

@@ -21,6 +21,7 @@ if(CMAKE_Fortran_COMPILER_ID STREQUAL GNU)
  endif()
  add_compile_options(-fimplicit-none)
  add_compile_options(-ffree-line-length-132)
+  add_compile_options(-fno-range-check)


See my comment in the documentation about this -- if it isn't needed for gfortran 9+ it might be better not to have this -- otherwise if it is needed, we should consider whether it needs documenting in the README (as it is easy to compile stdlib with user-specified flags, which could miss this).

wclodius2: It is required for version 9.

gareth-nx · 2021-11-05T10:19:41Z

Makefile.manual

@@ -1,7 +1,7 @@
 # Fortran stdlib Makefile

 FC ?= gfortran
-FFLAGS ?= -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all
+FFLAGS ?= -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all -fno-range-check


As per previous comment

wclodius2: See my previous comments.

doc/specs/index.md

doc/specs/stdlib_hash_functions.md

gareth-nx · 2021-11-05T10:32:06Z

src/stdlib_32_bit_nmhashes.fypp

+        integer(int64) :: i, j, r
+
+        ! base mixer: [f0d9649b  5 -13 29a7935d -9 11 55d35831 -20 -10 ] =
+        ! 0.93495901789135362


I don't understand this comment

wclodius2: It is from the original code. I think it implies that the code is equivalent to multiplying by the floating point number and rounding (down?).

gareth-nx · 2021-11-05T10:33:03Z

src/stdlib_32_bit_nmhashes.fypp

+        integer(int32), intent(in) :: seed
+        integer(int32) :: hash
+
+        ! [bdab1ea9 18 a7896a1b 12 83796a2d 16] = 0.092922873297662509


I don't understand this comment

Oh I see it's related to the operations below. Perhaps just a brief initial sentence here and elsewhere, to aid those of us without experience in hashing.

gareth-nx · 2021-11-05T10:39:55Z

src/stdlib_64_bit_spookyv2_hashes.fypp

+
+        select case(remainder)
+        case(15)
+            go to 115


In this situation, is there any natural alternative to the goto's? I do not wish to be dogmatic about it -- but if it can be avoided without much complication, that is good.

wclodius2: The corresponding C code was implemented with switches with no breaks. I could implement it with a computed go to, but that is not encouraged in the current language. I could also implement it with a sequence of if statements, but that introduces some overhead for unnecessary tests, unless the compiler is very smart about optimization.
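
The C idiom being translated is a `switch` whose cases deliberately fall through, so that entering at one case also executes every case below it. The sketch below shows the pattern on a made-up tail-packing routine; `pack_tail` is hypothetical and is not the SpookyV2 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack the trailing 0 to 3 bytes of a key into one word. There are no break
   statements between the cases, so entering at case 3 also runs cases 2
   and 1. Any other remainder leaves the word at zero. */
uint32_t pack_tail(const unsigned char *tail, size_t remainder)
{
    uint32_t word = 0;
    switch (remainder) {
    case 3: word |= (uint32_t)tail[2] << 16;  /* fall through */
    case 2: word |= (uint32_t)tail[1] << 8;   /* fall through */
    case 1: word |= (uint32_t)tail[0];        /* fall through */
    default: break;
    }
    return word;
}
```

Entering at `case 3` runs all three assignments. A Fortran `select case` executes only the matching block, which is why the translation jumps to labelled statements instead.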

src/tests/hash_functions/test_64_bit_hash_performance.f90

gareth-nx · 2021-11-05T10:43:45Z

src/tests/hash_functions/test_32_bit_hash_performance.f90

@@ -0,0 +1,190 @@
+program test_32_bit_hash_performance
+!! Program to compare the relative performance of different 32 bit hash
+!! functions


Are there also tests of correctness? We would want the tests to fail if a bug is introduced.

wclodius2: I have tests for correctness that compare with the output of the corresponding C codes, but the tests are outside the standard library as they require invoking C and C++ compilers.

gareth-nx

This looks great. I am not an expert on hashing, but have a couple of questions. However my general impression is that the documentation is impressive, as is the functionality.

I am not sure though whether the testing is sufficient (i.e. if we introduced some bugs, would failures be detected?). This might just reflect my limited knowledge of hashing.

wclodius2 · 2021-11-05T21:10:56Z

I can add the validation test codes to the distribution, but they will not be compiled for the Standard Library as:

1. They invoke a C and C++ compiler, unlike the rest of the library.
2. They rely on a manual makefile and not CMake.
3. The makefile was developed organically and is a bit of a mess.

gareth-nx · 2021-11-05T22:44:02Z

doc/specs/stdlib_hash_functions.md

+The `stdlib_32_bit_hash_functions` module provides procedures to
+compute 32 bit integer hash codes and a scalar hash. 
+The hash codes are useful for tables of up to `2**15` entries, and
+for keys with a few hundred elements.


I don't quite understand this. I think the first part means useful for indexing hash tables with up to 2**15 entries? But I'm not sure what keys are -- is this the object for which we compute a hash (so an array of integers or a character string)? Likely some slight rewording will make that clearer.

wclodius2: This needs to be changed. Keys are what is hashed: the true identifier of the data. I have been testing my hash tables with these codes and they seem to be working well for 2**16 random elements. Let me update the documentation after a little more feedback.

wclodius2: I have changed the discussion to try to clarify the issues.

gareth-nx · 2021-11-05T23:02:12Z

> I can add the validation test codes to the distribution, but they will not be compiled for the Standard Library as:
>
> 1. They invoke a C and C++ compiler, unlike the rest of the library.
> 2. They rely on a manual makefile and not CMake
> 3. The makefile was developed organically and is a bit of a mess.

Sure. IMO it's important to have some tests that will fail when bugs are introduced. But I'm not sure what the best approach is. A few possibilities:

1. Perhaps now is a good time to consider allowing C/C++ dependencies in stdlib. This sounds like the cleanest and most rigorous option.
2. Is there a simple alternative to the C-based validation tests (such as a handful of regression tests)?
3. If stdlib integration is not possible for the 'full' tests, then even if we add more limited regression tests, I think it would be good if the full test suite could be released separately - so users can periodically check that it's still doing what it should.

wclodius2 · 2021-11-06T00:14:40Z

The current validation tests are computationally intensive, and may raise some objections if they are run for each installation.

I am open to allowing C/C++ in the repository, but then, for some of the library codes, maybe they should be replaced by wrappers to original C, C++, etc. procedures.

Regression tests normally verify whether changes in the code result in changes from the original, and not whether the original is correct. I guess we could generate binary files from the original C/C++ codes and compare against them. However, I have been using a random number generator to generate my current test keys to ensure variability in the keys. I suppose we could also have a binary file of keys to test against, but we might need separate files for little endian versus big endian processors.

The third option would be easier on me.

Needed to deal with the changes in the specification of the version of fpm to be used in processing the branch.

Gareth found some of the discussion ambiguous so I tried to make it clearer. [ticket: X]

Added the directory, validation, that contains a makefile and source code for three applications to be run in sequence to test the Fortran versions of the hash functions in libstdlib.a against the original C/C++ versions. [ticket: X]

Fixed the suffix of the README file, and slightly modified the contents. [ticket: X]

jvdp1

Here are my first questions/comments/suggestions.

First, this is an impressive job, @wclodius2.

I didn't interpret the standard on the bit model like you. I asked a few questions for more details.

CMakeLists.txt

src/stdlib_32_bit_water_hashes.fypp

doc/specs/stdlib_hash_functions.md

doc/specs/index.md

doc/specs/stdlib_hash_functions.md

src/stdlib_32_bit_hash_functions.fypp

src/tests/hash_functions/validation/hash_validity_test.f90

wclodius2 · 2021-11-18T05:40:51Z

I have managed to mess up the conversion to Personal Access Tokens. I have validation codes for the hash functions that I am unable to push at the moment. Once I am back to being able to push I will address Jeremie's response.

wclodius2 · 2021-11-22T05:15:30Z

I am having problems with the repository. I have had trouble with the conversion to the Personal Access Token, and have upgraded to a MacBook with an M1 processor. The upgrade has forced me to incorporate extensive changes made in the library (by Sebastian Ehlert) to deal with the lack of quad precision for Fortran on the M1 processor. In order to incorporate the revisions and try to work around the problems that started with the PAT change, I decided to clone the current version of the repository and manually incorporate the changes I made to implement hash functions. So I entered the following:

git clone https://github.com/fortran-lang/stdlib
cd stdlib
git remote add william git@github.com:wclodius2/hash_functions2.git
git checkout -b hash_functions2

I then compiled the current standard library's code with the M1 capable version of Fortran, and all went well. I then proceeded to incorporate my hash function code (using add and commit with no reported problems) and it compiled properly. The time then came to push the changes, so I did the following:

git push william

and got the unexpected response

ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

A cursory search on the internet did not indicate to me how to proceed. Do you have any ideas or know of anyone that might know how to proceed to start a PR on the hash_functions2 branch? FWIW more .git/config yields

[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
        ignorecase = true
        precomposeunicode = true
[remote "origin"]
        url = https://github.com/fortran-lang/stdlib
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master
[remote "william"]
        url = git@github.com:wclodius2/hash_functions2.git
        fetch = +refs/heads/*:refs/remotes/william/*

jvdp1 · 2021-11-22T09:53:06Z

@wclodius2 The url for the remote "william" is wrong. To modify it, inside the stdlib directory

git remote set-url william git@github.com:wclodius2/stdlib.git

then you should be able to push your branch to william as:

git push william hash_functions2

I hope that this helps.

wclodius2 · 2021-11-22T14:20:02Z

It did, thanks. I have updated the code to reflect my response to the comments of Jeremie (@jvdp1), created a new branch hash_functions2, and generated a new pull request. Sorry for the new branch and pull request. I will be closing this PR in a couple of days, after people have had a chance to respond to my replies to the requests for changes and to the new PR.

jvdp1 · 2021-11-24T19:00:14Z

Thank you @wclodius2 for your answers. I left a few other comments.

@gareth-nx could you check @wclodius2's answers to your comments, please? If they satisfy you, let us know, please. Then I will close this PR in favour of #573

gareth-nx · 2021-11-24T21:37:21Z

Hi @jvdp1 and @wclodius2

I'm happy with the answers to my comments. To my understanding the biggest outstanding issue is what should be in the test-suite (although I know @wclodius2 has made some additional changes related to that, which I haven't checked).

So it's fine with me if we close this. I actually started doing a review in the other PR, and added a comment just now linking to this discussion, so we can find it easily.

wclodius2 · 2021-11-24T23:14:36Z

Hi @gareth-nx, for the validation code check out .../stdlib/src/tests/hash_functions/validation. The validation code can be thought of as consisting of three executables:

generate_key_array, which generates a file, key_array.bin, containing a sequence of 2048 random INT8 integers;
generate_hash_arrays, which reads in the file, key_array.bin, and generates a file for each complicated hashing method containing the sequence of hash values produced by applying the C/C++ version of the hash code to the first 0, 1, 2, ..., 2048 eight bit integers in key_array.bin; and
hash_validity_test, which reads in the file, key_array.bin, and then, for each complicated hashing method, compares the contents of the corresponding file generated by generate_hash_arrays with the hash values generated by using the Fortran hash function on the corresponding sequences of eight bit integers. If the bits of the corresponding hash values are not identical it stops with an error message.
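
The comparison step described above can be sketched as a loop over prefix lengths. The C below is a minimal illustration under stated assumptions: it works on in-memory arrays rather than the binary files, `first_mismatch` is a hypothetical name, and `toy_hash` is a stand-in used only to exercise the checker, not one of the hashes in this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Signature shared by the reference and the translated hash in this sketch. */
typedef uint32_t (*hash32_fn)(const unsigned char *key, size_t len);

/* Stand-in hash used only to exercise the checker. */
uint32_t toy_hash(const unsigned char *key, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; ++i) {
        h = h * 31u + key[i];
    }
    return h;
}

/* Hash the first 0, 1, ..., n bytes of `keys` and compare each result with
   expected[len], so `expected` must hold n + 1 values. Returns the first
   mismatching length, or -1 if every prefix agrees. */
long first_mismatch(hash32_fn hash, const unsigned char *keys, size_t n,
                    const uint32_t *expected)
{
    for (size_t len = 0; len <= n; ++len) {
        if (hash(keys, len) != expected[len]) {
            return (long)len;
        }
    }
    return -1;
}
```

Checking every prefix length, including the empty key, exercises the short-key, tail-handling, and long-key code paths of a hash in a single pass.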

Note: The makefile, Makefile.validation, generates the three executables from the source code in the file. Currently the makefile is specialized for the GCC compiler suite, as the Intel C compiler cannot compile one of the source files, which is designed to be compiled by either gcc or MSVC.

jvdp1 · 2021-11-27T16:31:56Z

Thank you all. I will close this PR in favour of #573

wclodius2 added 6 commits October 19, 2021 14:24

Hash functions for the Fortran Standard Library

f731dd5

Updated doc/specs/index.md file

b8f84f7

subject line

1c31148

Fixed Makefiles

a38867d

Fixed Makefile.manual

d8b2fde

jvdp1 added the "reviewers needed" (This patch requires extra eyes) and "topic: algorithms" (searching and sorting, merging, ...) labels Oct 20, 2021

gareth-nx reviewed Nov 5, 2021

wclodius2 added 4 commits November 6, 2021 11:38

Merge https://github.com/fortran-lang/stdlib into hash_functions

00e622a

Changed discussion of keys and stdlib_32_bit_hash_functions,md

3489f00

Changed README.mmd to README.md

1134094

wclodius2 added 19 commits November 12, 2021 20:26

Trying to get to where I can rebase

ff62a21

what happened [ticket: X]

Getting to where I can merge

c68388b

what happened [ticket: X]

Getting to rebase

dda8997

what happened [ticket: X]

getting to rebase

a68c1f5

what happened [ticket: X]

Getting ready for rebase

5d3e281

what happened [ticket: X]

Getting ready for rebase

8cd021d

what happened [ticket: X]

Getting ready for rebase

38e7cae

what happened [ticket: X]

Reset the executable bits

29ae7cb

what happened [ticket: X]

Corrected bit permissions.

3cf3bea

[ticket: X]

Changed file permissions to non-executable.

2395e93

what happened [ticket: X]

Corrected executable mode

bd41342

what happened [ticket: X]

Corrected file permissions.

ab89f1f

what happened [ticket: X]

Corrected executable permissions

13ef4bd

what happened [ticket: X]

Corrected executable mode.

fb8394e

what happened [ticket: X]

Corrected executable bit.

c2cb605

what happened [ticket: X]

Corrected executable bit.

7087ccc

what happened [ticket: X]

Corrected executable bit.

57664f7

what happened [ticket: X]

Corrected executable bit.

43bca7f

what happened [ticket: X]

Corrected executable bit.

b7f4fc8

what happened [ticket: X]

jvdp1 reviewed Nov 13, 2021

gareth-nx mentioned this pull request Nov 24, 2021

Revised Hash functions incorporating changes in the main Stdlib repository. #573

Merged

jvdp1 closed this Nov 27, 2021
