Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revised Hash functions incorporating changes in the main Stdlib repository. #573

Merged
merged 126 commits into from
Dec 22, 2021

Conversation

wclodius2
Copy link
Contributor

Numerous changes were made in the main repository to deal with the lack of double precision in gfortran's implementation for the Mac M1 processor, and a few other issues, meant that the hash functions source code needed to be merged with the main repository. This proved overwhelming so I did a clone of the main repository, copied the changes for the hash function codes into that, made further changes in the code in response to comments on the original PR, and generated a new PR. As with the original

I have created a number of hash functions for incorporation in the Fortran standard library by translating from C or C++ a number of algorithms recommended by Reini Urban who maintains a well respected code, SMHasher, for testing the properties and performance of hash functions. The codes can hash default character strings and integer vectors. The hash functions are:

  • fibonacci hash: a scalar hash used to map 32 or 64 bit integers to power of two sized hash tables.
  • 32 and 64 bit versions of the FNV-1 and FNV-1a: simple hash functions with excellent performance for small keys, whose poor statistical properties appear to have only a minor impact on the performance of small hash tables.
  • nmhash32 and nmhash32x: 32 bit hashes with excellent statistical properties, but whose performance, so far, seems to be poor.
  • water_hash: a 32 bit hash function with excellent statistical performance and very good performance on large keys.
  • pengy hash: a 64 bit hash function with excellent statistical performance and good performance on large keys.
  • SpookyV2 hash: a 128 bit hash with excellent statistical performance and excellent performance on large keys.
  • A number of random seed generators for the above hashes.

Bringing over codes fro original hash_functions branch

[ticket: X]
Brought over codes from the original hash_functions branch.

[ticket: X]
Brought over code from original hash_functions

[ticket: X]
Modified so it doesn't check overflow of integers.

[ticket: X]
Added the gcc compiler flag no-range-checks

[ticket: X]
Added files with hash function modules to the list of fypp files to be processed.

[ticket: X]
Added hash_functions to the test directories

[ticket: X]
Added documentation for the hash functions.

[ticket: X]
Removed two filesusedd in an older version of hash validation.

[ticket: X]
Added generate_key_array.f90 generate_hash_arrays.f90 and hash_validity_test..f90
to implement a new version of the validity test.

[ticket: X]
Modified the Makefile so it would compile the new validation.

[ticket: X]
Added README>md that is documentation for the code in the hash function validation
directory.

[ticket: X]
Implemented fixes to various files addressing problems brought up by
Jeremie Vandenplas.

[ticket: X]
Change a compile flag for gfortran from -fwrapv to -fno-range-check.

[ticket: X]
Changed from parentheses to square brackets.

[ticket: X]
@jvdp1 jvdp1 mentioned this pull request Nov 24, 2021
@gareth-nx
Copy link
Contributor

Note the previous PR is here (with initial discussions): #554

@wclodius2 wclodius2 changed the title Revised Hash functions incorporating changes in the min Stdlib repository. Revised Hash functions incorporating changes in the main Stdlib repository. Nov 25, 2021
Copy link
Contributor

@gareth-nx gareth-nx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in the previous pull request, this looks very impressive.

Most of my comments are quite minor.

I appreciate that you've added some regression tests. For now they are outside of the regular stdlib test-suite (due to dependency on C/C++), and I think we should discuss with others whether that is the right approach for now, or whether now is the time to start supporting C/C++ dependencies. I don't have a deep enough knowledge on the software engineering implications to make strong statements about this.

Changed three files "CMakeLists.txt", "stdlib_hash_functions.md", and
"stdlib_64_bith_hash_functions.fypp"in response to @gareth-nx's comments

[ticket: X]
Incorporated more changes in stdlib_hash_functions.md based on comments by
@garreth-nx.

[ticket: X]
Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wclodius2 . Here are some comments on the specs.

In general it is an impressive amount of work, with well detailed documentation.

I will now focus on the code

wclodius2 and others added 5 commits November 27, 2021 16:39
Renamed stdlib_hash_functions.md to stdlib_hash_procedures.md.

[ticket: X]
Renamed stdlib_hash_functions.md to stdlib_hash_proceedures.md effectively
deleting it.

[ticket: X]
Changed documentation in ways suggested by Jeremie's comments.

[ticket: X]
Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some additional comments

Comment on lines 9 to 12
!! The algorithms come in multiple versions, depending on whether the
!! vectorized instructions SSE2 or AVX2 are available. As neither instruction
!! is available in portable Fortran 2008, the algorithms that do not use these
!! instructions are used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for further improvements: would it be possible that the compiler select the right code when e.g., mentioning the option for supporting AVX2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure if a preprocessor can inquire of its existence from the compiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative to preprocessing, could be to use a logical parameter and rely upon the compiler's ability to remove dead code (similar to the little-endian parameter). A second option is to pack the different implementations in submodules, which are then substituted at link time based upon a flag to the build system. But personally I'm happy with the portable version.

Comment on lines 187 to 214
interface spooky_init

module subroutine spookysubhash_init( self, seed )
type(spooky_subhash), intent(out) :: self
integer(int_hash), intent(in) :: seed(2)
end subroutine spookysubhash_init

end interface spooky_init


interface spooky_update

module subroutine spookyhash_update( spooky, key )
type(spooky_subhash), intent(out) :: spooky
integer(int8), intent(in) :: key(0:)
end subroutine spookyhash_update

end interface spooky_update


interface spooky_final

module subroutine spookyhash_final(spooky, hash_code)
type(spooky_subhash), intent(inout) :: spooky
integer(int_hash), intent(inout) :: hash_code(2)
end subroutine spookyhash_final

end interface spooky_final
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am correct, there are private interfaces. If so, could they be moved to the submodules? Or what is the reasons to keep them here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the basis of an incremental hasher, and were originally public.I eventually decided that including the incremental hasher complicated the API in ways that were inappropriate for an initial release, though they could be easily be made public.

@wclodius2
Copy link
Contributor Author

I am having some problems with git. I accepted a couple of of suggestions of @jvdp1 to modify the stdlib_hash_functions.md in https://github.com/wclodius2/stdlib/tree/hash_functions2, but most of his suggested changes were significantly modified, and entered on my local copy, including the implicit suggestion that stdlib_hash_functions.md be renamed stdlib_hash_procedures.md. I therefore wanted to pull https://github.com/wclodius2/stdlib/tree/hash_functions2, merge it with my local copy, and then push the resulting code back to the GitHub branch. I initially tried to initially push my code to the GitHub branch

git push william
To github.com:wclodius2/stdlib.git
 ! [rejected]        hash_functions2 -> hash_functions2 (fetch first)
error: failed to push some refs to 'github.com:wclodius2/stdlib.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

I didn't find the 'Note about fast-forwards' useful so I tried pulling the remote, and ended up pulling https://github.com/fortran-lang/stdlib

git pull origin master
hint: Pulling without specifying how to reconcile divergent branches is
hint: discouraged. You can squelch this message by running one of the following
hint: commands sometime before your next pull:
hint: 
hint:   git config pull.rebase false  # merge (the default strategy)
hint:   git config pull.rebase true   # rebase
hint:   git config pull.ff only       # fast-forward only
hint: 
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
hint: invocation.
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 34 (delta 16), reused 23 (delta 14), pack-reused 0
Unpacking objects: 100% (34/34), 10.37 KiB | 442.00 KiB/s, done.
From https://github.com/fortran-lang/stdlib
 * branch            master     -> FETCH_HEAD
   57cfaf0..3af3259  master     -> origin/master
Auto-merging src/Makefile.manual
Auto-merging src/CMakeLists.txt
Auto-merging doc/specs/index.md
Auto-merging Makefile.manual
Auto-merging CMakeLists.txt
Merge made by the 'recursive' strategy.
 .github/PULL_REQUEST_TEMPLATE.md |  4 +++
 .github/workflows/CI.yml         |  4 ++-
 API-doc-FORD-file.md             |  3 ++
 CMakeLists.txt                   | 12 ++++++--
 Makefile.manual                  |  6 ++++
 VERSION                          |  1 +
 ci/fpm-deployment.sh             |  6 ++++
 ci/fpm.toml                      |  2 +-
 doc/specs/index.md               |  1 +
 doc/specs/stdlib_version.md      | 64 +++++++++++++++++++++++++++++++++++++++++++
 src/CMakeLists.txt               |  4 +++
 src/Makefile.manual              |  3 +-
 src/common.fypp                  |  3 ++
 src/stdlib_version.fypp          | 65 ++++++++++++++++++++++++++++++++++++++++++++
 14 files changed, 173 insertions(+), 5 deletions(-)
 create mode 100644 .github/PULL_REQUEST_TEMPLATE.md
 create mode 100644 VERSION
 create mode 100644 doc/specs/stdlib_version.md
 create mode 100644 src/stdlib_version.fypp

So now my local copy is up to date WRT https://github.com/fortran-lang/stdlib, but not WRT https://github.com/wclodius2/stdlib/tree/hash_functions2. Another git push failed with its earlier error message. I suspect I have to change my remotes, but how I am to change them I don't know.

git remote -v
origin	https://github.com/fortran-lang/stdlib (fetch)
origin	https://github.com/fortran-lang/stdlib (push)
william	[email protected]:wclodius2/stdlib.git (fetch)
william	[email protected]:wclodius2/stdlib.git (push)

Any suggestions?

@jvdp1
Copy link
Member

jvdp1 commented Nov 28, 2021

After commiting suggestions on github, you must pull the remote branch into your local one as follows:

git pull william hash_functions2

If there are no conflicts, then all changes will be merged in your local branch, and you can then push your branch.
If there are commites, you need first to resolve them manually in your local project.

@awvwgk
Copy link
Member

awvwgk commented Dec 19, 2021

How critical is it that I change the names in the documentation to lower case?

More for consistency with existing specs, so far we were using always lowercase for procedure names. I don't really mind, but this is something that was noticeable on the first glance.

@jvdp1
Copy link
Member

jvdp1 commented Dec 19, 2021

How critical is it that I change the names in the documentation to lower case?

More for consistency with existing specs, so far we were using always lowercase for procedure names. I don't really mind, but this is something that was noticeable on the first glance.

I opened a PR against the @wclodius2 branch for modifying all names. hopefully I missed none of them...

@wclodius2
Copy link
Contributor Author

I have now merged in

  1. @jvdp1 lower causing of the UPPER case procedure names, and typo fixes;
  2. The recent changes in the master standard library

The result passes all checks, and I believe is ready to merge into the stdlib master.

Copy link
Member

@awvwgk awvwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, it looks great. I'm happy to get it merged.

A final suggestion, we should add the new modules to the CHANGELOG.md.

@awvwgk awvwgk removed the reviewers needed This patch requires extra eyes label Dec 21, 2021
Updated CHANGELOG.md to reflect the new modules stdlib_hash_32bit and
stdlib_hash_64bit.

[ticket: X]
@wclodius2
Copy link
Contributor Author

I believe I have properly added the modules to CHANGELOG.md.

@gareth-nx
Copy link
Contributor

I think this is ready to be merged? Any outstanding issues @milancurcic @ivan-pi ?

Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added the names of the public procedures in the CHANGELOG.md

Copy link
Member

@ivan-pi ivan-pi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no major comments anymore. Thanks for all of your work, especially William.

I think I noticed some small formatting issues in the itemized sections of the markdown document, but would need to render it first. These can be dealt with in a separate PR.

@jvdp1
Copy link
Member

jvdp1 commented Dec 22, 2021

@milancurcic I believe all your comments were answered. However, I am afraid I could miss some. Could you check please, and merge this PR if it it the case?

Copy link
Member

@milancurcic milancurcic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you William and reviewers for this great addition!

@milancurcic milancurcic merged commit 9f1aa24 into fortran-lang:master Dec 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: algorithms searching and sorting, merging, ...
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants