Use SipHash-1-3 for all hashing, through a resilient hashing interface #14913

lorentey · 2018-03-01T20:01:46Z

This PR implements resilient hashing for stdlib types and types with synthesized Hashable implementations. It changes the standard hash function to SipHash-1-3 with a per-execution random seed.

Background

Hash tables are probabilistic data structures; their expected performance depends on (implicit and largely unknown) assumptions on the statistical distribution of their keys and the properties of the hash function that is used to derive bucket indices for individual key values.

Weak hash functions can enable accidental (or deliberate) skewing of the key distribution in such a way that collisions become more likely. In extreme cases, lookup performance may become linear rather than the promised O(1), which may turn other operations using lookups quadratic rather than linear, etc. This is not a hypothetical concern: there have been major cases of successful denial of service attacks targeted against hash tables implemented in various programming languages and libraries.

In general purpose collection types like Swift's Set or Dictionary, we can't meaningfully constrain the distribution of keys. Our only option to protect against hash collision cascades is to choose a hash function that's strong enough to handle all key distributions, ideally even those that deliberately try to defeat it.
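
To make the collision problem concrete, here is a minimal sketch of a chained table whose hash is deliberately weak (key modulo bucket count). `ToyTable` and its members are hypothetical names invented for this illustration; this is not how Set or Dictionary are implemented.

```swift
// Toy separate-chaining table that only tracks how long each chain gets.
// `bucket(for:)` is a deliberately weak, hypothetical hash: key modulo bucket count.
struct ToyTable {
    private(set) var buckets: [[Int]]

    init(bucketCount: Int) {
        buckets = Array(repeating: [], count: bucketCount)
    }

    func bucket(for key: Int) -> Int {
        // Weak hash: the low bits of the key decide everything.
        return Int(UInt(bitPattern: key) % UInt(buckets.count))
    }

    mutating func insert(_ key: Int) {
        let index = bucket(for: key)
        if !buckets[index].contains(key) {
            buckets[index].append(key)
        }
    }

    // Length of the longest chain, i.e. the worst-case number of comparisons per lookup.
    var longestChain: Int {
        return buckets.map { $0.count }.max() ?? 0
    }
}

let bucketCount = 64

// Well-spread keys: 0, 1, 2, ... fill the buckets evenly.
var friendly = ToyTable(bucketCount: bucketCount)
for key in 0..<256 { friendly.insert(key) }

// Skewed keys: every key is a multiple of the bucket count, so all land in bucket 0.
var hostile = ToyTable(bucketCount: bucketCount)
for key in 0..<256 { hostile.insert(key * bucketCount) }

print("friendly longest chain:", friendly.longestChain) // 4
print("hostile longest chain:", hostile.longestChain)   // 256
```

With the skewed keys, every lookup has to walk a single chain holding all 256 entries, which is the linear behavior described above.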

The Status Quo

Our current approach is not particularly great in this regard: the choice of the hash function is mostly in the hands of the programmer implementing hashValue, and we provide no public API to guide them. Implementing a good hash function requires careful consideration and specialist knowledge; it takes time and effort they clearly can't be expected to invest.

To fix this, we should start by improving cases where hashValue is under Swift's control: inside the Standard Library and in Hashable implementations synthesized by the compiler. For these, we currently rely on two underscored functions:

_mixInt: (Int) -> Int was designed to be used as a hashValue post-processing tool in hashed collections, improving distribution of user-generated hash values so that occupied buckets are less likely to bunch together in long chains. It does a fine job on this task; however, hash ^= _mixInt(foo) has been widely (mis)used as a rudimentary compression function, providing essentially no protection against even trivial accidental collisions.
_combineHashes: (Int, Int) -> Int is our recently introduced hash compression function, used in synthesized Hashable implementations. Its current implementation wasn't designed to protect against deliberate collisions, even if we were to seed it with a random key. Unfortunately, its stateless interface severely limits the scope of hash functions it can implement. For example, SipHash needs to maintain 256 bits of state; it is not possible to fit that into this interface.
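
The weakness of the hash ^= _mixInt(foo) pattern can be sketched as follows. `toyMix` is a hypothetical stand-in written for this example, not the stdlib's _mixInt; the point is that the collisions come from the XOR step, so they appear no matter how good the per-value mixer is.

```swift
// Hypothetical stand-in for a per-value mixing step; NOT the stdlib's _mixInt.
func toyMix(_ value: Int) -> Int {
    // Multiply by an odd constant with wrapping arithmetic, then fold the high bits down.
    let product = UInt64(bitPattern: Int64(value)) &* 0x9E37_79B9_7F4A_7C15
    return Int(truncatingIfNeeded: product ^ (product >> 32))
}

// The pattern criticized above: XOR the mixed components together.
func xorCombined(_ components: [Int]) -> Int {
    var hash = 0
    for component in components {
        hash ^= toyMix(component)
    }
    return hash
}

// XOR is commutative, so reordering components never changes the result...
print(xorCombined([1, 2]) == xorCombined([2, 1])) // true
// ...and self-inverse, so any repeated pair cancels out entirely.
print(xorCombined([7, 7]) == xorCombined([9, 9])) // true (both are 0)
```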

This PR intends to improve matters by standardizing on a high quality hash function, SipHash, for use inside the standard library and inside compiler-synthesized Hashable implementations. (SipHash is already implemented in the stdlib, but so far it has been sitting there largely unused.)

This PR does not introduce public API to help with manual Hashable implementations -- but it provides the first step towards providing one.

Requirements

Stateful hash function -- We need to be able to support hash functions with a state larger than a single pointer-sized integer.
Random hash seeding -- Hashable has been documented since Swift 1.0 to allow per-execution hash seeds, but so far we have only implemented random seeds for Strings on non-Apple platforms. Random seeding makes hash values harder to predict; we need to enable it on all platforms and all Hashable types. (In some contexts, we may want to go further and introduce local seeds for each Dictionary capacity (or even instance). But a per-execution seed is the obvious first step.)
Resiliency -- To enable us to replace/evolve the hash function in future versions of the stdlib, the hash algorithm must not be baked into types implementing Hashable; it must be implemented entirely behind resiliency boundaries. (Ideally this includes the size/layout of the state, not just the code itself.)

New `Hashable` Requirement

To satisfy these constraints, this PR adds a new requirement to Hashable that allows the hash function to be supplied externally:

protocol Hashable {
  var hashValue: Int { get } // Existing public interface
  func _hash(into hasher: inout _Hasher) // New internal hotness
}

The new requirement is not designed to be manually implemented outside the stdlib, so it remains underscored, with a default implementation. hashValue remains the public API for manual hashing. To help users provide high-quality hashValue implementations, we may expose Hasher as documented API later. In addition, we may decide to promote hash(into:) into a documented requirement, intended to replace hashValue. (These require swift-evolution proposals.)

_Hasher represents the internal state of the hash function. It provides a mutating interface to append additional bits to the hash function, mixing them into the state.

_Hasher has a public parameterless initializer and a public finalizer. The struct itself is intentionally not marked @_fixed_layout; its members are not inlinable, either, except for the generic variant of append(_:). Here is its full exported interface:

public struct _Hasher { // Size/layout isn't exposed
  public init()

  @inline(__always)
  public mutating func append<H: Hashable>(_ value: H) {
    value._hash(into: &self)
  }

  public mutating func append(bits: Int)
  // ... other integer overloads for append(bits:) ...

  public func finalize() -> Int
}
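
To illustrate the append/finalize shape of such a stateful interface, here is a rough sketch. `ToyHasher` and `toyHash` are hypothetical names, the mixing function is 64-bit FNV-1a rather than SipHash, and the seed is passed in explicitly instead of being drawn once per process, so this shows only the interface, not the PR's implementation.

```swift
// Hypothetical stand-in for _Hasher. It uses 64-bit FNV-1a, NOT SipHash, and is
// only meant to show the append/finalize shape of a stateful hasher.
struct ToyHasher {
    private var state: UInt64

    // The real hasher would draw its seed once per process; here the caller passes it in.
    init(seed: UInt64 = 0) {
        state = 0xCBF2_9CE4_8422_2325 ^ seed // FNV-1a offset basis, perturbed by the seed
    }

    // Mix the eight bytes of `bits` into the state, least significant byte first.
    mutating func append(bits: Int) {
        var remaining = UInt64(truncatingIfNeeded: bits)
        for _ in 0..<8 {
            state ^= remaining & 0xFF
            state = state &* 0x0000_0100_0000_01B3 // FNV-1a prime
            remaining >>= 8
        }
    }

    // Produce the final hash value from the accumulated state.
    func finalize() -> Int {
        return Int(truncatingIfNeeded: state)
    }
}

// Convenience wrapper: create a hasher, feed it every value, and finalize.
func toyHash(_ values: [Int], seed: UInt64) -> Int {
    var hasher = ToyHasher(seed: seed)
    for value in values { hasher.append(bits: value) }
    return hasher.finalize()
}

print(toyHash([1, 2], seed: 0) == toyHash([1, 2], seed: 0)) // true: same seed, same input
print(toyHash([1, 2], seed: 0) == toyHash([2, 1], seed: 0)) // false: order is part of the state
print(toyHash([1, 2], seed: 0) == toyHash([1, 2], seed: 1)) // false: the seed changes the result
```

Because callers only ever see append and finalize, the mixing function and the size of the state can change without touching any type that feeds values into the hasher.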

Synthesized `Hashable` Implementations

Note: _hash(into:) synthesis has been deferred to a future PR.

Synthesized implementations of Hashable now generate a _hash(into:) implementation that does the actual work of hashing a type's components. The generated hashValue implementation simply instantiates a new hasher and feeds it to _hash(into:). For example, here is how Hashable conformance is derived for a simple struct:

struct Book: Hashable {
  let title: String
  let authors: [Author]
  let pageCount: Int

  @derived var hashValue: Int { 
    return _hashValue(for: self) 
  }
  @derived func _hash(into hasher: inout _Hasher) {
    hasher.append(title)
    hasher.append(authors)
    hasher.append(pageCount)
  }
}
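
For comparison, a hand-written equivalent can be sketched with the public Hasher type and hash(into:) requirement available in Swift 4.2 and later. That API postdates this PR, so treating it as the counterpart of _Hasher and _hash(into:) is an assumption of this example. `Author` and the free function `hashValue(of:)` are hypothetical and exist only for illustration.

```swift
// Hypothetical component type; the compiler synthesizes its Hashable conformance.
struct Author: Hashable {
    let name: String
}

struct Book: Hashable {
    let title: String
    let authors: [Author]
    let pageCount: Int

    // Hand-written version of what the derived implementation does:
    // feed each stored property to the hasher, in declaration order.
    func hash(into hasher: inout Hasher) {
        hasher.combine(title)
        hasher.combine(authors)
        hasher.combine(pageCount)
    }
}

// Roughly what a derived hashValue boils down to: a fresh hasher, fed and finalized.
func hashValue<T: Hashable>(of value: T) -> Int {
    var hasher = Hasher()
    value.hash(into: &hasher)
    return hasher.finalize()
}

let a = Book(title: "Example", authors: [Author(name: "A. Writer")], pageCount: 100)
let b = Book(title: "Example", authors: [Author(name: "A. Writer")], pageCount: 100)

print(a == b)                               // true
print(hashValue(of: a) == hashValue(of: b)) // true: equal values hash equally within a process
```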

Per-execution Random Hash Seed

Introducing a per-execution hash seed fulfills a long-standing prophecy in the documentation about hash values not being stable across different executions. Indeed, not even integers return a predictable hashValue any more:

$ cat hash.swift
print("42.hashValue = \(42.hashValue)")
$ swiftc hash.swift
$ ./hash
42.hashValue = 4676625759386310310
$ ./hash
42.hashValue = -2488622482260669029

(The hash values remain stable within the same process, though.) To enable repeatable results in cases like unit testing, the 128-bit hash seed is exposed as _Hashing.secretKey; setting it to a fixed constant value disables randomization. Currently this needs to be done before the first Set or Dictionary is created. StdlibUnitTest.TestSuite in the Swift test suite automatically zeroes out the key when it is first instantiated.
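
A practical consequence, sketched under the assumption that code only needs repeatable output rather than repeatable hash values: rely on stability within a single process, and impose an explicit order wherever iteration order would otherwise leak into results. The names below are illustrative only.

```swift
// Within one process, hashing the same value twice gives the same result.
let first = 42.hashValue
let second = 42.hashValue
print(first == second) // true

// Across runs the values (and therefore Set iteration order) may differ,
// so impose an explicit order before printing or comparing output.
let tags: Set<String> = ["swift", "hashing", "siphash"]
let stableOrder = tags.sorted()
print(stableOrder) // ["hashing", "siphash", "swift"]
```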

Performance

SipHash is more reliable, but it's also a bit more complicated than _mixInt/_combineHashes. Resiliency also adds a little bit of extra overhead. At the end of the day, #14442 measured these changes to slow down our microbenchmarks by up to 80%. I expect we'll be able to optimize things further after this lands.

Resolves rdar://problem/24109692, rdar://problem/35052153

lorentey · 2018-03-01T20:05:44Z

@swift-ci please test

swift-ci · 2018-03-01T20:31:48Z

Build failed
Swift Test Linux Platform
Git Sha - e5f5ffb

lorentey · 2018-03-01T20:55:51Z

Sigh. Let's try that again, shall we?

lorentey · 2018-03-01T20:57:30Z

@swift-ci please test

milseman · 2018-03-01T21:31:07Z

@atrick @aschwaighofer is this ok usage of @effects(readonly)?

jrose-apple · 2018-03-01T21:53:18Z

> is this ok usage of @effects(readonly)?

I'm not Andy or Arnold but I say definitely not. @effects(readonly) means that the optimizer can omit the call to the function if its return value isn't used. Even if we don't do that optimization today I really wouldn't want to put it in that way. We can invent something else to avoid retain/releasing the Hasher, or just cross our fingers for MichaelG's +0 work.

jrose-apple · 2018-03-01T21:55:06Z

Question: why isn't _UnsafeHasher taken as inout, rather than taken/returned?

lorentey · 2018-03-01T21:58:46Z

The return value is actually being used here, though; the calls are chained together to prevent invalid optimizations. The hasher state itself is just a bunch of integers; @effects(readonly) prevents ARC traffic protecting components of the Hashable types, not the hasher itself.

aschwaighofer · 2018-03-01T22:05:40Z

@milseman (@atrick, @eeckstein ) Like you suspected -- I don't think we would want to annotate a function that mutates state as 'readonly' even if it happens that within our existing pipeline this might work for you.

I think we need a @effects(releasenone) and map that to sideeffectsanalysis' 'GlobalEffects.Release=false'. I think that should do the trick. @eeckstein?

lorentey · 2018-03-01T22:05:45Z

inout implies mutations, and every time I tried providing opaque mutating functions on Hasher, I measured major ARC-related slowdowns (as much as +300%, if I recall correctly). Inlineable mutating funcs were fine, but as soon as the body wasn't available, things got really slow.

I am yet to try running this on the +0 branch, but it sure would be nice to replace this with a straightforward interface.
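
For readers following along, the two interface shapes under discussion can be sketched like this. `ToyState`, `appending(_:to:)` and `append(_:to:)` are hypothetical names with a toy mixing step; the sketch only contrasts the calling conventions and says nothing about the ARC behavior measured above.

```swift
// Hypothetical hasher state used for both interface shapes; the mixing step is a toy.
struct ToyState {
    var bits: UInt64 = 0

    func mixed(with value: Int) -> ToyState {
        let input = UInt64(truncatingIfNeeded: value)
        return ToyState(bits: (bits ^ input) &* 0x0000_0100_0000_01B3)
    }
}

// Shape 1: take the state by value and return the updated state, so calls chain.
func appending(_ value: Int, to state: ToyState) -> ToyState {
    return state.mixed(with: value)
}

// Shape 2: mutate the state in place through inout.
func append(_ value: Int, to state: inout ToyState) {
    state = state.mixed(with: value)
}

// Chained style: each call consumes the previous call's result.
let chained = appending(3, to: appending(2, to: appending(1, to: ToyState())))

// inout style: one variable is updated step by step.
var inPlace = ToyState()
append(1, to: &inPlace)
append(2, to: &inPlace)
append(3, to: &inPlace)

print(chained.bits == inPlace.bits) // true: both shapes compute the same state
```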

lorentey · 2018-03-01T22:37:41Z

@aschwaighofer Ah, GlobalEffects.mayRelease() explains so much; it seems like a new @effects attribute for it would fit the bill perfectly here, and we'd probably use it elsewhere throughout the stdlib.

Replacing the fragile _UnsafeHasher hack with an inout Hasher would allow us to consider promoting hash(into:) into a public replacement for hashValue.

aschwaighofer · 2018-03-01T23:11:19Z

@lorentey I have prototyped this in the following branch:
https://github.com/aschwaighofer/swift/tree/wip_effects_release_none

eeckstein · 2018-03-01T23:26:07Z

> I think we need a @effects(releasenone) and map that to sideeffectsanalysis' 'GlobalEffects.Release=false'. I think that should do the trick. @eeckstein?

Yes, that should work.

jrose-apple

Comments on the compiler part, which mostly looks good save for the one glaring cheat of synthesizing _hash(into:) whenever we synthesize hashValue.

jrose-apple · 2018-03-01T23:37:14Z