-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SipHash-1-3 for all hashing, through a resilient hashing interface #14913
Conversation
@swift-ci please test |
Build failed |
Sigh Let's try that again, shall we. |
@swift-ci please test |
@atrick @aschwaighofer is this ok usage of |
I'm not Andy or Arnold but I say definitely not. |
Question: why isn't |
The return value is actually being used here, though; the calls are chained together to prevent invalid optimizations. The hasher state itself is just a bunch of integers; |
@milseman (@atrick, @eeckstein ) Like you suspected -- I don't think we would want to annotate a function that mutates state as 'readonly' even if it happens that within our existing pipeline this might work for you. I think we need a |
I am yet to try running this on the +0 branch, but it sure would be nice to replace this with a straightforward interface. |
@aschwaighofer Ah, Replacing the fragile |
@lorentey I have prototyped this in the following branch: |
Yes, that should work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments on the compiler part, which mostly looks good save for the one glaring cheat of synthesizing _hash(into:)
whenever we synthesize hashValue
.
/// \param hashable The parameter to the call. | ||
static CallExpr* createHasherAppendingCall(ASTContext &C, | ||
Expr* hasher, | ||
Expr* hashable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: The old style wasn't being consistent about this, but please put pointer *
s on the name rather than the type. (That is, space before, not after.)
Expr* hashable) { | ||
// hasher.appending(_:) | ||
SmallVector<Identifier, 1> argNames{Identifier()}; | ||
DeclName name(C, C.Id_appending, argNames); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: You don't need a SmallVector for this; just pass {Identifier()}
directly to DeclName's constructor. Or if you want to give it a name, declare it as a C array: Identifier argNames[] = …
. (As you do below, I see.)
Expr *args[1] = {hashable}; | ||
Identifier argLabels[1] = {Identifier()}; | ||
return CallExpr::createImplicit(C, appendingCall, C.AllocateCopy(args), | ||
C.AllocateCopy(argLabels)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these will be copied by createImplicit
, so you don't need to do another copy here. (ASan will shout at you very fast if I'm wrong about that.)
// otherwise matter. | ||
|
||
// hashDecl->copyFormalAccessAndVersionedAttrFrom(typeDecl); | ||
hashDecl->setAccess(std::max(typeDecl->getFormalAccess(), AccessLevel::FilePrivate)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very fishy to me. Why can't we use copyFormalAccessAndVersionedAttrFrom
? And can that go into the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verily, it is extremely fishy. Adding some clarification:
if (typeDecl->getFormalAccess() != AccessLevel::Private) {
hashDecl->copyFormalAccessAndVersionedAttrFrom(typeDecl);
} else {
// FIXME: We want to call copyFormalAccessAndVersionedAttrFrom here, but we
// can't copy the access level of a private type, because of the backhanded
// way we synthesize _hash(into:) in deriveHashable -- we need to make sure
// the resolver will find the new function, so it needs to be at least
// fileprivate. (The access level of synthesized members doesn't normally
// matter; they don't go through an access level check after being returned
// from the synthesizer.)
hashDecl->setAccess(AccessLevel::FilePrivate);
if (typeDecl->getAttrs().hasAttribute<VersionedAttr>()) {
hashDecl->getAttrs().add(new (C) VersionedAttr(/*implicit=*/true));
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'm pretty sure copyFormalAccessAndVersionedAttrFrom
should just take a flag that says whether you're copying from the context or from a peer declaration. We've had this problem elsewhere.
SourceLoc(), | ||
resultPat, nullptr, | ||
hashValueDecl); | ||
// hasher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put a bit more context into comments like this.
else if (auto theStruct = dyn_cast<StructDecl>(type)) | ||
hashIntoSynthesizer = &deriveBodyHashable_struct_hashInto; | ||
else | ||
llvm_unreachable(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave an actual comment here, like "attempt to derived hashInto for a type other than a struct or enum".
// would be nicer to remove the default implementation and independently | ||
// synthesize either/both of the two Hashable requirements as needed; | ||
// however, synthesizing methods (as opposed to properties) into imported | ||
// types is not yet fully supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want @DougGregor or @huonw to at least glance at this. It seems like it'll work, but I'm not at all sure that your assertion below won't fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did find one concrete case so far where this fails: in a particular compiler crasher test where multiple diagnostics preceded the one generated by the sad path here.
I'll submit my alternative implementation as a followup PR soon. The obstacles didn't seem unsurmountable at all, it just took too long for me to understand/fix all of them. The final two issues were a weird IRGen issue with _CFObject's vtable and diagnostic regressions due to canDeriveHashable
unconditionally returning true.
cc @allevato |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some scattered thoughts. Unfortunately not able to spend as much time with this PR as I'd have liked, but overall excited to see a clearly better solution taking shape.
stdlib/public/core/SipHash.swift.gyb
Outdated
for _ in 0..<${c_rounds} { | ||
_SipHashDetail._sipRound(v0: &v0, v1: &v1, v2: &v2, v3: &v3) | ||
} | ||
% for _ in range(0, c_rounds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes, do we really need to manually unroll this loop using gyb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really think so; I intend to run a set of benchmarks for it later (and a separate set for removing @always(__inline)
on compress(_:)
).
In any case, c_rounds
is 1 in SipHash13, so it doesn't really matter.
stdlib/public/core/KeyPath.swift
Outdated
return hasher | ||
.appending(value) | ||
.appending(isStoredProperty) | ||
.appending(isTableOffset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice, after the possibility of inout _UnsafeHasher
and any coordinating changes to the API are sorted out, if in the final form of the API we could make a single call at the use site. That is, hasher.append{ing}(foo, bar, baz)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that for an arbitrary number of heterogeneous arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, they'd have to be AnyHashable
-boxed. That said, if it's a transparent convenience function I'd expect the whole thing might be optimized away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some concerns about the original suggestion because of the fear that an array would need to be heap-allocated for the variable-length argument list. In some cases it looks like the compiler is smart enough to optimize that out—for example, unrolling a loop over those arguments. Is there any way we can guarantee that?
// `AnyHashable`-boxed integers, all integer values are currently required | ||
// to hash exactly the same way as the corresponding (U)Int64 value. To fix | ||
// this, we should introduce a custom AnyHashable box for integer values | ||
// that sign-extends values to 64 bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't think it's problematic that equal integer values hash the same way regardless of their bit width.
In fact, it helps us out nicely if ever we want to extend heterogeneous ==
beyond BinaryInteger
types to, say, Array<T> where T: BinaryInteger
. I absolutely agree that enabling per-execution variation in hash values for integers is strictly an improvement, but I don't think treating this as a fixme falls into the same category.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It causes potential performance issues, because we're doing a full 64-bit SipHash compress
round for every UInt8
value, when we could batch eight of them up into a single round instead. This also applies to Bool
s, small enums and the like. This effect needs to be measured to quantify it, though; it may not actually turn out to be significant.
Another potential issue is that extending everything to 32/64 bits washes away distinctions between distinct types, perhaps slightly weakening SipHash.
An alternative is to analyze wide integer values and hash them at the shortest width that can contain them -- like the code currently does for Int64
on 32-bit systems. I suspect this would have a measurable effect on Int.hashValue's performance, though -- but this needs to be verified. We'd also have to devise separate API for code that wants to feed a specific series of bytes to the hasher, with no such arbitrary preprocessing.
stdlib/public/core/Hashing.swift
Outdated
// The effects attribute is a lie; however, it enables the compiler to | ||
// eliminate unnecessary retain/releases protecting Hashable state around | ||
// calls to `_Hasher.append(_:)`. | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a whole bunch of thoughts earlier on this particular design, but I think everyone who's commented is in agreement so far about how we feel about it. Hopefully, the new @effects
annotation will fall into place swiftly.
It's not merely for the sake of elegance, but I think the whole design of the API suffers in how comprehensible it is to the reader when what's clearly a stateful object being mutated (after all, the whole point of this PR is to enable a stateful hasher--or at least that's my understanding of it) is masquerading as a value type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fun thing is that a fully immutable SipHash implementation (with a 100% valid @effect(readonly)
declaration) actually performed surprisingly well, considering it had to keep passing around five words. In use it looked exactly the same as this one, except it actually was a real value type.
However, its performance relied on the size/layout of the hasher state being part of the ABI -- hence the clever replacement. Struct resiliency may cause performance problems for the @effects(releasenone)
mutating case, too -- we'll have to find out.
static func overrideHashingKey() { | ||
if !hashingKeyOverridden { | ||
// FIXME(hasher): This has to run before creating the first Set/Dictionary | ||
_Hashing.secretKey = (0, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously, we need some way to set a deterministic key for testing purposes. I'm not expert in this area by miles, but is it preferable to expose a public, undocumented API which a nefarious third-party could conceivably set at will, or would it be safer to make this, say, a build option for the compiler (say, debug only)? (I genuinely don't know.) I recognize that you didn't create this property originally, but now this clearly "does more."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! This is a really interesting issue with no fully satisfying solution. We somehow need to pass a flag to the runtime during early process startup to override the random value.
A compile-time option is one way to do this, but I think I'd prefer to just use an environment variable. It is simple, it doesn't need a recompilation, and influencing runtime behavior via the environment has oodles of precedent (not all good, mind you 🙂).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option to consider is to store a copy of the key in every hash table instance. (That may be worth doing anyway to explicitly prepare for the possibility of per-instance keys.) Doing this would eliminate the awful effects of someone changing the master key while a hash table already exists, even if we keep the key accessible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pure Swift, AFAIK we don't have a way to guarantee that we could read that environment variable and cache its value before any other Swift code has executed. Can we play some static initialization tricks with the C portion of the standard library to do that instead? Like @xwu I understand the need for this behavior but the possibilities for abuse do worry me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keys are initialized the runtime before main()
, in a C++ global static constructor. The code currently uses C++'s Mersenne Twister implementation to generate the two halves of the key. Although we can't run Swift code there, but we could easily call getenv
or look up some special symbol, or pretty much do whatever we want:
We could also make use of HashingDetail.fixedSeedOverride
, if we want to keep the original key around for some reason. (The fixed seed override is currently unused.)
It's unlikely we can protect the key from sufficiently motivated Swift code running within the process; thankfully we don't really need to do that -- the key's role is merely to protect hash table performance against data that is fed into the process from external sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eww, is there any alternative to static constructors? Also important, does that have a corresponding static destructor?
edit: grammar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of! Do you foresee issues?
We don't have destructors, but these are integer values -- AFAIK, we don't have to run any cleanup for them at exit.
While we have the tests passing, let's see if anyone relies on specific hash values: @swift-ci please test source compatibility For build progress, see https://ci.swift.org/view/Pull%20Request/job/swift-PR-source-compat-suite/819/ |
stdlib/public/core/Hashing.swift
Outdated
self._core = Core() | ||
} | ||
|
||
@inline(__always) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@_inlineable
or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't @inline(__always)
imply @_inlineable
? (Still, it is a good idea to make it explicit.)
28effb8
to
7edec4d
Compare
7edec4d
to
64679fb
Compare
@swift-ci please test |
@swift-ci please benchmark |
Build failed |
Build failed |
Build comment file:Optimized (O)Regression (46)
Improvement (14)
No Changes (325)
Unoptimized (Onone)Regression (44)
Improvement (16)
No Changes (325)
Hardware Overview
|
@lorentey I'd really love having |
@swift-ci please smoke test |
OK, the Linux failure uncovered an issue with my synthesizing Landing the rest of this change will unblock some stdlib work, so it seems worth doing. |
Introduce _Hasher, representing an opaque hash compression function. Add the method _hash(into:) as a Hashable requirement, decoupling the choice of hash function from Hashable's implementation. The default implementation of _hash(into:) has a default implementation that simply feeds hashValue to the hasher. Add _hash(into:) implementations for the default integer types. Note that Int.hashValue does not return self anymore. Add struct _LegacyHasher, emulating Swift 4.1 hashes in the new interface.
This switches the primary hashing interface from hashValue to _hash(into:).
64679fb
to
233beb0
Compare
@swift-ci please smoke test |
Beyond switching hashing algorithms, this also enables per-execution hash seeds, fulfilling a long-standing prophecy in Hashable’s documentation. To reduce the possibility of random test failures, StdlibUnittest’s TestSuite overrides the random hash seed on initialization. rdar://problem/24109692 rdar://problem/35052153
233beb0
to
e03cb13
Compare
@swift-ci please smoke test |
@swift-ci please test |
Compiler support is blocked by https://bugs.swift.org/browse/SR-7156 -- synthesizing However, I believe the stdlib changes are ready to merge now. |
@hartbit sounds like a good pitch! |
This PR implements resilient hashing for stdlib types and types with synthesized
Hashable
implementations. It changes the standard hash function to SipHash-1-3 with a per-execution random seed.Background
Hash tables are probabilistic data structures; their expected performance depends on (implicit and largely unknown) assumptions on the statistical distribution of their keys and the properties of the hash function that is used to derive bucket indices for individual key values.
Weak hash functions can enable accidental (or deliberate) skewing of the key distribution in such a way that collisions become more likely. In extreme cases, lookup performance may become linear rather than the promised O(1), which may turn other operations using lookups quadratic rather than linear, etc. This is not a hypothetical concern: there have been major cases of successful denial of service attacks targeted against hash tables implemented in various programming languages and libraries.
In general purpose collection types like Swift's
Set
orDictionary
, we can't meaningfully constrain the distribution of keys. Our only option to protect against hash collision cascades is to choose a hash function that's strong enough to handle all key distributions, ideally even those that deliberately try to defeat it.The Status Quo
Our current approach is not particularly great in this regard: the choice of the hash function is mostly in the hands of the programmer implementing
hashValue
, and we provide no public API to guide them. Implementing a good hash function requires careful consideration and specialist knowledge; it takes time and effort they clearly can't be expected to invest.To fix this, we should start by improving cases where
hashValue
is under Swift's control: inside the Standard Library and inHashable
implementations synthesized by the compiler. For these, we currently rely on two underscored functions:_mixInt: (Int) -> Int
was designed to be used as ahashValue
post-processing tool in hashed collections, improving distribution of user-generated hash values so that occupied buckets are less likely to bunch together in long chains. It does a fine job on this task; however,hash ^= _mixInt(foo)
has been widely (mis)used as a rudimentary compression function, providing essentially no protection against even trivial accidental collisions._combineHashes: (Int, Int) -> Int
is our recently introduced hash compression function, used in synthesizedHashable
implementations. Its current implementation wasn't designed to protect against deliberate collisions, even if we were to seed it with a random key. Unfortunately, its stateless interface severely limits the scope of hash functions it can implement. For example, SipHash needs to maintain 256 bits of state; it is not possible to fit that into this interface.This PR intends to improve matters by standardizing on a high quality hash function, SipHash, for use inside the standard library and inside compiler-synthesized
Hashable
implementations. (SipHash is already implemented in the stdlib, but so far it has been sitting there largely unused.)This PR does not introduce public API to help with manual
Hashable
implementations -- but it provides the first step towards providing one.Requirements
Hashable
has been documented since Swift 1.0 to allow per-execution hash seeds, but so far we have only implemented random seeds for Strings on non-Apple platforms. Random seeding makes hash values harder to predict; we need to enable it on all platforms and allHashable
types. (In some contexts, we may want to go further and introduce local seeds for each Dictionary capacity (or even instance). But a per-execution seed is the obvious first step.)Hashable
; it must be implemented entirely behind resiliency boundaries. (Ideally this includes the size/layout of the state, not just the code itself.)New
Hashable
RequirementTo satisfy these constraints, this PR adds a new requirement to
Hashable
that allows the hash function to be supplied externally:The new requirement is not designed to be manually implemented outside the stdlib, so it remains underscored, with a default implementation.
hashValue
remains the public API for manual hashing. To help users provide high-qualityhashValue
implementations, we may exposeHasher
as documented API later. In addition, we may decide to promotehash(into:)
into a documented requirement, intended to replacehashValue
. (These require swift-evolution proposals.)_Hasher
represents the internal state of the hash function. It provides a mutating interface to append additional bits to the hash function, mixing them into the state._Hasher
has a public parameterless initializer and a public finalizer. The struct itself is intentionally not marked@_fixed_layout
; its members are not inlinable, either, except for the generic variant ofappend(_:)
. Here is its full exported interface:Synthesized
Hashable
ImplementationsNote
_hash(into:)
synthesis has been deferred to a future PR.Synthesized implementations ofHashable
now generate a_hash(into:)
implementation that does the actual work of hashing a type's components. The generatedhashValue
implementation simply instantiates a new hasher and feeds it to_hash(into:)
. For example, here is howHashable
conformance is derived for a simple struct:Per-execution Random Hash Seed
Introducing a per-execution hash seed fulfills a long-standing prophecy in the documentation about hash values not being stable across different executions. Indeed, not even integers return a predictable
hashValue
any more:$ cat hash.swift print("42.hashValue = \(42.hashValue)") $ swiftc hash.swift $ ./hash 42.hashValue = 4676625759386310310 $ ./hash 42.hashValue = -2488622482260669029
(The hash values remain stable within the same process, though.) To enable repeatable results in cases like unit testing, the 128-bit hash seed is exposed as
_Hashing.secretKey
; setting it to a fixed constant value disables randomization. Currently this needs to be done before the firstSet
orDictionary
is created.StdlibUnitTest.TestSuite
in the Swift test suite automatically zeroes out the key when it is first instantiated.Performance
SipHash is more reliable, but it's also a bit more complicated than
_mixInt
/_combineHashes
. Resiliency also adds a little bit of extra overhead. At the end of the day, #14442 measured these changes to cost up to +80% performance on our microbenchmarks. I expect we'll be able to optimize things further after this lands.Resolves rdar://problem/24109692, rdar://problem/35052153