Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improvements to serializer #21514

Merged
merged 1 commit into from
Apr 27, 2017
Merged

improvements to serializer #21514

merged 1 commit into from
Apr 27, 2017

Conversation

JeffBezanson
Copy link
Member

0.6 will necessarily change the serialization format, so I'm taking the opportunity to do some other improvements.

This improves the design a bit to remove weird hacks, fix a TODO item, and allow using backreferences for all objects, particularly Symbols and DataTypes. One common inefficiency before was serializing types repeatedly for all their instances.

Next, this preserves shared references for all mutable values. There is some potential for slowdowns there, but I think it's worth it for being both more accurate and more efficient when there are many references to a large object.

Also adds some more common values to the tags table, and adds tags for smaller string lengths and backrefs.

This changes the format, but is API compatible except in a rare corner case (when you want to define serialize and support shared references, but also want to use something other than your value's own type as the identifying tag).

@nanosoldier runbenchmarks("serialization", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels

end
resolve_ref_immediately(s, ndt)

# deserialize_typename can assume unwrap_unionall(wrapper).parameters is set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't it also assume that the layout, mutabl, and types fields are valid, and may start propagating bad assumption about ndt as soon as the ccall returned the incomplete object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code as written doesn't directly assume that --- i.e. deserialize_typename itself doesn't look at that info. We just need to make sure the type isn't used for anything serious until it's initialized. Hopefully null pointer accesses would indicate if we violate that. The boolean flags could be initialized earlier though, at least.

src/datatype.c Outdated
@@ -79,14 +79,12 @@ jl_datatype_t *jl_new_abstracttype(jl_value_t *name, jl_datatype_t *super, jl_sv
return jl_new_datatype((jl_sym_t*)name, super, parameters, jl_emptysvec, jl_emptysvec, 1, 0, 0);
}

jl_datatype_t *jl_new_uninitialized_datatype(void)
JL_DLLEXPORT jl_datatype_t *jl_new_uninitialized_datatype(void)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will also need to be exported in julia_internal.h

@vtjnash
Copy link
Member

vtjnash commented Apr 24, 2017

ref #308

@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from b1b465d to ca310e8 Compare April 24, 2017 20:18
@JeffBezanson
Copy link
Member Author

Hmm very interesting failure here. Log backed up: https://gist.github.com/JeffBezanson/e47083278bf39cdfbce276a5e044055a

@vtjnash
Copy link
Member

vtjnash commented Apr 25, 2017

From the log failure, it looks like something must have mutated some const field of a DataType type after allocating it............

@JeffBezanson
Copy link
Member Author

Yeah, I could believe that. I'm convinced I can hack around it somehow.

@vtjnash
Copy link
Member

vtjnash commented Apr 25, 2017

I'm not convinced that we should. The existence of this function means that any attempt to inspect the intermediate state of the deserializer (in the debugger, via profiling, if it has an IO error during deserialization) will immediately corrupt the entire process.

From experience with solving this for the incremental serializer in dump.c, this approach is also simply not workable – the ordering of what it needs to try to do is simply not feasible to do recursively. Roughly speak, the issue is that you need to defer ever sending the types field until the deserialization is completely finished (otherwise it risks running into a back-reference that triggers needing to know the layout for the type before you know the fields). However, that not really feasible, since a recursive serializer often needs to actually introspect those fields quite often.

Just to be clear though, I'm really happy with the rest of the changes here. I'm just not seeing the motivation for introducing the function that returns uninitialized datatypes. It just sounds really likely to introduce lots of other memory corruption and other issues. (based on my experience with having to solve the dual of this problem in dump.c)

@JeffBezanson
Copy link
Member Author

otherwise it risks running into a back-reference that triggers needing to know the layout for the type before you know the fields

How does this actually happen, though? It seems that would require e.g. an instance of the new DataType to exist inside the types field, which I believe is impossible for a wrapper. I can see that this might be unworkable for arbitrary DataTypes, but since this is used only for the wrapper field of a TypeName it seems like it might be possible.

Anyway, I can certainly put the TypeName/wrapper stuff back how it was, and keep the rest of this.

@vtjnash
Copy link
Member

vtjnash commented Apr 25, 2017

It seems that would require e.g. an instance of the new DataType to exist inside the types field

Is this any more complicated than:

type T
  x::T
end

The secondary problem it runs into is a case like:

type T
  x::Vector{T}
end
deserialize(...)::Vector{T}

which then ends up requiring deserializing the back-reference before the object itself.

@JeffBezanson
Copy link
Member Author

type T
  x::T
end

This one is just a garden variety reference cycle --- T.types points directly back to T, without needing to look at the type layout. This case actually works on this branch, but fails on master. Here's a test case (run on master):

julia> s = Symbol("#223");

julia> @eval type $s <: Function
        x::$s
       end

julia> b=IOBuffer();

julia> serialize(b, eval(s))

julia> seekstart(b);

julia> deserialize(b)
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] deserialize_datatype(::SerializationState{Base.AbstractIOBuffer{Array{UInt8,1}}}) at ./serialize.jl:871

More complicated field types could be a problem, but in all of these cyclic cases everything is a pointer, which we (somewhat by accident) correctly conclude from layout being NULL. So I think this is at least an improvement over what we have now.

@vtjnash
Copy link
Member

vtjnash commented Apr 25, 2017

I understand that, and can make it arbitrarily more complicated if you would like. But I'm not sure that "accidentally getting this case right" is necessarily a positive feature, "when probably getting similar cases catastrophically wrong" is the tradeoff. Changing this because we accidentally getting this case right is also pretty likely to get in the way of merging #18632 (unboxing more immutables) which needs to change the implementation of the layout computation.

@JeffBezanson
Copy link
Member Author

Well, I don't want you to make it arbitrarily more complicated, but I'd settle for an example of a case that is catastrophically wrong.

@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from ca310e8 to a89d6b3 Compare April 25, 2017 20:04
@JeffBezanson
Copy link
Member Author

Ok, controversial part removed.

@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from a89d6b3 to f475c7d Compare April 26, 2017 03:11
@JeffBezanson
Copy link
Member Author

A couple more improvements: reorganized the tags table, added a short (UInt16) backref, added short-Int64 (for Int64s with values in Int32 range), and split the Int tag to Int32- and Int64-specific ones, which will make files slightly more portable between 32- and 64-bit machines.

@JeffBezanson JeffBezanson added this to the 0.6.0 milestone Apr 26, 2017
@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from f475c7d to 9dca3ab Compare April 26, 2017 16:51
@kshyatt kshyatt added the io Involving the I/O subsystem: libuv, read, write, etc. label Apr 26, 2017
@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from 9dca3ab to 28e2d51 Compare April 26, 2017 23:13

const NTAGS = length(TAGS)

function sertag(v::ANY)
ptr = pointer_from_objref(v)
ptags = convert(Ptr{Ptr{Void}}, pointer(TAGS))
@inbounds for i in 1:NTAGS
ptr == unsafe_load(ptags,i) && return (i+1)%Int32
@inbounds for i in 1:(NTAGS-78) # constant ints & reserved slots never returned here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any convenient way to avoid hardcoding the 78?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I can metaprogram my way out of that.

@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from 28e2d51 to c8a2539 Compare April 27, 2017 05:49
@JeffBezanson
Copy link
Member Author

@nanosoldier runbenchmarks("serialization", vs = ":master")


# tags >= this just represent themselves, their whole representation is 1 byte
const VALUE_TAGS = sertag(())
const ZERO_TAG = sertag(0)
const ZERO32_TAG = Int32(NTAGS-65)
const ZERO64_TAG = Int64(NTAGS-32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could make use of n_int_literals here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

down with magic numbers!

@JeffBezanson JeffBezanson force-pushed the jb/improve_serialize branch from c8a2539 to bd804ee Compare April 27, 2017 06:13
@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels

@JeffBezanson JeffBezanson merged commit 8fcbd74 into master Apr 27, 2017
@JeffBezanson JeffBezanson deleted the jb/improve_serialize branch April 27, 2017 13:36
@quinnj
Copy link
Member

quinnj commented Apr 29, 2017

Hmmm....I thought I removed the dependency on past serializer versions. I'll take a look; thanks for the ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants