store/retrieve dependencies in .ji file #12445

Closed
wants to merge 2 commits

Conversation

@stevengj (Member) commented Aug 3, 2015

As discussed in #12259, as a prerequisite to automated image recompilation, this stores all of the dependencies of a module in the .ji file:

  • The modules that were depended on were already stored in the .ji header.
  • Now the .ji header also stores the pathnames of all files that were included by the compiled module and its submodules (but not files included by imported modules).
  • There is also a new include_dependency(filename) function to "manually" declare a dependency of the module on some other file that is not an included .jl file (usage sketch below).
  • Base.cache_dependencies(cachefile) returns (modules, files), where modules is an array of the module names and files is an array of the filenames stored in the cache file.
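
As a usage sketch (module Foo and the data file are invented for illustration; include_dependency and Base.cache_dependencies are the functions added here):

module Foo

include("utils.jl")                 # include'd .jl files are recorded automatically
include_dependency("data/foo.csv")  # manually declare a non-.jl dependency

end

# A build tool can later query the stored dependencies:
modules, files = Base.cache_dependencies("/path/to/Foo.ji")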

To do:

  • document include_dependency
  • deal with endian-ness mismatch in cache_dependencies
  • test case

On the second point, I noticed something odd: the write_int32 functions etc. in src/dump.c read and write in little-endian order. Update: dump.c has been changed to use bigendian (network) order for its metadata.

@stevengj added the compiler:precompilation (Precompilation of modules) label Aug 3, 2015
@StefanKarpinski (Member)

Would it make sense for the format to be flexible enough to store a variety of properties of each dependency? E.g. have room for content hashes and timestamps?

@stevengj (Member, Author) commented Aug 3, 2015

@StefanKarpinski, I don't see why timestamps need to be stored, because the only relevant timestamps are those of the files at the moment the .ji is loaded. That is, the procedure for load(cachefile) is (see the sketch after this list):

  • read T = filesystem_timestamp(cachefile)
  • for dep in dependencies(cachefile) check filesystem_timestamp(dep) <= T.
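
A minimal sketch of that check (stale is a hypothetical helper; mtime and Base.cache_dependencies are the real calls):

# Hypothetical staleness check built on this PR's API: the cache is stale
# if any recorded dependency was modified after the cache file was written.
function stale(cachefile)
    T = mtime(cachefile)
    modules, files = Base.cache_dependencies(cachefile)
    return any(dep -> mtime(dep) > T, files)
end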

Regarding hashes, we could always add that information later if we need it, assuming we add a header to the .ji file that includes format information (#12444). But right now it seems likely from the discussion that any automated build system will be based on timestamps (like make and friends, or like .pyc files).

(I agree that the content should be flexible, but that's why we need a version header in the file as per #12444.)

@lobingera

@stevengj
I just read this by coincidence, but since make is mentioned: very strange things can happen with timestamps, e.g. on remotely mounted filesystems on computers with non-synchronised clocks. I recommend going beyond timestamps and including something like a hash.

@StefanKarpinski (Member)

Yes, tracking the timestamps may not be necessary; I was mainly thinking of hashes. I'm not sure that the choice needs to be either/or – you could have a system that can figure out what needs to be rebuilt from timestamps or from hashes if you happen to be on a system where timestamps are unreliable.

If the mechanism for associating data with dependencies is dict-like, then you don't need to bump version numbers for the .ji file format to allow flexibility. It seems better not to change the data format every time you want to change what data you store. For example, you could decide at some point that SHA1 is no longer a good choice for hashing and start using SHA256, but continue to handle .ji files that have SHA1 hashes.

@stevengj (Member, Author) commented Aug 3, 2015

Python has used timestamps to invalidate .pyc files for decades, with no apparent problems. (Checking timestamps is a lot faster than computing hashes; remember that this has to be checked every time a cached module is imported, potentially for a large number of files when you include the recursive dependencies.)

@stevengj (Member, Author) commented Aug 3, 2015

@StefanKarpinski, storing the dependencies in JSON format (or any similar easily parseable dict-like format) seems like serious overkill here. And worrying too much about backward compatibility of .ji files doesn't make much sense to me. They are just cache files; regenerating them in the rare cases where the format changes doesn't seem like a serious problem.

@stevengj (Member, Author) commented Aug 3, 2015

Two reasons why unreliable network timestamps are not a big deal for this kind of application (as the Python experience has shown): (1) both the modules and the cache files will normally live on the same filesystem (wherever .julia lives) and (2) the time deltas when the cache is invalidated will typically be huge, so that small inaccuracies won't matter. Think of how this is used: I (a) install package Foo, (b) create the cache file, and then (c) update the package source later on, and then the cache is invalidated the next time I (d) import Foo. The typical time difference between (b) and (c) will be days or weeks for end users, and even for developers hacking on Foo it will typically be minutes.

@stevengj (Member, Author) commented Aug 3, 2015

(Even for software development, is there any widely used make replacement that is not timestamp-based?)

@pao (Member) commented Aug 3, 2015

SCons (flexible, with md5sum+timestamp as the default). It's not exactly a speed demon, though.

My experiences with systems without real-time clocks have been recounted elsewhere. It is a niche, but it's a niche which includes the Raspberry Pi; do with that what you will.

@stevengj (Member, Author) commented Aug 3, 2015

Where unreliable timestamps typically cause problems for make-like systems (make, cmake, ninja, cake, and so on; SCons seems to be the exception that proves the rule here) is when the build system itself generates or modifies source files. That happens frequently in general-purpose build systems, but isn't going to happen in Base.compile.

@tkelman (Contributor) commented Aug 3, 2015

Just because we haven't seen it yet doesn't mean there won't be a good use case in the future where packages will want to dynamically generate Julia code and save it to the filesystem.

I've had clock skew complaints from make and cmake just due to frequently working over scp on a lot of Julia build-related files across computers whose clocks differ by several minutes. I don't think accounting for flexibility and extensibility would be a bad thing here.

@stevengj (Member, Author) commented Aug 3, 2015

And again, if we ever find that we want a hash-based cache-invalidation system (which I doubt), and someone writes it, then we just bump the .ji version number in the header.

It just doesn't make sense to me to write the associated code (to compute SHAs and store them) now, when there's no immediate need for it and it can be easily added later. (Do we even have an SHA function in Base somewhere? I suppose there must be one in libgit somewhere.)

@stevengj (Member, Author) commented Aug 3, 2015

@tkelman, flexibility and extensibility are added by sticking a version number in the .ji header.

(Packages dynamically generate Julia code now, during Pkg.build... but do we really want to encourage them to generate code and save it in the filesystem at require-time? I would seriously question the sanity of doing this, as opposed to ordinary metaprogramming.)

@lobingera

@stevengj
both the modules and the cache files will normally live on the same filesystem (wherever .julia lives)
There were already questions on julia-users and julia-dev about having a read-only Julia installation of packages/modules. In that case the .ji files have to live elsewhere anyway, so there is a non-vanishing probability that they live on different filesystems.
Also, even if a JSON dict seems like overkill and could be implemented later: we should learn a little from history and use standard file formats without exception (where applicable). The danger of another micro-parser buffer-overflow story is just around the corner.

Which, by the way, raises a question: do we need, or can we apply, code signing for the .ji files, so that code injection can be prevented?

@tkelman (Contributor) commented Aug 3, 2015

Well I agree with #12444 entirely. Enough people are clamoring that "timestamps are unreliable" that "there's no immediate need for it" remains controversial, and presumably someone wants to actually follow through with implementing something fancier.

Do we even have an SHA function in Base somewhere? I suppose there must be one in libgit somewhere.

There's SHA.jl (not currently in base) which BinDeps uses. Would have to check whether libgit2 exposes anything for general-purpose file hashing.
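
For reference, a hedged sketch of what per-file hashing with SHA.jl might look like (filehash is an invented helper; sha256 is SHA.jl's exported function):

using SHA

# Invented helper: hash a dependency's contents; the .ji header could store
# this digest next to each pathname.
filehash(path) = bytes2hex(sha256(readall(path)))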

do we really want to encourage them to generate code and save it in the filesystem at require-time? I would seriously question the sanity of doing this, as opposed to ordinary metaprogramming.

Ordinary metaprogramming would have to be coupled with serialization and/or IPC if you want to make generated code usable as callbacks from a non-Julia language. Going through the filesystem may be simpler for some use cases.

@stevengj (Member, Author) commented Aug 3, 2015

Some points:

  • The initial cache-rebuilding system will almost certainly be timestamp-based, for both speed and simplicity — it doesn't make sense to implement something slower and more complicated until we try it out and see whether this is a problem in practice for our .pyc-like caching system, not just in general-purpose build systems (and even there it is rare enough that most make-like systems don't bother with hashes).
  • Adding it now would entail a lot of additional complexity (e.g. folding SHA.jl into Base).
  • Adding it later would be painless (at least, no more work than adding it now), because you would just invalidate the old .ji files.

The question is, why is it necessary to bite the bullet and add hashing now? Which of the above points do you disagree with?

@stevengj (Member, Author) commented Aug 3, 2015

(I don't understand your claim that "ordinary metaprogramming would have to be coupled with serialization and/or IPC if you want to make generated code usable as callbacks". eval-ed or include_string-ed code is no different from code include-ed from a file. It is perfectly possible to make a cfunction callback that is written with metaprogramming.)

@tkelman (Contributor) commented Aug 3, 2015

It is perfectly possible to make a cfunction callback that is written with metaprogramming.

Of course it is. But if you then want to AOT compile that generated callback code into a shared library on disk that another language could use, and periodically make updates to the generated code and recompile online, and execution and generation are not necessarily happening on the same machine, it's not clear any currently existing system is going to handle that well.

Or if other languages are responsible for generating the Julia source code in the first place, such as the Excel-JuMP links in OpenSolver, for example.

I think people are mostly disagreeing with point 1, anticipating that this will be a problem in practice. But I don't feel strongly enough about it to say that we shouldn't evaluate this to see how well it works under basic usage.

@stevengj (Member, Author) commented Aug 3, 2015

Here are the options:

  • Build an initial .ji recompilation system based on timestamps. See how it works before moving to anything more complicated.
  • Refuse to accept any .ji recompilation unless it is based on hashes. If that is the case, you need to merge a hash algorithm into Base before any work can proceed.

If we choose the first option, then there is no point to adding hashes to the .ji file now — they can always be added later by invalidating the cache. (Currently every Julia build invalidates the cache anyway, because the cache files depend on a timestamp of the Core and Base packages.)

@blakejohnson (Contributor)

I agree with @stevengj: start with a simple design that has a clear path for evolving it if necessary.

@ScottPJones (Contributor)

Do you really need something as slow and fancy as SHA* or md5?
In practice, I found using CRC-32 worked fine for cache invalidation.

@timholy (Member) commented Aug 4, 2015

I just tested CRC32 on every *.jl file in my .julia/v0.4 directory (something like 2500 files). First execution was 1.2 s, second execution was 0.36 s. For comparison, on this system using Gadfly now runs in about 4.1 s (precompiled, of course).

Not negligible compared to package-loading times, but also not dominant. I'm not sure how many of those 2500 files get loaded by using Gadfly, but I bet it's a pretty sizable fraction.

@ScottPJones (Contributor)

Which CRC-32 implementation did you use? I've seen one in a CRC32.jl package and one in CRC.jl.
CRC32.jl looks pretty simple compared to the C version I used to use, and CRC.jl looks pretty complex; I'm not sure how it would perform. Since it was for cache invalidation, I was able to play tricks like always doing 8 or 16 bytes at a time and, for the last chunk of fewer than 8/16 bytes (if any), just padding with 0 bytes.

@timholy (Member) commented Aug 4, 2015

Just the Unix crc32 utility: fls=$(find . -iname "*.jl") && time crc32 $fls. I should point out that this also included the time to print all the results, and that could have dominated the timing.

@ScottPJones (Contributor)

That's also including directory scanning operations and time reading the files.
If it is done as part of the compilation process, then you have to load the files anyway to compile them, so that wouldn't happen twice. That also might dominate.

@timholy (Member) commented Aug 4, 2015

using CRC32

# Recursively collect the paths of all *.jl files under base into the vector jl.
function scanjl!(jl, base)
    fls = readdir(base)
    for fl in fls
        flfull = joinpath(base, fl)
        if isdir(flfull)
            scanjl!(jl, flfull)
        elseif isfile(flfull) && endswith(flfull, ".jl")
            push!(jl, flfull)
        end
    end
    jl
end

# Compute the CRC-32 of each file's contents; sum the checksums so the loop
# returns a value.
function docrc(jl)
    s = 0.0
    for fl in jl
        c = crc32(readall(fl), UInt32(0))
        s += c
    end
    s
end

@time jl = scanjl!(ByteString[], "/home/tim/.julia/v0.4")
@show length(jl)
@time docrc(jl)

Results:

julia> include("/tmp/docrc.jl")
  4.262114 seconds (2.58 M allocations: 143.339 MB, 1.24% gc time)
length(jl) = 2772
  0.158719 seconds (83.18 k allocations: 19.643 MB, 4.24% gc time)
5.972116115294e12

CRC32 computation seems pretty fast. (But that scanjl! is hilariously slow compared to the Unix find command.)

@timholy (Member) commented Aug 4, 2015

If it is done as part of the compilation process, then you have to load the files anyway to compile them, so that wouldn't happen twice.

No, the point is that even when X is precompiled and you just want to using X, you also have to readall each *.jl file each time; otherwise, how do you tell whether the programmer changed one of the files that X depends on? This is @stevengj's main concern, I think. But my tests suggest that if you already have the list of *.jl files, the read & CRC32 step isn't really all that bad.

Of course, if our packages become even 10x faster to load someday, then this will become a big performance hit.

@stevengj (Member, Author) commented Aug 4, 2015

Performance aside, I think it's also unnecessary complexity for a first-pass build system, considering that checksums/hashes can easily be added on later. (Why should this work wait on us deciding on and implementing/merging a hash algorithm?)

Meanwhile, @JeffBezanson, do you have any comment on the endian-ness question at the top? It seems like those read/write functions in dump.c were written in the distant past (< 2011), so probably only you know why they were written in this way.

@vtjnash (Member) commented Aug 4, 2015

I've had clock skew complaints from make and cmake just due to frequently working over scp on a lot of Julia build-related files across computers whose clocks differ by several minutes. I don't think accounting for flexibility and extensibility would be a bad thing here.

Me too. My solution has typically been to just fix the clocks, but I guess I've usually had the luxury of admin access on almost every machine I've had to work with.

Another option is to just include a (gzip?) copy of all sources in the .ji output. This could be useful for generating debug info anyway, and it is probably not significantly more costly than computing a CRC.

@timholy (Member) commented Aug 4, 2015

Another option is to just include a (gzip?) copy of all sources in the .ji output. This could be useful for generating debug info anyway, and it is probably not significantly more costly than computing a CRC.

Just to check, you're proposing a byte-by-byte comparison as the test for source-code change?

@timholy (Member) commented Aug 4, 2015

@stevengj, I would go with native endianness; after all, these are files meant to be executed on the local machine, not a data-storage format. Probably add ENDIAN_BOM right after the magic.

IO is a bit of a mess with regard to endianness, but there has also been persistent interest in viewing the problem this way: http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html.
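
A minimal sketch of that suggestion (ENDIAN_BOM is a real Base constant; the surrounding header layout is assumed):

# Writer: put the byte-order mark right after the magic bytes.
write(io, ENDIAN_BOM)               # 0x04030201 on a little-endian machine

# Reader: detect (and reject, or byte-swap) a foreign byte order.
bom = read(io, UInt32)
bom == ENDIAN_BOM || error("cache file written with foreign endianness")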

@stevengj (Member, Author) commented Aug 4, 2015

Guys, think of this from a software-development perspective. By initially making the build system timestamp-based, we can process the work in smaller parallel PRs: a PR like this one to cache the dependencies, a PR to do the auto-rebuild based on timestamp comparisons, a PR to implement a checksum algorithm, and finally a PR to add checksum comparisons.

Can we have a separate issue to argue about which (if any) checksum algorithm to use, and then a separate PR to implement it? Right now this PR has been completely overwhelmed by discussions that are not directly relevant to caching the dependencies, which is what I'm trying to do here.

@stevengj (Member, Author) commented Aug 4, 2015

@timholy, I agree about adopting native endianness. I was worried because I thought that there might be some other place in the code that depended on the endian choice in dump.c, but it looks like that is not the case. I tried a fresh patch to change dump.c to bigendian, and the tests pass, so it looks like only these write_* and read_* functions matter:

--- a/src/dump.c
+++ b/src/dump.c
@@ -134,44 +134,44 @@ static jl_array_t *datatype_list=NULL; // (only used in MODE_SYSTEM_IMAGE)

 static void write_int32(ios_t *s, int32_t i)
 {
-    write_uint8(s, i       & 0xff);
-    write_uint8(s, (i>> 8) & 0xff);
-    write_uint8(s, (i>>16) & 0xff);
     write_uint8(s, (i>>24) & 0xff);
+    write_uint8(s, (i>>16) & 0xff);
+    write_uint8(s, (i>> 8) & 0xff);
+    write_uint8(s, i       & 0xff);
 }

 static int32_t read_int32(ios_t *s)
 {
-    int b0 = read_uint8(s);
-    int b1 = read_uint8(s);
-    int b2 = read_uint8(s);
     int b3 = read_uint8(s);
+    int b2 = read_uint8(s);
+    int b1 = read_uint8(s);
+    int b0 = read_uint8(s);
     return b0 | (b1<<8) | (b2<<16) | (b3<<24);
 }

 static void write_uint64(ios_t *s, uint64_t i)
 {
-    write_int32(s, i       & 0xffffffff);
     write_int32(s, (i>>32) & 0xffffffff);
+    write_int32(s, i       & 0xffffffff);
 }

 static uint64_t read_uint64(ios_t *s)
 {
-    uint64_t b0 = (uint32_t)read_int32(s);
     uint64_t b1 = (uint32_t)read_int32(s);
+    uint64_t b0 = (uint32_t)read_int32(s);
     return b0 | (b1<<32);
 }

 static void write_uint16(ios_t *s, uint16_t i)
 {
-    write_uint8(s, i       & 0xff);
     write_uint8(s, (i>> 8) & 0xff);
+    write_uint8(s, i       & 0xff);
 }

 static uint16_t read_uint16(ios_t *s)
 {
-    int b0 = read_uint8(s);
     int b1 = read_uint8(s);
+    int b0 = read_uint8(s);
     return b0 | (b1<<8);
 }

@carnaval (Contributor) commented Aug 4, 2015

On the endianness thing, I think it's just better to pick one as the definition of the storage format.

See for example https://lkml.org/lkml/2015/4/22/628. It's about a network protocol, but it applies to storage formats too. Even if it's rare to exchange .ji files, we should still make it possible given the negligible cost.

@stevengj (Member, Author) commented Aug 4, 2015

@carnaval, that's fine, but if we are going to pick a canonical endianness for I/O, I think it should be bigendian. Otherwise I'll need to re-implement the read_int32 functions etc. in Julia; it's easy, of course, but it is also silly to have these when we have hton functions for canonicalizing the byte order already.

(And yes @timholy, I know that you can read them byte-by-byte as in read_int32. That's still a lot longer to write than ntoh(read(s, UInt64)).)
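
For illustration (ntoh and read are real Base functions; the byte-by-byte version mirrors what read_int32 does in dump.c):

# One line, given a canonical big-endian (network) byte order:
n = ntoh(read(s, UInt32))

# versus reconstructing the value byte-by-byte, as dump.c does:
b0 = read(s, UInt8); b1 = read(s, UInt8); b2 = read(s, UInt8); b3 = read(s, UInt8)
n = UInt32(b0) << 24 | UInt32(b1) << 16 | UInt32(b2) << 8 | UInt32(b3)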

@carnaval (Contributor) commented Aug 4, 2015

Big endian works for me; we can even use the hton* family for that.

@aviks (Member) commented Aug 4, 2015

I'm with @stevengj on this. Let's stick a version number in the image, and then build the rest of the functionality incrementally.

@johnmyleswhite (Member) commented Aug 4, 2015 via email

@stevengj (Member, Author) commented Aug 4, 2015

(Realize, of course, that there is other endian-specific binary data in the file anyway. The analogy with network protocols just doesn't make a lot of sense for a cache file. But since there is no penalty to reading and writing in network order, I suppose that we might as well.)

@stevengj (Member, Author) commented Aug 4, 2015

Okay, changed the serialization functions to use bigendian order, added a test case, and added docs for include_dependency.

@lobingera

Hello colleagues,

I'm +1 for doing a proof of concept right now, using timestamps, with just a version number there. But the other points (hashes, JSON or a similar standard format) shouldn't be forgotten.

@stevengj (Member, Author) commented Aug 4, 2015

Pushed an update so that cache_dependencies returns the module "UUID" (actually just a timestamp) that was already stored in the .ji file. For an automated build system, this is important to check in the case where the module has already been imported into Julia (in which case the filesystem timestamp for the module is irrelevant until a new Julia session is started).
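
A heavily hedged sketch of that check (the (name, uuid) layout and loaded_module_uuid are assumptions for illustration, not this PR's API):

# Assumed: modules carries (name, uuid) pairs from the .ji header, and
# loaded_module_uuid is an invented helper for the in-memory module's UUID.
function session_stale(cachefile)
    modules, files = Base.cache_dependencies(cachefile)
    for (name, uuid) in modules
        # Compare against the in-memory module, not the filesystem timestamp:
        # the running session may predate later edits on disk.
        isdefined(Main, name) && loaded_module_uuid(name) != uuid && return true
    end
    return false
end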

@ScottPJones (Contributor)

But since there is no penalty to reading and writing in network order, I suppose that we might as well.

Why is that so? I've always seen a big difference between reading/writing in native order and "network order" (unless you are on a big-endian machine).
Even IBM is moving to LE for its Linux-on-Power stuff, as of when I last talked to their people.
Right now, and most likely forever, most Julia installations are going to be LE, so why pay a penalty to byte-swap values?
I also recall that support was recently added to Julia for reading native-endian integers faster.

@ScottPJones (Contributor)

The typical time difference between (b) and (c) will be days or weeks for end users, and even for developers hacking on Foo it will typically be minutes.

Only if the only things being compiled are human generated.

Think of hundreds or thousands of processes, distributed across a cluster of multi-core machines or even geographically, all sharing source code and object code (object code that can be compiled on any node), all doing constant updates from code generation: generating code from a SQL query, creating an optimized iterator or search function, and so on. Using timestamps never worked for that, in my experience. I think it would be rather sad if Julia, which wants to handle big data and parallel processing, couldn't handle things that a 40-year-old "legacy" language can.

@stevengj (Member, Author) commented Aug 4, 2015

@ScottPJones, (a) we're talking about a small amount of metadata for the endian conversions, and (b) we're only talking about precompiled modules in Julia (and Base.compile only works from the master node). Arbitrary metaprogramming can already happen dynamically in Julia and is not affected by this.

@stevengj (Member, Author) commented Aug 4, 2015

Closing, as this PR is superseded by #12458.

@stevengj closed this Aug 4, 2015