store/retrieve dependencies in .ji file #12445

Closed
wants to merge 2 commits

Conversation

@stevengj (Member) commented Aug 3, 2015

As discussed in #12259, as a prerequisite to automated image recompilation, this stores all of the dependencies of a module in the .ji file:

  • The modules that were depended on were already stored in the .ji header.
  • Now the .ji header also stores the pathnames of all files that were included by the compiled module and its submodules (but not files included by imported modules).
  • There is also a new include_dependency(filename) function to "manually" declare a dependency of the module on some other file that is not an included .jl file (usage sketch below).
  • Base.cache_dependencies(cachefile) returns (modules, files), where modules is an array of the module names and files is an array of the filenames stored in the cache file.
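
As a usage sketch (module Foo and the data file are invented for illustration; include_dependency and Base.cache_dependencies are the functions added here):

module Foo

include("utils.jl")                 # include'd .jl files are recorded automatically
include_dependency("data/foo.csv")  # manually declare a non-.jl dependency

end

# A build tool can later query the stored dependencies:
modules, files = Base.cache_dependencies("/path/to/Foo.ji")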

To do:

  • document include_dependency
  • deal with endian-ness mismatch in cache_dependencies
  • test case

On the second point, I noticed something odd: the write_int32 functions etc. in src/dump.c read and write in little-endian order. Update: dump.c has been changed to use bigendian (network) order for its metadata.

@stevengj added the compiler:precompilation (Precompilation of modules) label Aug 3, 2015
@StefanKarpinski (Member)

Would it make sense for the format to be flexible enough to store a variety of properties of each dependency? E.g. have room for content hashes and timestamps?

@stevengj (Member, Author) commented Aug 3, 2015

@StefanKarpinski, I don't see why timestamps need to be stored, because the only relevant timestamps are those of the files at the moment the .ji is loaded. That is, the procedure for load(cachefile) is (see the sketch after this list):

  • read T = filesystem_timestamp(cachefile)
  • for dep in dependencies(cachefile) check filesystem_timestamp(dep) <= T.
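
A minimal sketch of that check (stale is a hypothetical helper; mtime and Base.cache_dependencies are the real calls):

# Hypothetical staleness check built on this PR's API: the cache is stale
# if any recorded dependency was modified after the cache file was written.
function stale(cachefile)
    T = mtime(cachefile)
    modules, files = Base.cache_dependencies(cachefile)
    return any(dep -> mtime(dep) > T, files)
end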

Regarding hashes, we could always add that information later if we need it, assuming we add a header to the .ji file that includes format information (#12444). But right now it seems likely from the discussion that any automated build system will be based on timestamps (like make and friends, or like .pyc files).

(I agree that the content should be flexible, but that's why we need a version header in the file as per #12444.)

@lobingera

@stevengj
I just read this by coincidence, but since make is mentioned: very strange things can happen with timestamps, e.g. on remotely mounted filesystems on computers with non-synchronised clocks. I recommend going beyond timestamps and including something like a hash.

@StefanKarpinski (Member)

Yes, tracking the timestamps may not be necessary; I was mainly thinking of hashes. I'm not sure that the choice needs to be either/or – you could have a system that can figure out what needs to be rebuilt from timestamps or from hashes if you happen to be on a system where timestamps are unreliable.

If the mechanism for associating data with dependencies is dict-like, then you don't need to bump version numbers for the .ji file format to allow flexibility. It seems better not to change the data format every time you want to change what data you store. For example, you could decide at some point that SHA1 is no longer a good choice for hashing and start using SHA256, but continue to handle .ji files that have SHA1 hashes.

@stevengj (Member, Author) commented Aug 3, 2015

Python has used timestamps to invalidate .pyc files for decades, with no apparent problems. (Checking timestamps is a lot faster than computing hashes; remember that this has to be checked every time a cached module is imported, potentially for a large number of files when you include the recursive dependencies.)

@stevengj (Member, Author) commented Aug 3, 2015

@StefanKarpinski, storing the dependencies in JSON format (or any similar easily parseable dict-like format) seems like serious overkill here. And worrying too much about backward compatibility of .ji files doesn't make much sense to me. They are just cache files; regenerating them in the rare cases where the format changes doesn't seem like a serious problem.

@stevengj (Member, Author) commented Aug 3, 2015

Two reasons why unreliable network timestamps are not a big deal for this kind of application (as the Python experience has shown): (1) both the modules and the cache files will normally live on the same filesystem (wherever .julia lives) and (2) the time deltas when the cache is invalidated will typically be huge, so that small inaccuracies won't matter. Think of how this is used: I (a) install package Foo, (b) create the cache file, and then (c) update the package source later on, and then the cache is invalidated the next time I (d) import Foo. The typical time difference between (b) and (c) will be days or weeks for end users, and even for developers hacking on Foo it will typically be minutes.

@stevengj (Member, Author) commented Aug 3, 2015

(Even for software development, is there any widely used make replacement that is not timestamp-based?)

@pao (Member) commented Aug 3, 2015

SCons (flexible, with md5sum+timestamp as the default). It's not exactly a speed demon, though.

My experiences with systems without real-time clocks have been recounted elsewhere. It is a niche, but it's a niche which includes the Raspberry Pi; do with that what you will.

@stevengj (Member, Author) commented Aug 3, 2015

Where unreliable timestamps typically cause problems for make-like systems (make, cmake, ninja, cake, and so on; SCons seems to be the exception that proves the rule here) is when the build system itself generates or modifies source files. That happens frequently in general-purpose build systems, but isn't going to happen in Base.compile.

@tkelman (Contributor) commented Aug 3, 2015

Just because we haven't seen it yet doesn't mean there won't be a good use case in the future where packages will want to dynamically generate Julia code and save it to the filesystem.

I've had clock skew complaints from make and cmake just due to frequently working over scp on a lot of Julia build-related files across computers whose clocks differ by several minutes. I don't think accounting for flexibility and extensibility would be a bad thing here.

@stevengj (Member, Author) commented Aug 3, 2015

And again, if we ever find that we want a hash-based cache-invalidation system (which I doubt), and someone writes it, then we just bump the .ji version number in the header.

It just doesn't make sense to me to write the associated code (to compute SHAs and store them) now, when there's no immediate need for it and it can be easily added later. (Do we even have an SHA function in Base somewhere? I suppose there must be one in libgit somewhere.)

@stevengj (Member, Author) commented Aug 3, 2015

@tkelman, flexibility and extensibility are added by sticking a version number in the .ji header.

(Packages dynamically generate Julia code now, during Pkg.build... but do we really want to encourage them to generate code and save it in the filesystem at require-time? I would seriously question the sanity of doing this, as opposed to ordinary metaprogramming.)

@lobingera

@stevengj
both the modules and the cache files will normally live on the same filesystem (wherever .julia lives)
There were already questions on julia-users and julia-dev about having a read-only Julia installation of packages/modules. In that case the .ji files have to live elsewhere anyway, so there is a non-vanishing probability that they live on different filesystems.
Also, even if a JSON dict seems like overkill and could be implemented later: we should learn a little from history and use standard file formats without exception (where applicable). The danger of another micro-parser buffer-overflow story is just around the corner.

Which, by the way, raises a question: do we need, or can we apply, code signing for the .ji files, so that code injection can be prevented?

@tkelman (Contributor) commented Aug 3, 2015

Well I agree with #12444 entirely. Enough people are clamoring that "timestamps are unreliable" that "there's no immediate need for it" remains controversial, and presumably someone wants to actually follow through with implementing something fancier.

Do we even have an SHA function in Base somewhere? I suppose there must be one in libgit somewhere.

There's SHA.jl (not currently in base) which BinDeps uses. Would have to check whether libgit2 exposes anything for general-purpose file hashing.
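
For reference, a hedged sketch of what per-file hashing with SHA.jl might look like (filehash is an invented helper; sha256 is SHA.jl's exported function):

using SHA

# Invented helper: hash a dependency's contents; the .ji header could store
# this digest next to each pathname.
filehash(path) = bytes2hex(sha256(readall(path)))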

do we really want to encourage them to generate code and save it in the filesystem at require-time? I would seriously question the sanity of doing this, as opposed to ordinary metaprogramming.

Ordinary metaprogramming would have to be coupled with serialization and/or IPC if you want to make generated code usable as callbacks from a non-Julia language. Going through the filesystem may be simpler for some use cases.

@stevengj (Member, Author) commented Aug 3, 2015

Some points:

  • The initial cache-rebuilding system will almost certainly be timestamp-based, for both speed and simplicity — it doesn't make sense to implement something slower and more complicated until we try it out and see whether this is a problem in practice for our .pyc-like caching system, not just in general-purpose build systems (and even there it is rare enough that most make-like systems don't bother with hashes).
  • Adding it now would entail a lot of additional complexity (e.g. folding SHA.jl into Base).
  • Adding it later would be painless (at least, no more work than adding it now), because you would just invalidate the old .ji files.

The question is, why is it necessary to bite the bullet and add hashing now? Which of the above points do you disagree with?

@stevengj (Member, Author) commented Aug 3, 2015

(I don't understand your claim that "ordinary metaprogramming would have to be coupled with serialization and/or IPC if you want to make generated code usable as callbacks". eval-ed or include_string-ed code is no different from code include-ed from a file. It is perfectly possible to make a cfunction callback that is written with metaprogramming.)

@tkelman (Contributor) commented Aug 3, 2015

It is perfectly possible to make a cfunction callback that is written with metaprogramming.

Of course it is. But if you then want to AOT compile that generated callback code into a shared library on disk that another language could use, and periodically make updates to the generated code and recompile online, and execution and generation are not necessarily happening on the same machine, it's not clear any currently existing system is going to handle that well.

Or if other languages are responsible for generating the Julia source code in the first place, such as the Excel-JuMP links in OpenSolver, for example.

I think people are mostly disagreeing with point 1, anticipating that this will be a problem in practice. But I don't feel strongly enough about it to say that we shouldn't evaluate this to see how well it works under basic usage.

@stevengj (Member, Author) commented Aug 3, 2015

Here are the options:

  • Build an initial .ji recompilation system based on timestamps. See how it works before moving to anything more complicated.
  • Refuse to accept any .ji recompilation unless it is based on hashes. If that is the case, you need to merge a hash algorithm into Base before any work can proceed.

If we choose the first option, then there is no point to adding hashes to the .ji file now — they can always be added later by invalidating the cache. (Currently every Julia build invalidates the cache anyway, because the cache files depend on a timestamp of the Core and Base packages.)

@blakejohnson (Contributor)

I agree with @stevengj: start with a simple design that has a clear path for evolving it if necessary.

@ScottPJones (Contributor)

Do you really need something as slow and fancy as SHA* or md5?
In practice, I found using CRC-32 worked fine for cache invalidation.

@timholy (Member) commented Aug 4, 2015

I just tested CRC32 on every *.jl file in my .julia/v0.4 directory (something like 2500 files). First execution was 1.2 s, second execution was 0.36 s. For comparison, on this system using Gadfly now runs in about 4.1 s (precompiled, of course).

Not negligible compared to package-loading times, but also not dominant. I'm not sure how many of those 2500 files get loaded by using Gadfly, but I bet it's a pretty sizable fraction.

@ScottPJones (Contributor)

Which CRC-32 implementation did you use? I've seen one in a CRC32.jl package and one in CRC.jl.
CRC32.jl looks pretty simple compared to the C version I used to use, and CRC.jl looks pretty complex; I'm not sure how it would perform. Since it was for cache invalidation, I was able to play tricks like always doing 8 or 16 bytes at a time and, for the last chunk of fewer than 8/16 bytes (if any), just padding with 0 bytes.

@timholy (Member) commented Aug 4, 2015

Just the Unix crc32 utility: fls=$(find . -iname "*.jl") && time crc32 $fls. I should point out that this also included the time to print all the results, and that could have dominated the timing.

@ScottPJones (Contributor)

That's also including directory scanning operations and time reading the files.
If it is done as part of the compilation process, then you have to load the files anyway to compile them, so that wouldn't happen twice. That also might dominate.

@timholy (Member) commented Aug 4, 2015

using CRC32

# Recursively collect the paths of all *.jl files under base into the vector jl.
function scanjl!(jl, base)
    fls = readdir(base)
    for fl in fls
        flfull = joinpath(base, fl)
        if isdir(flfull)
            scanjl!(jl, flfull)
        elseif isfile(flfull) && endswith(flfull, ".jl")
            push!(jl, flfull)
        end
    end
    jl
end

# Compute the CRC-32 of each file's contents; sum the checksums so the loop
# returns a value.
function docrc(jl)
    s = 0.0
    for fl in jl
        c = crc32(readall(fl), UInt32(0))
        s += c
    end
    s
end

@time jl = scanjl!(ByteString[], "/home/tim/.julia/v0.4")
@show length(jl)
@time docrc(jl)

Results:

julia> include("/tmp/docrc.jl")
  4.262114 seconds (2.58 M allocations: 143.339 MB, 1.24% gc time)
length(jl) = 2772
  0.158719 seconds (83.18 k allocations: 19.643 MB, 4.24% gc time)
5.972116115294e12

CRC32 computation seems pretty fast. (But that scanjl! is hilariously slow compared to the Unix find command.)

@timholy (Member) commented Aug 4, 2015

If it is done as part of the compilation process, then you have to load the files anyway to compile them, so that wouldn't happen twice.

No, the point is that even when X is precompiled and you just want to using X, you also have to readall each *.jl file each time; otherwise, how do you tell whether the programmer changed one of the files that X depends on? This is @stevengj's main concern, I think. But my tests suggest that if you already have the list of *.jl files, the read & CRC32 step isn't really all that bad.

Of course, if our packages become even 10x faster to load someday, then this will become a big performance hit.

@stevengj (Member, Author) commented Aug 4, 2015

Performance aside, I think it's also unnecessary complexity for a first-pass build system, considering that checksums/hashes can easily be added on later. (Why should this work wait on us deciding on and implementing/merging a hash algorithm?)

Meanwhile, @JeffBezanson, do you have any comment on the endian-ness question at the top? It seems like those read/write functions in dump.c were written in the distant past (< 2011), so probably only you know why they were written in this way.

@vtjnash (Member) commented Aug 4, 2015

I've had clock skew complaints from make and cmake just due to frequently working over scp on a lot of Julia build-related files across computers whose clocks differ by several minutes. I don't think accounting for flexibility and extensibility would be a bad thing here.

Me too. My solution has typically been to just fix the clocks, but I guess I've usually had the luxury of admin access on almost every machine I've had to work with.

Another option is to just include a (gzip?) copy of all sources in the .ji output. This could be useful for generating debug info anyway, and it is probably not significantly more costly than computing a CRC.

@timholy (Member) commented Aug 4, 2015

Another option is to just include a (gzip?) copy of all sources in the .ji output. This could be useful for generating debug info anyway, and it is probably not significantly more costly than computing a CRC.

Just to check, you're proposing a byte-by-byte comparison as the test for source-code change?

@timholy (Member) commented Aug 4, 2015

@stevengj, I would go with native endianness; after all, these are files meant to be executed on the local machine, not a data-storage format. Probably add ENDIAN_BOM right after the magic.

IO is a bit of a mess with regard to endianness, but there has also been persistent interest in viewing the problem this way: http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html.
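
A minimal sketch of that suggestion (ENDIAN_BOM is a real Base constant; the surrounding header layout is assumed):

# Writer: put the byte-order mark right after the magic bytes.
write(io, ENDIAN_BOM)               # 0x04030201 on a little-endian machine

# Reader: detect (and reject, or byte-swap) a foreign byte order.
bom = read(io, UInt32)
bom == ENDIAN_BOM || error("cache file written with foreign endianness")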

@stevengj (Member, Author) commented Aug 4, 2015

Guys, think of this from a software-development perspective. By initially making the build system timestamp-based, we can process the work in smaller parallel PRs: a PR like this one to cache the dependencies, a PR to do the auto-rebuild based on timestamp comparisons, a PR to implement a checksum algorithm, and finally a PR to add checksum comparisons.

Can we have a separate issue to argue about which (if any) checksum algorithm to use, and then a separate PR to implement it? Right now this PR has been completely overwhelmed by discussions that are not directly relevant to caching the dependencies, which is what I'm trying to do here.

@stevengj (Member, Author) commented Aug 4, 2015

@timholy, I agree about adopting native endianness. I was worried because I thought that there might be some other place in the code that depended on the endian choice in dump.c, but it looks like that is not the case. I tried a fresh patch to change dump.c to bigendian, and the tests pass, so it looks like only these write_* and read_* functions matter:

--- a/src/dump.c
+++ b/src/dump.c
@@ -134,44 +134,44 @@ static jl_array_t *datatype_list=NULL; // (only used in MODE_SYSTEM_IMAGE)

 static void write_int32(ios_t *s, int32_t i)
 {
-    write_uint8(s, i       & 0xff);
-    write_uint8(s, (i>> 8) & 0xff);
-    write_uint8(s, (i>>16) & 0xff);
     write_uint8(s, (i>>24) & 0xff);
+    write_uint8(s, (i>>16) & 0xff);
+    write_uint8(s, (i>> 8) & 0xff);
+    write_uint8(s, i       & 0xff);
 }

 static int32_t read_int32(ios_t *s)
 {
-    int b0 = read_uint8(s);
-    int b1 = read_uint8(s);
-    int b2 = read_uint8(s);
     int b3 = read_uint8(s);
+    int b2 = read_uint8(s);
+    int b1 = read_uint8(s);
+    int b0 = read_uint8(s);
     return b0 | (b1<<8) | (b2<<16) | (b3<<24);
 }

 static void write_uint64(ios_t *s, uint64_t i)
 {
-    write_int32(s, i       & 0xffffffff);
     write_int32(s, (i>>32) & 0xffffffff);
+    write_int32(s, i       & 0xffffffff);
 }

 static uint64_t read_uint64(ios_t *s)
 {
-    uint64_t b0 = (uint32_t)read_int32(s);
     uint64_t b1 = (uint32_t)read_int32(s);
+    uint64_t b0 = (uint32_t)read_int32(s);
     return b0 | (b1<<32);
 }

 static void write_uint16(ios_t *s, uint16_t i)
 {
-    write_uint8(s, i       & 0xff);
     write_uint8(s, (i>> 8) & 0xff);
+    write_uint8(s, i       & 0xff);
 }

 static uint16_t read_uint16(ios_t *s)
 {
-    int b0 = read_uint8(s);
     int b1 = read_uint8(s);
+    int b0 = read_uint8(s);
     return b0 | (b1<<8);
 }

@carnaval (Contributor) commented Aug 4, 2015

On the endianness thing, I think it's just better to pick one as the definition of the storage format.

See for example https://lkml.org/lkml/2015/4/22/628. It's about a network protocol, but it applies to storage formats too. Even if it's rare to exchange .ji files, we should still make it possible given the negligible cost.

@stevengj (Member, Author) commented Aug 4, 2015

@carnaval, that's fine, but if we are going to pick a canonical endianness for I/O, I think it should be bigendian. Otherwise I'll need to re-implement the read_int32 functions etc. in Julia; it's easy, of course, but it is also silly to have these when we have hton functions for canonicalizing the byte order already.

(And yes @timholy, I know that you can read them byte-by-byte as in read_int32. That's still a lot longer to write than ntoh(read(s, UInt64)).)
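
For illustration (ntoh and read are real Base functions; the byte-by-byte version mirrors what read_int32 does in dump.c):

# One line, given a canonical big-endian (network) byte order:
n = ntoh(read(s, UInt32))

# versus reconstructing the value byte-by-byte, as dump.c does:
b0 = read(s, UInt8); b1 = read(s, UInt8); b2 = read(s, UInt8); b3 = read(s, UInt8)
n = UInt32(b0) << 24 | UInt32(b1) << 16 | UInt32(b2) << 8 | UInt32(b3)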

@carnaval (Contributor) commented Aug 4, 2015

Big endian works for me; we can even use the hton* family for that.

@aviks (Member) commented Aug 4, 2015

I'm with @stevengj on this. Let's stick a version number in the image, and then build the rest of the functionality incrementally.

@johnmyleswhite (Member) commented Aug 4, 2015 via email

@stevengj (Member, Author) commented Aug 4, 2015

(Realize, of course, that there is other endian-specific binary data in the file anyway. The analogy with network protocols just doesn't make a lot of sense for a cache file. But since there is no penalty to reading and writing in network order, I suppose that we might as well.)

@stevengj (Member, Author) commented Aug 4, 2015

Okay, changed the serialization functions to use bigendian order, added a test case, and added docs for include_dependency.

@lobingera

Hello colleagues,

I'm +1 for doing a proof of concept right now, using timestamps, with just a version number there. But the other points (hashes, JSON or a similar standard format) shouldn't be forgotten.

@stevengj (Member, Author) commented Aug 4, 2015

Pushed an update so that cache_dependencies returns the module "UUID" (actually just a timestamp) that was already stored in the .ji file. For an automated build system, this is important to check in the case where the module has already been imported into Julia (in which case the filesystem timestamp for the module is irrelevant until a new Julia session is started).
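
A heavily hedged sketch of that check (the (name, uuid) layout and loaded_module_uuid are assumptions for illustration, not this PR's API):

# Assumed: modules carries (name, uuid) pairs from the .ji header, and
# loaded_module_uuid is an invented helper for the in-memory module's UUID.
function session_stale(cachefile)
    modules, files = Base.cache_dependencies(cachefile)
    for (name, uuid) in modules
        # Compare against the in-memory module, not the filesystem timestamp:
        # the running session may predate later edits on disk.
        isdefined(Main, name) && loaded_module_uuid(name) != uuid && return true
    end
    return false
end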

@ScottPJones (Contributor)

But since there is no penalty to reading and writing in network order, I suppose that we might as well.

Why is that so? I've always seen a big difference between reading/writing in native order and "network order" (unless you are on a big-endian machine).
Even IBM is moving to LE for its Linux-on-Power stuff, as of when I last talked to their people.
Right now, and most likely forever, most Julia installations are going to be LE, so why pay a penalty to byte-swap values?
I also recall that support was recently added to Julia for reading native-endian integers faster.

@ScottPJones (Contributor)

The typical time difference between (b) and (c) will be days or weeks for end users, and even for developers hacking on Foo it will typically be minutes.

Only if the only things being compiled are human generated.

Think of hundreds or thousands of processes, distributed across a cluster of multi-core machines or even geographically, all sharing source code and object code (object code that can be compiled on any node), all doing constant updates from code generation: generating code from a SQL query, creating an optimized iterator or search function, and so on. Using timestamps never worked for that, in my experience. I think it would be rather sad if Julia, which wants to handle big data and parallel processing, couldn't handle things that a 40-year-old "legacy" language can.

@stevengj (Member, Author) commented Aug 4, 2015

@ScottPJones, (a) we're talking about a small amount of metadata for the endian conversions, and (b) we're only talking about precompiled modules in Julia (and Base.compile only works from the master node). Arbitrary metaprogramming can already happen dynamically in Julia and is not affected by this.

@stevengj (Member, Author) commented Aug 4, 2015

Closing, as this PR is superseded by #12458.

@stevengj closed this Aug 4, 2015