store/retrieve dependencies in .ji file #12445
Would it make sense for the format to be flexible enough to store a variety of properties of each dependency? E.g. have room for content hashes and timestamps?
@StefanKarpinski, I don't see why timestamps need to be stored, because the only relevant timestamps are those of the files at the moment the .ji file is loaded, and those can be checked at load time rather than stored.
Regarding hashes, we could always add that information later if we need it, assuming we add a version header to the .ji file. (I agree that the format should be flexible, but that's why we need a version header in the file as per #12444.)
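(For concreteness, a minimal sketch of what such a versioned header could look like; write_header and the layout here are hypothetical, not the actual .ji format.)

# Sketch: lead the cache file with a format version so that everything
# after it can change whenever the version is bumped.
function write_header(io, deps)
    write(io, UInt32(1))              # format version
    write(io, UInt32(length(deps)))   # number of dependency paths
    for d in deps
        write(io, UInt32(sizeof(d)))  # length-prefixed path bytes
        write(io, d)
    end
end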
@stevengj Yes, tracking the timestamps may not be necessary; I was mainly thinking of hashes. I'm not sure the choice needs to be either/or: you could have a system that figures out what needs to be rebuilt from timestamps, or from hashes if you happen to be on a system where timestamps are unreliable. If the mechanism for associating data with dependencies is dict-like, then you don't need to bump version numbers for the .ji file format to allow flexibility. It seems better not to change the data format every time you want to change what data you store. For example, you could decide at some point that SHA-1 is no longer a good choice for hashing and start using SHA-256, but continue to handle .ji files that have SHA-1 hashes.
Python has used timestamps to invalidate its compiled .pyc files for many years, and that has worked well in practice.
@StefanKarpinski, storing the dependencies in JSON format (or any similarly easily-parseable dict-like format) seems like serious overkill here. And worrying too much about backward-compatibility of the .ji format seems unnecessary, since stale caches can simply be regenerated.
Two reasons why unreliable network timestamps are not a big deal for this kind of application (as the Python experience has shown): (1) both the modules and the cache files will normally live on the same filesystem (wherever the package directory lives), so their timestamps come from the same clock; and (2) the cache file is by construction written after its dependencies are read, so under normal operation the relative ordering of timestamps is meaningful even if the absolute clock is off.
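(For reference, the timestamp check under discussion amounts to a one-line comparison of modification times; is_stale is a hypothetical name:)

# A cache is stale if any dependency was modified after the cache was written.
# mtime is in Base and returns the modification time in seconds since the epoch.
function is_stale(jifile, deps)
    t = mtime(jifile)
    any(dep -> mtime(dep) > t, deps)
end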
(Even for software development, is there any widely used build system that relies on content hashes rather than timestamps?)
SCons (flexible, with md5sum+timestamp as the default). It's not exactly a speed demon, though. My experiences with systems without real-time clocks have been recounted elsewhere. It is a niche, but it's a niche which includes the Raspberry Pi; do with that what you will.
Where unreliable timestamps typically cause problems for tools like make is when sources and build products live on different filesystems or on machines with skewed clocks, which is not the common case here.
Just because we haven't seen it yet doesn't mean there won't be a good use case in the future where packages want to dynamically generate Julia code and save it to the filesystem. I've had clock-skew complaints from make and cmake just from frequently copying a lot of Julia build-related files over scp between computers whose clocks differ by several minutes. I don't think accounting for flexibility and extensibility would be a bad thing here.
And again, if we ever find that we want a hash-based cache-invalidation system (which I doubt), and someone writes it, then we just bump the .ji version number. It just doesn't make sense to me to write the associated code (to compute SHAs and store them) now, when there's no immediate need for it and it can easily be added later. (Do we even have an SHA function in Base somewhere? I suppose there must be one in libgit somewhere.)
@tkelman, flexibility and extensibility are added by sticking a version number in the .ji header. (Packages dynamically generate Julia code now, during their build step, and timestamps handle that case fine.)
@stevengj Which, by the way, raises a question: do we need, or can we apply, code signing for the .ji files, so that code injection can be prevented?
Well, I agree with #12444 entirely. Enough people are clamoring that "timestamps are unreliable" that "there's no immediate need for it" remains controversial, and presumably someone wants to actually follow through with implementing something fancier.
There's SHA.jl (not currently in Base), which BinDeps uses. We'd have to check whether libgit2 exposes anything for general-purpose file hashing.
Ordinary metaprogramming would have to be coupled with serialization and/or IPC if you want to make generated code usable as callbacks from a non-Julia language. Going through the filesystem may be simpler for some use cases.
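(For reference, hashing a file with the SHA.jl package mentioned above might look like the sketch below; hashfile is a hypothetical helper, and the exact sha256 signature has varied across versions of the package.)

using SHA  # the SHA.jl package discussed above (not in Base)

# Hypothetical helper: hash a dependency file's entire contents.
hashfile(path) = sha256(readall(path))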
Some points: (1) timestamp-based invalidation has worked well in practice, e.g. for Python's .pyc files; (2) the version header from #12444 already gives us extensibility, since we can bump it if we ever change what we store; (3) hash-based invalidation can be layered on later without writing any hashing code now. The question is, why is it necessary to bite the bullet and add hashing now? Which of the above points do you disagree with?
(I don't understand your claim that ordinary metaprogramming would have to be coupled with serialization and/or IPC to make generated code usable as callbacks; isn't that just ordinary code generation?)
Of course it is. But if you then want to AOT-compile that generated callback code into a shared library on disk that another language could use, periodically update the generated code and recompile online, with execution and generation not necessarily happening on the same machine, it's not clear that any currently existing system handles that well. Or other languages may be responsible for generating the Julia source code in the first place, as with the Excel-JuMP links in OpenSolver, for example. I think people are mostly disagreeing with point 1, anticipating that this will be a problem in practice. But I don't feel strongly enough about it to say we shouldn't evaluate this and see how well it works under basic usage.
Here are the options: (1) invalidate the cache based on timestamps, or (2) invalidate it based on content hashes. If we choose the first option, then there is no point in adding hashes to the .ji file now; if we ever switch to the second, we bump the version header and add them then.
I agree with @stevengj: start with a simple design that has a clear path for evolving it if necessary. |
Do you really need something as slow and fancy as SHA* or md5? |
I just tested CRC32 on every .jl file in my package directory (code and timings below). Not negligible compared to package-loading times, but also not dominant. I'm not sure how many of those 2500 files actually get loaded by a typical using, though.
Which CRC-32 implementation did you use? I've seen one in the CRC32.jl package and one in CRC.jl.
Just the unix one, via the CRC32.jl package:
(The timings also include the directory-scanning operations and the time spent reading the files.)
using CRC32

# Recursively collect the paths of all .jl files under `base`.
function scanjl!(jl, base)
    fls = readdir(base)
    for fl in fls
        flfull = joinpath(base, fl)
        if isdir(flfull)
            scanjl!(jl, flfull)
        elseif isfile(flfull) && endswith(flfull, ".jl")
            push!(jl, flfull)
        end
    end
    jl
end

# Read each file and accumulate its CRC-32 checksum.
function docrc(jl)
    s = 0.0
    for fl in jl
        c = crc32(readall(fl), UInt32(0))
        s += c
    end
    s
end

@time jl = scanjl!(ByteString[], "/home/tim/.julia/v0.4")
@show length(jl)
@time docrc(jl)

Results:

julia> include("/tmp/docrc.jl")
4.262114 seconds (2.58 M allocations: 143.339 MB, 1.24% gc time)
length(jl) = 2772
0.158719 seconds (83.18 k allocations: 19.643 MB, 4.24% gc time)
5.972116115294e12

CRC32 computation seems pretty fast. (But note that the 4.26 seconds is the directory scan, not the checksumming.)
No, the point is that even when the checksum itself is cheap, you still have to read every dependency file from disk on every load, and that is not negligible. Of course, if our packages become even 10x faster to load someday, then this will become a big performance hit.
Performance aside, I think it's also unnecessary complexity for a first-pass build system, considering that checksums/hashes can easily be added on later. (Why should this work wait on us deciding on and implementing/merging a hash algorithm?) Meanwhile, @JeffBezanson, do you have any comment on the endianness question at the top? It seems like those write_int32 functions in src/dump.c should be settled one way or the other.
me too. my solution has typically been to just fix the clocks, but i guess i've usually had the luxury of admin access on almost every machine i've ever had to work with. another option is to just include a (gzip?) copy of all sources in the .ji file.
Just to check, you're proposing a byte-by-byte comparison as the test for source-code change? |
@stevengj, I would go with native endianness; after all, these are files meant to be executed on the local machine, not some data-storage format. (Probably add a byte-order mark to the header so a mismatch can at least be detected.) Our IO code is a bit of a mess with regards to endianness, but there's also been persistent interest in viewing the problem this way: http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html.
Guys, think of this from a software-development perspective. By initially having the build system be timestamp-based, we can process the PRs in smaller parallel chunks: a PR like this to cache the dependencies, a PR to do the auto-rebuild based on timestamp comparisons, a PR to implement a checksum algorithm, and finally a PR to add checksum comparisons. Can we have a separate issue to argue about which (if any) checksum algorithm to use, and then a separate PR to implement it? Right now this PR has been completely overwhelmed by discussions that are not directly relevant to caching the dependencies, which is what I'm trying to do here.
@timholy, I agree about adopting native endianness. I was worried because I thought that there might be some other place in the code that depended on the endian choice in dump.c. For reference, here is what changing the serialization functions to bigendian order would look like:

--- a/src/dump.c
+++ b/src/dump.c
@@ -134,44 +134,44 @@ static jl_array_t *datatype_list=NULL; // (only used in MODE_SYSTEM_IMAGE)

 static void write_int32(ios_t *s, int32_t i)
 {
-    write_uint8(s, i & 0xff);
-    write_uint8(s, (i>> 8) & 0xff);
-    write_uint8(s, (i>>16) & 0xff);
     write_uint8(s, (i>>24) & 0xff);
+    write_uint8(s, (i>>16) & 0xff);
+    write_uint8(s, (i>> 8) & 0xff);
+    write_uint8(s, i & 0xff);
 }

 static int32_t read_int32(ios_t *s)
 {
-    int b0 = read_uint8(s);
-    int b1 = read_uint8(s);
-    int b2 = read_uint8(s);
     int b3 = read_uint8(s);
+    int b2 = read_uint8(s);
+    int b1 = read_uint8(s);
+    int b0 = read_uint8(s);
     return b0 | (b1<<8) | (b2<<16) | (b3<<24);
 }

 static void write_uint64(ios_t *s, uint64_t i)
 {
-    write_int32(s, i & 0xffffffff);
     write_int32(s, (i>>32) & 0xffffffff);
+    write_int32(s, i & 0xffffffff);
 }

 static uint64_t read_uint64(ios_t *s)
 {
-    uint64_t b0 = (uint32_t)read_int32(s);
     uint64_t b1 = (uint32_t)read_int32(s);
+    uint64_t b0 = (uint32_t)read_int32(s);
     return b0 | (b1<<32);
 }

 static void write_uint16(ios_t *s, uint16_t i)
 {
-    write_uint8(s, i & 0xff);
     write_uint8(s, (i>> 8) & 0xff);
+    write_uint8(s, i & 0xff);
 }

 static uint16_t read_uint16(ios_t *s)
 {
-    int b0 = read_uint8(s);
     int b1 = read_uint8(s);
+    int b0 = read_uint8(s);
     return b0 | (b1<<8);
 }
On the endianness thing, I think it's just better to pick one as the definition of the storage format. See for example https://lkml.org/lkml/2015/4/22/628; it's about a network protocol, but it applies to storage formats too. Even if it's rare to exchange .ji files, we should still make it possible, given the negligible cost.
@carnaval, that's fine, but if we are going to pick a canonical endianness for I/O, I think it should be bigendian. Otherwise I'll need to re-implement the hton/ntoh-style conversion functions for little-endian order. (And yes @timholy, I know that you can read the bytes one at a time as in that blog post, but the standard conversion functions are more convenient.)
Big endian works for me; we can even use the hton* family for that.
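(For reference, Julia's Base already exposes this family as hton and ntoh. A minimal sketch, with hypothetical helper names write_be/read_be:)

# hton converts host byte order to network (big-endian) order and ntoh
# converts back; both are no-ops on a big-endian host.
write_be(io, x) = write(io, hton(x))
read_be(io, T) = ntoh(read(io, T))

# e.g. write_be(io, UInt32(42)); x = read_be(io, UInt32)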
I'm with @stevengj on this. Let's stick a version number in the image, and then build the rest of the functionality incrementally. |
+1 to that
|
(Realize, of course, that there is other endian-specific binary data in the file anyway. The analogy with network protocols just doesn't make a lot of sense for a cache file. But since there is no penalty to reading and writing in network order, I suppose that we might as well.) |
Okay: changed the serialization functions to use bigendian order, added a test case, and added docs for include_dependency.
Hello colleagues, I'm +1 for doing a proof-of-concept right now using timestamps, with just a version number in the header. But the other points (hashes, JSON or a similar standard format) shouldn't be forgotten.
Pushed an update so that Base.cache_dependencies(cachefile) returns both the module and file dependencies.
Why is that so? I've always seen a big difference between reading/writing in native order and "network order" (unless you are on a big-endian machine). |
Only if the only things being compiled are human-generated. Think of hundreds or thousands of processes, distributed across a cluster of multi-core machines, or even distributed geographically, all sharing source code and object code, where the object code can be compiled on any node, with constant updates from code generation: generating code from a SQL query, creating an optimized iterator, a search function, etc. Using timestamps never worked for that, in my experience. I think it would be rather sad if Julia, which wants to handle big data and parallel processing, can't handle things that a 40-year-old "legacy" language can.
@ScottPJones, (a) we're talking about a small amount of metadata for endian conversions; (b) we're only talking about precompiled modules in Julia (and the .ji cache is a local optimization, not a mechanism for shipping compiled code around a cluster).
Closing, as this PR is superseded by #12458. |
As discussed in #12259, as a prerequisite to automated image recompilation, this stores all of the dependencies of a module in the .ji file:

- The names of the modules that the compiled module depends on are stored in the .ji header.
- The .ji header also stores the pathnames of all files that were included by the compiled module and its submodules (but not files included by imported modules).
- There is an include_dependency(filename) function to "manually" declare a dependency of the module on some other file that is not an included .jl file.
- Base.cache_dependencies(cachefile) returns (modules, files), where modules is an array of the module names and files is an array of the filenames stored in the file.

To do:

- Documentation for include_dependency.
- A test for cache_dependencies.
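(A sketch of how the API described above fits together; the module name and paths are made up:)

module MyPkg
    include("helpers.jl")                  # recorded automatically as a file dependency
    include_dependency("data/table.csv")   # manually declared non-.jl dependency
end

# later, inspect what a cache file depends on:
modules, files = Base.cache_dependencies("/path/to/MyPkg.ji")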
On the second point, I noticed something odd: the write_int32 functions etc. in src/dump.c read and write in little-endian order.

Updated: dump.c is changed to use bigendian (network) order for its metadata.