Ice velocity integration #97

Conversation
Codecov Report
Attention: Patch coverage is

@@ Coverage Diff @@
##             main      #97       +/-   ##
===========================================
+ Coverage   47.09%   76.52%   +29.43%
===========================================
  Files          19       20        +1
  Lines         879      997      +118
===========================================
+ Hits          414      763      +349
+ Misses        465      234      -231

View full report in Codecov by Sentry.
Force-pushed f9a2fb4 to f9967d9.
Commits:
…er2D + move to Rasters.jl to load surface velocity datacube + fix several bugs with projection
…ake datacube for testing purposes
Force-pushed 836245b to 3a50573.
Awesome job! This is a lot of foundational work; it will be very cool to be able to leverage these datasets from now on.
I just have some minor comments and some questions (purely out of curiosity). Feel free to change whatever seems important and then we can merge.
@@ -13,6 +13,7 @@ CoordRefSystems = "b46f11dc-f210-4604-bfba-323c1ec968cb"
DataStructures = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
DateFormats = "44557152-fe0a-4de1-8405-416d90313ce6"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
DimensionalData = "0703355e-b756-11e9-17c0-8b28908087d0"
This is already available from Rasters.jl. Did you add it just to have the necessary exported namespaces?
Yes, this is just to access `Dim` in Sleipnir/test/surface_velocity.jl, for example. If that bothers you, we can probably do `Rasters.Dim` instead.
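For reference, a minimal sketch of what this gives access to, assuming Rasters.jl re-exports the DimensionalData names as stated above (the dimension names and sizes below are purely illustrative):

```julia
using Rasters  # re-exports DimensionalData, so names like `Dim` are in scope

# A tiny array with named custom dimensions, e.g. for a test datacube
A = DimArray(rand(4, 3), (Dim{:date}(1:4), Dim{:band}(1:3)))

# Without a direct dependency on DimensionalData, the qualified form also works
B = DimArray(rand(4, 3), (Rasters.Dim{:date}(1:4), Rasters.Dim{:band}(1:3)))
```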
@@ -12,6 +12,7 @@ using JLD2
using Distributed
using Statistics
using CairoMakie
using Observables
What is this for? I've seen it's some sort of improved `Ref`.
It is used in lines like
Observables.connect!(cb.height, @lift CairoMakie.Fixed($(viewport(ax.scene)).widths[2]))
whose purpose is to make the colorbar have the same height as the heatmap we are plotting.
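For readers unfamiliar with Observables.jl, a minimal sketch of the "improved Ref" idea (not taken from the PR):

```julia
using Observables

x = Observable(1)          # holds a value, like a Ref
y = map(v -> 2v, x)        # y is an Observable that automatically tracks x
x[] = 3                    # updating x propagates to y
@assert y[] == 6
```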
- `date1::Union{Vector{DateTime}, Nothing}`: First date of acquisition.
- `date2::Union{Vector{DateTime}, Nothing}`: Second date of acquisition.
- `date_error::Union{Vector{Day}, Vector{Millisecond}, Nothing}`: Error in `date`.
- `glacierGridded::Bool`: Whether the data have been gridded to the glacier grid or not.
It's not entirely clear what you mean by this. Is there a case where the data are not gridded to the glacier grid? Like, do you use the same struct to store the original raw data and the final data on the glacier grid?
Yes, we use the same struct to store the original raw data and the gridded ones. I don't think we need a specific struct to store the original raw data. Doing so would even come with memory issues: here we load the data lazily, only when we need them, and many of the points in the original data aren't useful since they are outside of the glacier grid.
Most of the time we probably won't use the raw data directly. However, it can be convenient in some cases for debugging or to visualize data from the netcdf files.
I changed the name to isGridGlacierAligned to make it clearer.
Constructor for ice surface velocity data based on Rabatel et al. (2023).

Important remarks:
- Projections in longitude and latitude assume we are working in the northern hemisphere.
  If working with southern-hemisphere glaciers, this needs to be changed.
- The error in velocity is unique per timestamp, rather than being pixel distributed.
- The error in the absolute velocities `vabs_error` is overestimated.
Why is this? Does this come from the paper(s)? And by overestimated, do you mean that the error is larger than reported or lower than reported?
Also, this assumption about the northern hemisphere is quite strong. At the very least I would give a warning somehow (e.g. in case there is a way to infer that we are not in the northern hemisphere) and clearly document this in the docs.
Concerning the errors, this is something that @facusapienza21 implemented, so I'm not 100% sure of the answer. I guess this is because, for every point of the grid, we don't have the velocities at every time, so we use an upper bound by taking, for every grid point, the maximum speed over time. In the end the error is overestimated. @facusapienza21 am I right?
As for the northern hemisphere restriction, this is actually something that I removed in this PR, and the code now supports data in the southern hemisphere too.
I don't remember exactly what the problem was, but as long as you can specify whether the projection is done in the northern or southern hemisphere it should be fine!
I think right now we can only handle the northern hemisphere unless we change this code:
# Spatial preprocessing
proj_zone = ncgetatt(file, "mapping", "utm_zone_number")
transform(X,Y) = Sleipnir.UTMercator(X, Y; zone=proj_zone, hemisphere=:north)
Right now this is done for the northern hemisphere. I think this was also something from the dataset: since the dataset for now only covers the northern hemisphere, it is not yet clear which attribute of the dataset will determine whether the datacube is in the northern or southern hemisphere. This is a good question for Romain/Antoine, but for now I will keep the remark until we are sure the preprocessing will work with southern data.
OK, I suggest we open an issue to keep track of this.
In fact, I already changed that line in the PR. The code is now:
hemisphere = (y[end]+y[begin])/2 >= 0 ? :north : :south
transform(X,Y) = Sleipnir.UTMercator(X, Y; zone=Int(params_projection["zone"]), hemisphere=hemisphere)
Since in the UTM coordinate system y corresponds to the northing, y=0 corresponds to the equator (I double-checked this numerically).
If you think there can be bugs, we can add an assert, put back the remark in the documentation and create an issue for that, but I think this code will work with datasets in the southern hemisphere.
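If we want the extra safety mentioned above, a minimal sketch of such an assert, assuming `y` holds the UTM northings used in the hemisphere inference (the function name is illustrative, not part of the PR):

```julia
# Guard against grids whose northings straddle the equator, where inferring the
# hemisphere from the mean northing would be ambiguous.
function infer_hemisphere(y::AbstractVector{<:Real})
    @assert all(>=(0), y) || all(<(0), y) "Northings change sign: hemisphere inference is ambiguous."
    return (y[end] + y[begin]) / 2 >= 0 ? :north : :south
end
```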
compute_vabs_error::Bool=true
) where {G <: AbstractGlacier, VM <: VelocityMapping}

velRast = isa(raster, String) ? RasterStack(raster, lazy=true) : raster
Is there any penalty or downside in using the Rasters in lazy mode?
From my experience it's slower when we read the data. But the most important point is that for one of the datacubes I've been playing with, it is simply impossible to load it in a non-lazy way. In the netcdf file the data are represented in a sparse way on disk, but once loaded into memory everything is dense, and this represents more than 15GB of RAM!
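A hedged sketch of the difference, with a hypothetical file name and layer name:

```julia
using Rasters
import NCDatasets  # backend Rasters uses to read netcdf files

# Lazy: only metadata is read now; values are pulled from disk when indexed
cube = RasterStack("velocity_datacube.nc"; lazy=true)

# Reading a single layer/time slice only materializes what is needed
vx_first = read(cube[:vx][Ti(1)])

# read(cube) would materialize the whole (dense) cube in memory,
# which for some datacubes exceeds 15GB of RAM.
```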
abstract type VelocityMapping end

"""
    MeanDateVelocityMapping <: VelocityMapping
I am a bit confused to see the iceflow model mentioned here for a tool which in theory only serves to map the grid of the observations to the grid we use to model the glacier. I understand that the way we associate the velocity to a given date will impact our interpretation of how we can compare it to the model, but this seems overly rigid. The fact that we know the start and end date of the observations used to compute the velocity is very useful and can be used when comparing it to model outputs.
There seems to be a big gap in terms of complexity between the two mapping approaches. Here's a suggestion: what about enabling the addition of more data points in the MeanDateVelocityMapping, to go beyond just the middle date? If we sample more data points from outputs of the iceflow model and average them, it should be increasingly robust. This shouldn't be too computationally expensive; it would just imply storing a denser solution and then averaging over an increased number of model outputs. The number of model outputs used could be somehow proportional to the length of the time window used in the calculation of the surface velocity.
Well, there are two different things that I mixed together: the spatial interpolation and the temporal interpolation. This is why the ice flow model is mentioned here, but we don't even care about it in Sleipnir. These are just structs to define which flavors of spatio-temporal mappings are implemented.
As discussed together today, I agree that these approaches are the two extremes in terms of complexity, and we'll likely implement a simplification of the second one. We could either implement that in IntegratedTrajectoryMapping, with hyper-parameters that allow reducing the number of time steps used, or implement it in another struct that is a subtype of VelocityMapping.
We'll probably implement these mappings in other PRs. All of this is exciting and I can't wait to start playing with these mappings!
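To illustrate the last option, a hypothetical sketch of such an intermediate mapping, assuming the `VelocityMapping` abstract type quoted above is in scope (the struct and field names are illustrative, not part of the PR):

```julia
# A middle ground between MeanDateVelocityMapping and IntegratedTrajectoryMapping:
# average a fixed number of model outputs sampled over the acquisition window.
struct SampledWindowVelocityMapping <: VelocityMapping
    nSamples::Int  # number of iceflow outputs sampled between date1 and date2
end
```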
if isdir(prepro_dir)
    daysSinceLastDownload = (Dates.now() - Dates.unix2datetime(mtime(prepro_dir)))/Day(1)
    if daysSinceLastDownload > 1
        # Re-download data if older than one day
Isn't this a bit extreme? Why should files change that much after just one day?
Well, if you switch to my branch and you don't download the data, the code will crash. This is just to ensure that this situation doesn't happen and that you have the most up-to-date preprocessed data when you switch to a version of the code that needs it. In production we can probably comment these lines out, but for development this makes things so much easier. I would be happy to find a workaround, as I agree that it's a bit extreme for most users.
OK, I think this is now doable because we only have a little data in the Google Drive. If we implemented this with the Bremen server, would it attempt to download the whole thing? Or just the necessary gdirs?
It would attempt to download the whole thing. But a cleaner way to do that would be to compute a checksum locally (or just read a version number) and compare it with a checksum on the server. If the two don't match, then we download the data. We could also do that silently with a try/catch so that if the user doesn't have an internet connection, it skips the download and uses the local version.
I created an issue for this specific task: #104
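A rough sketch of the checksum idea, under stated assumptions: the manifest file, the checksum URL and the download_all_data helper are hypothetical, not part of the PR.

```julia
using Downloads, SHA

function maybe_redownload(prepro_dir::String, checksum_url::String)
    manifest = joinpath(prepro_dir, "MANIFEST")  # hypothetical file describing the local data version
    local_sum = isfile(manifest) ? bytes2hex(sha256(read(manifest))) : ""
    try
        buf = Downloads.download(checksum_url, IOBuffer())
        remote_sum = strip(String(take!(buf)))
        if remote_sum != local_sum
            # download_all_data(prepro_dir)  # hypothetical helper that fetches the data
        end
    catch
        @warn "Could not reach the server; using the local preprocessed data."
    end
end
```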
This looks great @albangossard!!! Thank you for the work! I left comments in the review. I have a more general question. I think there are two different approaches to handling the fact that PDE simulations are gridded while the observations are not: either grid the observations onto the glacier grid, or leave them as they are and interpolate the PDE output to where there is data.
In this PR, @albangossard you implemented the first one, right? So now the whole dataset is gridded in a raster of the same size as the target glacier. However, in principle the second approach is also valid, and a priori I was thinking of moving forward with that: just run the PDE on its grid and then do some interpolation to evaluate it where there is data. To me the second approach was a bit better since it leaves the data unaltered, while it is the numerical model that needs to adapt to how the data are spatially/temporally distributed. Furthermore, to evaluate a loss function we don't need the data to be gridded "in time" (I agree that in space this makes more sense), so there is no reason to pack the data into discrete intervals of time: all this can be handled by the model itself when computing the loss function (a numerical integral) or the gradient/adjoint (another numerical integral). We can definitely try this implementation first, but in principle both approaches are valid. Do you have any thoughts on this @albangossard or @JordiBolibar?
I still don't fully understand how the review works, because I always leave comments and add things in the code but it does not show up as a "review". I am adding this now so it shows that I did the review (which I did before).
vx::Union{Vector{Matrix{F}}, Nothing}
vy::Union{Vector{Matrix{F}}, Nothing}
vabs::Union{Vector{Matrix{F}}, Nothing}
This looks good to me! Just curious: any reason why a vector of matrices is better than just a 3D tensor?
I think it's nicer to handle. Since two dimensions are spatial and one is temporal, I feel it's much nicer to iterate through time and get back spatial grids. Otherwise the user needs to know which of the 3 dimensions is time.
Originally it was because the plotting functions assume that spatio-temporal data are a vector of matrices. I don't have a strong opinion, except that it should be uniform across the different structs.
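A small illustration of the two layouts discussed here (sizes are arbitrary):

```julia
# Vector of spatial matrices: iterating over time directly yields 2D grids
vx = [rand(100, 80) for _ in 1:12]
vx_t3 = vx[3]                    # spatial field at the third time step

# Equivalent 3D-tensor layout: the user must know which axis is time
vx_tensor = rand(100, 80, 12)
vx_t3_bis = vx_tensor[:, :, 3]
```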
Thank you for your quick and complete reviews @JordiBolibar @facusapienza21!
I'm answering the general comment here.

> In this PR, @albangossard you implemented the first one, right? So now the whole dataset is gridded in a raster of the same size as the target glacier. However, in principle the second approach is also valid, and a priori I was thinking of moving forward with that: just run the PDE on its grid and then do some interpolation to evaluate it where there is data.

The main reason I implemented the first version and not the second is that it's more computationally intensive to grid the data on-the-fly. Right now it already takes a few seconds, and the interpolation is really simple. Although currently some of the time is spent reading from disk, which is something that could be loaded in advance, I expect on-the-fly gridding to significantly slow down the simulation. We would also have to truncate the observations to spare memory, since the datacube is spatially larger than the glacier (only about 1/4 of the spatial grid is useful for the Bossons glacier).

> Furthermore, to evaluate a loss function we don't need the data to be gridded "in time" [...], so there is no reason to pack the data into discrete intervals of time: all this can be handled by the model itself when computing the loss function [...].

Note that with the current implementation, even if the data are spatially gridded, there is no restriction in time. The data are available at the same time steps as those of the datacube, and we can i) take them all and store the dense solution, or ii) pick a few of them and compare them to the simulated speeds, or iii) average them as suggested by @JordiBolibar. All of this can be done afterwards by playing only with the time component.
I see these mappings in a very similar way as the gradient computation: we can split how the mapping is applied in time and in space.

> We can definitely try this implementation first, but in principle both approaches are valid. Do you have any thoughts on this @albangossard or @JordiBolibar?

From a mathematical perspective, I don't see any difference between the two as soon as the interpolation is properly implemented, which is obviously not the case right now since we're using the nearest points.
To end this remark, I agree that we will probably also implement the second option, at least to compare the two approaches. But in the long term my bet is that in practice we'll stick to the first version, just because of the computational cost.
I'm happy to continue discussing this! :)
vy = [eltype(x).(replace(vy[:,:,i], missing => NaN)) for i in 1:size(vy, 3)]
vabs = [eltype(x).(replace(vabs[:,:,i], missing => NaN)) for i in 1:size(vabs, 3)]

if mapToGlacierGrid
Yes, you're right, I merged the two if statements in a270f5f.
All good! Feel free to merge and probably bump the version to start propagating changes to Huginn and ODINN.
Same, all good for me! :)
This PR contains multiple changes:
- Fix of the instantiation of `Results` without the keyword arguments. In Julia, we cannot define a function whose keyword argument has a default value depending on a type parameter, because the keyword argument is not specialized during compilation. Such an implementation would work if `a` was a positional argument. We didn't see this problem before because we were always providing a value to `Results(...)` for the keyword arguments whose default values depend on `F`. In order to avoid such bugs in the future, I added a unit test that instantiates `Results`.
- The `light` option of simulation was removed since we should use `save_everystep` instead.
- Updates to the `Results` instantiation and the plot functions.
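A minimal sketch of the keyword-argument issue described above, with illustrative names (not the actual `Results` code):

```julia
# F appears only in the keyword argument, and its default refers to F.
# When the keyword is omitted, F is never bound, so the default F[] cannot be built.
g(; a::Vector{F} = F[]) where {F <: AbstractFloat} = length(a)

g(a = [1.0, 2.0])   # works: F is taken from the provided keyword value
# g()               # fails: F is undefined when the default is used

# With a positional argument, F can always be determined from the call:
h(a::Vector{F} = Float64[]) where {F <: AbstractFloat} = length(a)
h()                  # works
```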