[Question] Save only subset of obs space #57
Comments
For example, if some obs key has dtype int64 while the needed tty_* keys have int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case; for 10M, ~900GB. With int8 it would be a reasonable size instead.
Is this because the obs space is of dtype int64, or because h5py stores them with this type by default? This is something we should look into. The problem is that the obs/actions are automatically flattened, and for that we need to know the spaces https://gymnasium.farama.org/api/spaces/utils/#gymnasium.spaces.utils.flatten. We can make the flattening optional: the observations would be saved as a collection of separate datasets, and we would need to modify the sampling function to return dictionaries. What I don't like is having spaces that differ from the environment's, because of reproducibility. A solution could be to give the dataset its own action/observation space and serialize them into JSON like the environment.
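A rough sketch of what serializing a dataset-specific Dict space to JSON could look like (this is only an illustration, not Minari's actual serialization code; `space_to_json` and `space_from_json` are hypothetical helpers):

```python
import json

import numpy as np
from gymnasium.spaces import Box, Dict


def space_to_json(space: Dict) -> str:
    """Serialize a Dict-of-Box space into a JSON string (illustrative helper)."""
    description = {
        key: {
            "low": np.asarray(sub.low).tolist(),
            "high": np.asarray(sub.high).tolist(),
            "shape": list(sub.shape),
            "dtype": str(sub.dtype),
        }
        for key, sub in space.spaces.items()
    }
    return json.dumps(description)


def space_from_json(payload: str) -> Dict:
    """Rebuild the Dict space from its JSON description."""
    description = json.loads(payload)
    return Dict(
        {
            key: Box(
                low=np.asarray(spec["low"]),
                high=np.asarray(spec["high"]),
                shape=tuple(spec["shape"]),
                dtype=np.dtype(spec["dtype"]),
            )
            for key, spec in description.items()
        }
    )
```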
@rodrigodelazcano I think this is because some observations are int64, so when flattening, NumPy casts all the other dtypes to int64. In that case it does not make much sense to store them in another type in h5py, because the precision of observations that genuinely need int64 would be lost.
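A minimal check of that upcasting behaviour, assuming a recent gymnasium (key names and bounds below are purely illustrative):

```python
import numpy as np
from gymnasium.spaces import Box, Dict
from gymnasium.spaces.utils import flatten

# A Dict space mixing an int64 modality with a small uint8 one.
space = Dict(
    {
        "glyphs": Box(low=0, high=5976, shape=(21, 79), dtype=np.int64),
        "tty_chars": Box(low=0, high=255, shape=(24, 80), dtype=np.uint8),
    }
)

obs = space.sample()
flat = flatten(space, obs)
# Concatenation promotes everything to the widest dtype, so the uint8
# tty_chars now take 8 bytes per element as well.
print(flat.dtype)  # int64
```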
That's what I had in mind when I imagined how it currently works. I think this is important, because all the really complex environments (which benefit most from having a dataset) have non-trivial spaces, and not all of the keys always need to be saved.
I agree, we just need to figure out how to specify and store the obs/action spaces of the dataset and be able to retrieve them when loading the dataset. I'll make a PR with a possible implementation of this; let me know if you have any ideas. Also, if we are doing this, I like your idea of storing dictionary spaces without flattening #55 (comment). This is straightforward since Minari stores nested dictionaries as dataset collections. We can flatten the observations when sampling if required.
@rodrigodelazcano Saving without flattening is also a good feature in my opinion, as unflattening huge observation spaces on the fly during sampling adds overhead to replay buffer sampling time (not that big, but it adds up over time).
The problem with saving without flattening is that you cannot use a NumPy buffer. This costs much more performance than reshaping (which just changes the array's metadata). To address the question of this issue, I think you can do it by specifying your own observation space. Nevertheless, I can see saving without flattening being helpful or even necessary in some cases.
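For reference, reshaping really is only a metadata change; a quick check:

```python
import numpy as np

flat = np.arange(24 * 80, dtype=np.uint8)
grid = flat.reshape(24, 80)
# The reshaped view shares the original buffer; no data is copied.
print(np.shares_memory(flat, grid))  # True
```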
@rodrigodelazcano Unfortunately, as I said above, I can't specify another observation space, because the method that collects data during training uses some keys while I save others, and they don't overlap. Therefore, it has to stay as it is.
I don't quite understand why this is the case. If you save some custom space, the user can always make their own buffer with separate NumPy arrays for each modality and then sample as usual. In the general case, the buffer can work with dictionaries of NumPy arrays and be agnostic about the space structure, as in the sketch below. There is almost no overhead here: it is still the same sampling, with at most a small loop over the ~3-10 keys (which are usually not that many). What bothers me much more (though maybe I am missing something), and why I have now decided not to use Minari for my project (and to write my own simplified wrapper instead), is that during flattening all data is cast to the single largest dtype. This can turn a merely big dataset into an absolutely massive one, and it makes a difference.
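A minimal sketch of that space-agnostic buffer (class name, keys, and shapes are illustrative and not part of Minari):

```python
import numpy as np


class DictReplayBuffer:
    """Replay buffer that keeps each observation key in its own array."""

    def __init__(self, capacity, obs_spec):
        # obs_spec maps key -> (shape, dtype); every modality keeps its own
        # dtype, so nothing gets upcast to the largest type.
        self.obs = {
            key: np.zeros((capacity, *shape), dtype=dtype)
            for key, (shape, dtype) in obs_spec.items()
        }
        self.capacity = capacity
        self.size = 0
        self.pos = 0

    def add(self, obs):
        for key, value in obs.items():
            self.obs[key][self.pos] = value
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idxs = np.random.randint(0, self.size, size=batch_size)
        # The only extra cost over a flat buffer is this small loop over the keys.
        return {key: array[idxs] for key, array in self.obs.items()}


# Usage with illustrative MiniHack-like modalities:
buffer = DictReplayBuffer(
    capacity=1000,
    obs_spec={
        "glyphs": ((21, 79), np.int64),
        "tty_chars": ((24, 80), np.uint8),
    },
)
```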
I understand, this makes sense to me, and I didn't think about this issue when I first implemented it. The beta release will be out soon; afterwards we will add this to our roadmap:
Thanks a lot for the suggestions, @Howuhh
You are right; it wasn't clear before. Thank you for the idea!
@Howuhh We should now support the pattern you need for MiniHack. Please check this test https://github.com/Farama-Foundation/Minari/blob/main/tests/data_collector/callbacks/test_step_data_callback.py to see an example of the pattern, and let us know if it is sufficiently expressive for your use cases. Moreover, we currently officially support
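Roughly, the pattern looks like the following sketch (paraphrased, with illustrative key and environment names; the linked test is the authoritative example and the exact Minari signatures may differ):

```python
import gymnasium as gym
from minari import DataCollectorV0, StepDataCallback

KEYS_TO_KEEP = ("tty_chars", "tty_colors", "tty_cursor")  # illustrative


class SubsetStepDataCallback(StepDataCallback):
    """Keep only a subset of the observation keys in the stored step data."""

    def __call__(self, env, **kwargs):
        step_data = super().__call__(env, **kwargs)
        step_data["observations"] = {
            key: value
            for key, value in step_data["observations"].items()
            if key in KEYS_TO_KEEP
        }
        return step_data


env = gym.make("MiniHack-River-v0")  # assumes minihack is installed
# The dataset declares its own (smaller) observation space instead of the env's.
dataset_obs_space = gym.spaces.Dict(
    {key: env.observation_space[key] for key in KEYS_TO_KEEP}
)
env = DataCollectorV0(
    env,
    step_data_callback=SubsetStepDataCallback,
    observation_space=dataset_obs_space,
)
```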
We will issue a minor release soon, once we have fixed a few of our first-party datasets to comply with the new standard and slightly improved test coverage.
@balisujohn Cool! I think you need to fix the documentation for the new format, though.
We'll have that fixed shortly; the doc build failed.
@Howuhh The docs built from the main branch are now available: https://minari.farama.org/main/content/dataset_standards/
This issue should be solved now, so I'll close it. @Howuhh feel free to reopen it if I am missing something.
Question
I am trying to build a dataset for the MiniHack environment. The environment has a fairly large dictionary observation space, with many keys but only one level of nesting. I don't want to save all the keys, because doing so bloats the size of the dataset enormously. I thought it could be done with a StepDataCallback, like this:
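A minimal sketch of the idea, a StepDataCallback subclass that keeps only the chosen keys (key names here are illustrative):

```python
from minari import StepDataCallback

KEYS_TO_SAVE = ("tty_chars", "tty_colors", "tty_cursor")


class SubsetStepDataCallback(StepDataCallback):
    def __call__(self, env, **kwargs):
        step_data = super().__call__(env, **kwargs)
        # Drop every observation key we don't want in the dataset.
        step_data["observations"] = {
            key: step_data["observations"][key] for key in KEYS_TO_SAVE
        }
        return step_data
```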
However, this raises an error, as some keys from the original observation space are missing. Why can't I just use only the ones I want to save? I am logging data during distributed PPO training, which uses the other keys too, so they are needed for online training :(