[Question] Save only subset of obs space #57

Closed
Howuhh opened this issue Apr 7, 2023 · 15 comments
Comments

@Howuhh
Contributor

Howuhh commented Apr 7, 2023

Question

I am trying to build a dataset for the MiniHack environment. The environment has a fairly large dictionary observation space with many keys, but only one level of nesting. I don't want to save all the keys, because that bloats the dataset size considerably. I thought this could be done with StepDataCallback, like this:

import minari

class TTYStepDataCallback(minari.StepDataCallback):
    def __call__(self, env, obs, info, action=None, rew=None, terminated=None, truncated=None):
        # Keep only the tty_* keys; drop everything else before it gets stored.
        tty_filter_keys = ["tty_chars", "tty_colors", "tty_cursor"]
        obs = {k: v for k, v in obs.items() if k in tty_filter_keys}

        step_data = super().__call__(env, obs, info, action, rew, terminated, truncated)
        return step_data

However, this raises an error, as some keys from the original observation space are missing. Why can't I just keep only the keys I want to save? I am logging data during distributed PPO training, which uses the other keys too, so they are needed for online training :(

@Howuhh
Contributor Author

Howuhh commented Apr 8, 2023

For example, if some obs key has dtype int64 while the needed tty_* keys are int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case, and for 10M samples roughly 900GB. With int8, however, it would be of a reasonable size.
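
As a rough illustration of the upcasting (a toy Dict space standing in for the real MiniHack one, so the keys and shapes here are made up):

import numpy as np
import gymnasium as gym
from gymnasium.spaces.utils import flatten

space = gym.spaces.Dict({
    # one wide-dtype key is enough to drag the whole flattened vector up to int64
    "glyphs": gym.spaces.Box(low=0, high=5000, shape=(21, 79), dtype=np.int64),
    "tty_chars": gym.spaces.Box(low=0, high=255, shape=(24, 80), dtype=np.uint8),
})

obs = space.sample()
flat = flatten(space, obs)
print(flat.dtype)   # int64: np.concatenate promotes everything to the widest dtype
print(flat.nbytes)  # 8 bytes per element, even for values that fit in one byte

Stored as int64, every element takes 8 bytes instead of 1, which is where the blow-up in dataset size comes from.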

@rodrigodelazcano
Member

For example, if some obs key has dtype int64 while the needed tty_* keys are int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case, and for 10M samples roughly 900GB. With int8, however, it would be of a reasonable size.

Is this because the obs space is of dtype int64 or because h5py is storing them with this type by default?

This is something we should look into. The problem is that the obs/actions are automatically flattened, and for that we need to know the spaces: https://gymnasium.farama.org/api/spaces/utils/#gymnasium.spaces.utils.flatten. We can make the flattening optional; the observations would then be saved as a collection of separate datasets, and we would need to modify the sampling function to return dictionaries.

What I don't like is having spaces that differ from the environment's, because of reproducibility. A solution could be to have an action/observation space for the dataset and serialize them into JSON, like we do for the environment.
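
Something along these lines, as a very rough sketch (hypothetical layout and helper names, not the actual Minari format): each key of a dict observation gets its own HDF5 dataset with its native dtype, and a JSON description of the dataset-specific space is stored next to the data.

import json
import h5py
import numpy as np
import gymnasium as gym

def save_dict_observations(path, obs_list, obs_space):
    """Store each key of a Dict observation as a separate HDF5 dataset,
    keeping the per-key dtype instead of upcasting through flattening."""
    with h5py.File(path, "w") as f:
        for key, subspace in obs_space.spaces.items():
            stacked = np.stack([obs[key] for obs in obs_list])
            f.create_dataset(f"observations/{key}", data=stacked, dtype=subspace.dtype)
        # Hypothetical JSON description of the dataset's (possibly reduced) space.
        space_json = {
            key: {"shape": list(sub.shape), "dtype": str(sub.dtype)}
            for key, sub in obs_space.spaces.items()
        }
        f.attrs["observation_space"] = json.dumps(space_json)

# Example: a reduced, dataset-only space containing just the tty_* keys.
tty_space = gym.spaces.Dict({
    "tty_chars": gym.spaces.Box(0, 255, (24, 80), dtype=np.uint8),
    "tty_cursor": gym.spaces.Box(0, 255, (2,), dtype=np.uint8),
})
save_dict_observations("episode_0.hdf5", [tty_space.sample() for _ in range(8)], tty_space)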

@Howuhh
Contributor Author

Howuhh commented Apr 15, 2023

@rodrigodelazcano I think this is because some observations are int64, so when flattening, numpy casts all other dtypes to int64. In that case it does not make much sense to store them with another dtype in h5py, because the precision of the observations that actually need int64 would be lost.

A solution could be to have an action/observation space for the dataset and serialize them into JSON, like we do for the environment

That's exactly what I had in mind when I imagined how it currently works. I think it is important, because all of the really complex environments (which benefit most from having a dataset) have non-trivial spaces, and not all of the keys always need to be saved.

@rodrigodelazcano
Member

I agree; we just need to figure out how to specify and store the obs/action spaces of the dataset and be able to retrieve them when loading it. I'll make a PR with a possible implementation; let me know if you have any ideas.

Also, if we are doing this, I like your idea of storing dictionary spaces without flattening #55 (comment). This is straightforward, since Minari stores nested dictionaries as dataset collections. We can flatten the observations when sampling if required.

@Howuhh
Contributor Author

Howuhh commented Apr 17, 2023

@rodrigodelazcano Saving without flattening is also a good feature in my opinion, as unflattening huge observation spaces on the fly during sampling adds overhead to replay buffer sampling time (not that much, but it adds up over time).

@younik
Member

younik commented Apr 21, 2023

The problem with saving without flattening is that you cannot use a NumPy buffer. This causes a much bigger performance drop than reshaping (which just changes the metadata of the array).

To address the question in this issue, I think you can do it by specifying your observation_space differently.

Nevertheless, I see saving without flattening as helpful, or even necessary, for Sequence spaces.
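
For context, a Sequence space yields variable-length observations, so there is no fixed shape to preallocate a NumPy buffer for. A toy example, not tied to any particular environment:

import numpy as np
import gymnasium as gym

space = gym.spaces.Sequence(gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32))
a, b = space.sample(), space.sample()
print(len(a), len(b))  # typically different lengths, so the samples cannot be stacked
                       # into one fixed-shape array without padding or ragged storage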

@Howuhh
Contributor Author

Howuhh commented Apr 22, 2023

@rodrigodelazcano Unfortunately, as I said above, I can't specify a different observation space for the environment, because the method that collects data during training uses some keys while I save others, and the two sets don't overlap. The environment's space therefore has to stay as it is.

The problem with saving without flattening is that you cannot use a NumPy buffer.

I don't quite understand why this is the case. If you save some custom space, the user can always build their own buffer with separate numpy arrays for each modality and then sample as usual. In the general case, the buffer can work with dictionaries of numpy arrays and be agnostic about the space structure. There would be almost no overhead here; it is still the same sampling, with at most a small loop over the keys (~3-10 iterations, and the number of keys is usually not that big).
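
A minimal sketch of such a space-structure-agnostic buffer (names and details are made up for illustration; it assumes a Dict space of fixed-shape Box subspaces):

import numpy as np

class DictReplayBuffer:
    """One NumPy array per observation key, each kept in its native dtype."""

    def __init__(self, capacity, obs_space):
        self.capacity = capacity
        self.size = 0
        self.pos = 0
        self.obs = {
            key: np.empty((capacity, *sub.shape), dtype=sub.dtype)
            for key, sub in obs_space.spaces.items()
        }

    def add(self, obs):
        for key, value in obs.items():  # the small per-key loop mentioned above
            self.obs[key][self.pos] = value
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.size, size=batch_size)
        return {key: arr[idx] for key, arr in self.obs.items()}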

What bothers me much more (though maybe I am missing something), and why I have now decided not to use Minari for my project any further (and to write my own simplified wrapper instead), is that during flattening all data is cast to the single largest dtype. This can turn a merely big dataset into a massive one, and it makes a real difference.

@rodrigodelazcano
Member

rodrigodelazcano commented Apr 28, 2023

I understand; this makes sense to me, and I didn't think about this issue when I first implemented it. The beta release will be out soon; afterwards we will add the following to our roadmap:

  • specify custom spaces for the datasets (different from the environment's)
  • store unflattened obs/actions and maintain their dtypes; still, I think we should have the option to flatten them during sampling (a rough sketch follows below)
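
A rough idea of what flatten-on-sampling could look like, using gymnasium's flatten utilities (the function name and the dict-of-arrays batch layout are assumptions made for this sketch):

import numpy as np
from gymnasium.spaces.utils import flatten, flatten_space

def maybe_flatten_batch(obs_space, obs_batch, flatten_obs=False):
    """obs_batch: dict mapping each key to an array with a leading batch dimension."""
    if not flatten_obs:
        return obs_batch
    flat_space = flatten_space(obs_space)
    batch_size = next(iter(obs_batch.values())).shape[0]
    out = np.empty((batch_size, flat_space.shape[0]), dtype=flat_space.dtype)
    for i in range(batch_size):
        step_obs = {key: arr[i] for key, arr in obs_batch.items()}
        out[i] = flatten(obs_space, step_obs)  # data stays compact on disk, flattened only here
    return out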

Thanks a lot for the suggestions @Howuhh

@younik
Member

younik commented Apr 28, 2023

the buffer can work with dictionaries of numpy arrays

You are right; it wasn't clear before, thank you for the idea!

@balisujohn
Collaborator

balisujohn commented Jun 17, 2023

@Howuhh We should now support the pattern you need for MiniHack. Please check this test https://github.com/Farama-Foundation/Minari/blob/main/tests/data_collector/callbacks/test_step_data_callback.py for an example of the pattern, and let us know if it is sufficiently expressive for your use cases.
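
Sketching the pattern from memory (the DataCollectorV0 arguments and the callback signature are as I recall them from the linked test, so double-check against it; the MiniHack task id and its gymnasium registration are also assumed):

import gymnasium as gym
import minari
import minihack  # noqa: F401 -- registers the MiniHack-* environment ids
from minari import DataCollectorV0

tty_filter_keys = ["tty_chars", "tty_colors", "tty_cursor"]

class TTYStepDataCallback(minari.StepDataCallback):
    def __call__(self, env, obs, info, action=None, rew=None, terminated=None, truncated=None):
        # Keep only the subset of keys the dataset should contain.
        obs = {k: v for k, v in obs.items() if k in tty_filter_keys}
        return super().__call__(env, obs, info, action, rew, terminated, truncated)

env = gym.make("MiniHack-River-v0")

# The dataset's observation space is just the tty_* subset of the env's Dict space;
# the live env still exposes the full space for online training.
dataset_obs_space = gym.spaces.Dict({k: env.observation_space[k] for k in tty_filter_keys})

env = DataCollectorV0(
    env,
    step_data_callback=TTYStepDataCallback,
    observation_space=dataset_obs_space,
)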

Moreover, we currently officially support Discrete, Box, Dict, and Tuple spaces with arbitrary nesting, with plans to support Text spaces.

@balisujohn
Collaborator

We will be issuing a minor release soon, once we have fixed a few of our first-party datasets to comply with the new standard and slightly improved test coverage.

@Howuhh
Contributor Author

Howuhh commented Jun 19, 2023

@balisujohn Cool! I think you need to fix the documentation for the new format, though.

@balisujohn
Collaborator

We'll have that fixed shortly; the doc build failed.

@balisujohn
Collaborator

@Howuhh https://minari.farama.org/main/content/dataset_standards/ The doc built from main now reflects the new dataset format.

@younik
Member

younik commented Jul 10, 2023

This issue should be solved now, so I am closing it. @Howuhh feel free to reopen it if I am missing something.

@younik younik closed this as completed Jul 10, 2023