[Question] Save only subset of obs space #57

Closed
Howuhh opened this issue Apr 7, 2023 · 15 comments
Comments

@Howuhh
Contributor

Howuhh commented Apr 7, 2023

Question

I am trying to build a dataset for the MiniHack environment. The environment has a fairly large dictionary observation space with many keys, but only one level of nesting. I don't want to save all the keys, because that bloats the dataset size considerably. I thought this could be done with StepDataCallback, like this:

import minari

class TTYStepDataCallback(minari.StepDataCallback):
    def __call__(self, env, obs, info, action=None, rew=None, terminated=None, truncated=None):
        # Keep only the tty_* keys; drop everything else before it gets stored.
        tty_filter_keys = ["tty_chars", "tty_colors", "tty_cursor"]
        obs = {k: v for k, v in obs.items() if k in tty_filter_keys}

        step_data = super().__call__(env, obs, info, action, rew, terminated, truncated)
        return step_data

However, this raises an error, as some keys from the original observation space are missing. Why can't I just keep only the keys I want to save? I am logging data during distributed PPO training, which uses the other keys too, so they are needed for online training :(

@Howuhh
Contributor Author

Howuhh commented Apr 8, 2023

For example, if some obs key has dtype int64 while the needed tty_* keys are int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case, and for 10M samples roughly 900GB. With int8, however, it would be of a reasonable size.
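
As a rough illustration of the upcasting (a toy Dict space standing in for the real MiniHack one, so the keys and shapes here are made up):

import numpy as np
import gymnasium as gym
from gymnasium.spaces.utils import flatten

space = gym.spaces.Dict({
    # one wide-dtype key is enough to drag the whole flattened vector up to int64
    "glyphs": gym.spaces.Box(low=0, high=5000, shape=(21, 79), dtype=np.int64),
    "tty_chars": gym.spaces.Box(low=0, high=255, shape=(24, 80), dtype=np.uint8),
})

obs = space.sample()
flat = flatten(space, obs)
print(flat.dtype)   # int64: np.concatenate promotes everything to the widest dtype
print(flat.nbytes)  # 8 bytes per element, even for values that fit in one byte

Stored as int64, every element takes 8 bytes instead of 1, which is where the blow-up in dataset size comes from.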

@rodrigodelazcano
Member

For example, if some obs key has dtype int64 while the needed tty_* keys are int8, everything will be saved flattened with int64 dtype. For a dataset with 500k samples this is 45GB in my case, and for 10M samples roughly 900GB. With int8, however, it would be of a reasonable size.

Is this because the obs space is of dtype int64 or because h5py is storing them with this type by default?

This is something we should look into. The problem is that the obs/actions are automatically flattened, and for that we need to know the spaces: https://gymnasium.farama.org/api/spaces/utils/#gymnasium.spaces.utils.flatten. We can make the flattening optional; the observations would then be saved as a collection of separate datasets, and we would need to modify the sampling function to return dictionaries.

What I don't like is having spaces that differ from the environment's, because of reproducibility. A solution could be to have an action/observation space for the dataset and serialize them into JSON, like we do for the environment.
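
Something along these lines, as a very rough sketch (hypothetical layout and helper names, not the actual Minari format): each key of a dict observation gets its own HDF5 dataset with its native dtype, and a JSON description of the dataset-specific space is stored next to the data.

import json
import h5py
import numpy as np
import gymnasium as gym

def save_dict_observations(path, obs_list, obs_space):
    """Store each key of a Dict observation as a separate HDF5 dataset,
    keeping the per-key dtype instead of upcasting through flattening."""
    with h5py.File(path, "w") as f:
        for key, subspace in obs_space.spaces.items():
            stacked = np.stack([obs[key] for obs in obs_list])
            f.create_dataset(f"observations/{key}", data=stacked, dtype=subspace.dtype)
        # Hypothetical JSON description of the dataset's (possibly reduced) space.
        space_json = {
            key: {"shape": list(sub.shape), "dtype": str(sub.dtype)}
            for key, sub in obs_space.spaces.items()
        }
        f.attrs["observation_space"] = json.dumps(space_json)

# Example: a reduced, dataset-only space containing just the tty_* keys.
tty_space = gym.spaces.Dict({
    "tty_chars": gym.spaces.Box(0, 255, (24, 80), dtype=np.uint8),
    "tty_cursor": gym.spaces.Box(0, 255, (2,), dtype=np.uint8),
})
save_dict_observations("episode_0.hdf5", [tty_space.sample() for _ in range(8)], tty_space)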

@Howuhh
Contributor Author

Howuhh commented Apr 15, 2023

@rodrigodelazcano I think this is because some observations are int64, so when flattening, numpy casts all other dtypes to int64. In that case it does not make much sense to store them with another dtype in h5py, because the precision of the observations that actually need int64 would be lost.

A solution could be to have an action/observation space for the dataset and serialize them into JSON, like we do for the environment

That's exactly what I had in mind when I imagined how it currently works. I think it is important, because all of the really complex environments (which benefit most from having a dataset) have non-trivial spaces, and not all of the keys always need to be saved.

@rodrigodelazcano
Member

I agree; we just need to figure out how to specify and store the obs/action spaces of the dataset and be able to retrieve them when loading it. I'll make a PR with a possible implementation; let me know if you have any ideas.

Also, if we are doing this, I like your idea of storing dictionary spaces without flattening #55 (comment). This is straightforward, since Minari stores nested dictionaries as dataset collections. We can flatten the observations when sampling if required.

@Howuhh
Contributor Author

Howuhh commented Apr 17, 2023

@rodrigodelazcano Saving without flattening is also a good feature in my opinion, as unflattening huge observation spaces on the fly during sampling adds overhead to replay buffer sampling time (not that much, but it adds up over time).

@younik
Member

younik commented Apr 21, 2023

The problem with saving without flattening is that you cannot use a NumPy buffer. This causes a much bigger performance drop than reshaping (which just changes the metadata of the array).

To address the question in this issue, I think you can do it by specifying your observation_space differently.

Nevertheless, I see saving without flattening as helpful, or even necessary, for Sequence spaces.
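
For context, a Sequence space yields variable-length observations, so there is no fixed shape to preallocate a NumPy buffer for. A toy example, not tied to any particular environment:

import numpy as np
import gymnasium as gym

space = gym.spaces.Sequence(gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32))
a, b = space.sample(), space.sample()
print(len(a), len(b))  # typically different lengths, so the samples cannot be stacked
                       # into one fixed-shape array without padding or ragged storage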

@Howuhh
Contributor Author

Howuhh commented Apr 22, 2023

@rodrigodelazcano Unfortunately, as I said above, I can't specify a different observation space for the environment, because the method that collects data during training uses some keys while I save others, and the two sets don't overlap. The environment's space therefore has to stay as it is.

The problem with saving without flattening is that you cannot use a NumPy buffer.

I don't quite understand why this is the case. If you save some custom space, the user can always build their own buffer with separate numpy arrays for each modality and then sample as usual. In the general case, the buffer can work with dictionaries of numpy arrays and be agnostic about the space structure. There would be almost no overhead here; it is still the same sampling, with at most a small loop over the keys (~3-10 iterations, and the number of keys is usually not that big).
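
A minimal sketch of such a space-structure-agnostic buffer (names and details are made up for illustration; it assumes a Dict space of fixed-shape Box subspaces):

import numpy as np

class DictReplayBuffer:
    """One NumPy array per observation key, each kept in its native dtype."""

    def __init__(self, capacity, obs_space):
        self.capacity = capacity
        self.size = 0
        self.pos = 0
        self.obs = {
            key: np.empty((capacity, *sub.shape), dtype=sub.dtype)
            for key, sub in obs_space.spaces.items()
        }

    def add(self, obs):
        for key, value in obs.items():  # the small per-key loop mentioned above
            self.obs[key][self.pos] = value
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.size, size=batch_size)
        return {key: arr[idx] for key, arr in self.obs.items()}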

What bothers me much more (though maybe I am missing something), and why I have now decided not to use Minari for my project any further (and to write my own simplified wrapper instead), is that during flattening all data is cast to the single largest dtype. This can turn a merely big dataset into a massive one, and it makes a real difference.

@rodrigodelazcano
Member

rodrigodelazcano commented Apr 28, 2023

I understand; this makes sense to me, and I didn't think about this issue when I first implemented it. The beta release will be out soon; afterwards we will add the following to our roadmap:

  • specify custom spaces for the datasets (different from the environment's)
  • store unflattened obs/actions and maintain their dtypes; still, I think we should have the option to flatten them during sampling (a rough sketch follows below)
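
A rough idea of what flatten-on-sampling could look like, using gymnasium's flatten utilities (the function name and the dict-of-arrays batch layout are assumptions made for this sketch):

import numpy as np
from gymnasium.spaces.utils import flatten, flatten_space

def maybe_flatten_batch(obs_space, obs_batch, flatten_obs=False):
    """obs_batch: dict mapping each key to an array with a leading batch dimension."""
    if not flatten_obs:
        return obs_batch
    flat_space = flatten_space(obs_space)
    batch_size = next(iter(obs_batch.values())).shape[0]
    out = np.empty((batch_size, flat_space.shape[0]), dtype=flat_space.dtype)
    for i in range(batch_size):
        step_obs = {key: arr[i] for key, arr in obs_batch.items()}
        out[i] = flatten(obs_space, step_obs)  # data stays compact on disk, flattened only here
    return out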

Thanks a lot for the suggestions @Howuhh

@younik
Member

younik commented Apr 28, 2023

the buffer can work with dictionaries of numpy arrays

You are right; it wasn't clear before, thank you for the idea!

@balisujohn
Collaborator

balisujohn commented Jun 17, 2023

@Howuhh We should now support the pattern you need for MiniHack. Please check this test https://github.com/Farama-Foundation/Minari/blob/main/tests/data_collector/callbacks/test_step_data_callback.py for an example of the pattern, and let us know if it is sufficiently expressive for your use cases.
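
Sketching the pattern from memory (the DataCollectorV0 arguments and the callback signature are as I recall them from the linked test, so double-check against it; the MiniHack task id and its gymnasium registration are also assumed):

import gymnasium as gym
import minari
import minihack  # noqa: F401 -- registers the MiniHack-* environment ids
from minari import DataCollectorV0

tty_filter_keys = ["tty_chars", "tty_colors", "tty_cursor"]

class TTYStepDataCallback(minari.StepDataCallback):
    def __call__(self, env, obs, info, action=None, rew=None, terminated=None, truncated=None):
        # Keep only the subset of keys the dataset should contain.
        obs = {k: v for k, v in obs.items() if k in tty_filter_keys}
        return super().__call__(env, obs, info, action, rew, terminated, truncated)

env = gym.make("MiniHack-River-v0")

# The dataset's observation space is just the tty_* subset of the env's Dict space;
# the live env still exposes the full space for online training.
dataset_obs_space = gym.spaces.Dict({k: env.observation_space[k] for k in tty_filter_keys})

env = DataCollectorV0(
    env,
    step_data_callback=TTYStepDataCallback,
    observation_space=dataset_obs_space,
)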

Moreover, we currently officially support Discrete, Box, Dict, and Tuple spaces with arbitrary nesting, with plans to support Text spaces.

@balisujohn
Collaborator

We will be issuing a minor release soon, once we have fixed a few of our first-party datasets to comply with the new standard and slightly improved test coverage.

@Howuhh
Contributor Author

Howuhh commented Jun 19, 2023

@balisujohn Cool! I think you need to fix the documentation for the new format, though.

@balisujohn
Collaborator

We'll have that fixed shortly; the doc build failed.

@balisujohn
Collaborator

@Howuhh https://minari.farama.org/main/content/dataset_standards/ The doc built from main now reflects the new dataset format.

@younik
Member

younik commented Jul 10, 2023

This issue should be solved now, so I am closing it. @Howuhh feel free to reopen it if I am missing something.

@younik younik closed this as completed Jul 10, 2023