
[BUG] Training never starts on TFT/Progress bar not working #830

Closed
strakehyr opened this issue Mar 1, 2022 · 17 comments
Labels
bug (Something isn't working) · q&a (Frequent question & answer) · triage (Issue waiting for triaging)

Comments

@strakehyr

Describe the bug
I use a dataset composed of 20 features and a single target. All of the features are future covariates. I use the target's past as well as the features' history as past covariates. To the covariates, I add datetime attributes (year, month, day of week, hour) and holidays. The dataset has several years of hourly data; I also tried cutting down the samples to check if it made a difference. I am successfully using the same dataset with other (non-Darts) models and getting good results.

To Reproduce

import pandas as pd
import torch

from darts import TimeSeries
from darts.dataprocessing.transformers import Scaler
from darts.models import TFTModel
from darts.utils.likelihood_models import QuantileRegression
from darts.utils.timeseries_generation import datetime_attribute_timeseries

train_ratio = 0.90
look_back = 192
horizon = 192
n_outputs = 1

# file_path and country are user-supplied
df = pd.read_csv(file_path, index_col=0)

training_cutoff = pd.Timestamp(df['Time'].iloc[round(len(df)*train_ratio)])
series = TimeSeries.from_dataframe(df, 'Time', value_cols = df.columns[1:])
train, val = series.split_after(training_cutoff)

scaler = Scaler()
train_transformed = scaler.fit_transform(train)
val_transformed = scaler.transform(val)
series_transformed = scaler.transform(series)

trgt_scaler = Scaler()
trgt_transformed = trgt_scaler.fit_transform(series['target'])

covariates = datetime_attribute_timeseries(series, attribute='year', one_hot=False)
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute='month', one_hot=False))
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute='day_of_week', one_hot=False))
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute='hour', one_hot=False))
covariates = covariates.add_holidays(country)
f_covariates = covariates.stack(TimeSeries.from_times_and_values(times=series.time_index,
                                                                 values=df.iloc[:, 1+n_outputs:].to_numpy(),
                                                                 columns=series.columns[n_outputs:]))
p_covariates = covariates.stack(TimeSeries.from_times_and_values(times=series.time_index,
                                                                 values=df.iloc[:, 1:].to_numpy(),
                                                                 columns=series.columns))

scaler_f_covs = Scaler()
f_cov_train, f_cov_val = f_covariates.split_after(training_cutoff)
scaler_f_covs.fit(f_cov_train)
f_covariates_transformed = scaler_f_covs.transform(f_covariates)

scaler_p_covs = Scaler()
p_cov_train, p_cov_val = p_covariates.split_after(training_cutoff)
scaler_p_covs.fit(p_cov_train)
p_covariates_transformed = scaler_p_covs.transform(p_covariates)

quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
model = TFTModel(input_chunk_length=look_back,
                 output_chunk_length=horizon,
                 hidden_size=32,
                 lstm_layers=1,
                 full_attention=True,
                 dropout=0.1,
                 num_attention_heads=4,
                 batch_size=32,
                 n_epochs=250,
                 add_relative_index=False,
                 add_encoders=None,
                 # likelihood=None,
                 # loss_fn=MSELoss(),
                 likelihood=QuantileRegression(quantiles=quantiles),  # QuantileRegression is set per default
                 force_reset=True,
                 pl_trainer_kwargs={"accelerator": "gpu", "gpus": [0],
                                    "enable_progress_bar": True, "enable_model_summary": True},
                 optimizer_cls=torch.optim.SGD,
                 optimizer_kwargs={'lr': 0.01})

model.fit(train_transformed['target'],
          future_covariates=f_covariates_transformed,
          past_covariates=p_covariates_transformed)


Expected behavior
Training should start and run through the epochs. Instead, it gets stuck and never completes a single epoch.

System:

  • Python version: 3.9
  • darts version: 0.17.0
@strakehyr strakehyr added the bug (Something isn't working) and triage (Issue waiting for triaging) labels on Mar 1, 2022
@hrzn
Contributor

hrzn commented Mar 1, 2022

Hi, do you see any progress with the PyTorch Lightning progress bar, or is it not moving at all?

@strakehyr
Author

Hi, do you see any progress with the PyTorch Lightning progress bar, or is it not moving at all?

It's not moving at all.

@dennisbader
Collaborator

I could not reproduce the issue (on version 0.17.1). Could you try running it on CPU instead of GPU? With:

pl_trainer_kwargs={
    "accelerator": "cpu", 
    "gpus": None, 
    "auto_select_gpus": False,
    "enable_progress_bar" : True, 
    "enable_model_summary" : True
}

I'd also be interested to know whether you still get the issue on version 0.17.1.

@strakehyr
Author

Hi,
I have tried both options: running with CPU, and on the new version 0.17.1 with both GPU and CPU. With all the options it gets stuck at the beginning, like so:

140 K     Trainable params
0         Non-trainable params
140 K     Total params
1.126     Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]

Could this be too many parameters? Does training usually work smoothly on a model this big? Or is it perhaps too many samples?

@dennisbader
Collaborator

The model trains on batches, and the progress bar should by default get updated after every batch rather than after every epoch. This is customizable through pl_trainer_kwargs with the key "progress_bar_refresh_rate".
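
For example (a minimal sketch; "progress_bar_refresh_rate" was a Trainer argument in the PyTorch Lightning versions current at the time, and has since been removed in favor of progress-bar callbacks):

model = TFTModel(input_chunk_length=look_back,
                 output_chunk_length=horizon,
                 pl_trainer_kwargs={"enable_progress_bar": True,
                                    "progress_bar_refresh_rate": 1})  # redraw after every batch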

How large is your dataset? Do you have any memory issues?

@strakehyr
Author

How large is your dataset? Do you have any memory issues?

It has over 31k samples with 22 features and a single target, all set up for a 200-timestep lookback and a 200-timestep horizon. The dataset works perfectly and quickly with other PyTorch-based libraries, so I don't think there is a memory issue. Also, I have tried cutting down on features and samples and it still does not work.

@hrzn
Contributor

hrzn commented Mar 21, 2022

31k timestamps may represent a huge number of training samples (the input/output subslices obtained from your series) by default. Could you try limiting the number of training samples by providing, e.g., max_samples_per_ts=1 to the fit() function, and check if the problem persists? This trivially limits the number of training samples by considering only the last (input, output) slice of each series, as in the sketch below.
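
A sketch based on the fit() call from the issue above:

# keep only the last (input, output) training slice per series
model.fit(train_transformed['target'],
          future_covariates=f_covariates_transformed,
          past_covariates=p_covariates_transformed,
          max_samples_per_ts=1)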

@strakehyr
Author

It's composed of 31k timestamps, not series. As I stated, I already tried slicing the features as well as the samples, and it does not seem to make a difference.

@hrzn
Contributor

hrzn commented Mar 21, 2022

@strakehyr Do you see any training dataset length being printed when you launch the training (before it hangs)?

@hrzn
Contributor

hrzn commented Mar 21, 2022

It's composed of 31k timestamps, not series. As I stated, I already tried slicing the features as well as the samples, and it does not seem to make a difference.

Can you still try to use max_samples_per_ts=1 in the fit() call and tell if the problem persists?

@strakehyr
Author

@strakehyr Do you see any training dataset length being printed when you launch the training (before it hangs)?

The following gets printed (sliced dataset):

[2022-03-21 10:47:02,670] INFO | darts.models.forecasting.torch_forecasting_model | Train dataset contains 5259 samples.
[2022-03-21 10:47:02,670] INFO | darts.models.forecasting.torch_forecasting_model | Train dataset contains 5259 samples.
INFO:darts.models.forecasting.torch_forecasting_model:Train dataset contains 5259 samples.
[2022-03-21 10:47:02,716] INFO | darts.models.forecasting.torch_forecasting_model | Time series values are 64-bits; casting model to float64.
[2022-03-21 10:47:02,716] INFO | darts.models.forecasting.torch_forecasting_model | Time series values are 64-bits; casting model to float64.
INFO:darts.models.forecasting.torch_forecasting_model:Time series values are 64-bits; casting model to float64.
INFO:pytorch_lightning.utilities.distributed:GPU available: True, used: True
INFO:pytorch_lightning.utilities.distributed:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.distributed:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
   | Name                              | Type                             | Params
----------------------------------------------------------------------------------------
0  | static_covariates_vsn             | _VariableSelectionNetwork        | 0     
1  | encoder_vsn                       | _VariableSelectionNetwork        | 60.8 K
2  | decoder_vsn                       | _VariableSelectionNetwork        | 27.4 K
3  | static_context_grn                | _GatedResidualNetwork            | 4.3 K 
4  | static_context_hidden_encoder_grn | _GatedResidualNetwork            | 4.3 K 
5  | static_context_cell_encoder_grn   | _GatedResidualNetwork            | 4.3 K 
6  | static_context_enrichment         | _GatedResidualNetwork            | 4.3 K 
7  | lstm_encoder                      | LSTM                             | 8.4 K 
8  | lstm_decoder                      | LSTM                             | 8.4 K 
9  | post_lstm_gan                     | _GateAddNorm                     | 2.2 K 
10 | static_enrichment_grn             | _GatedResidualNetwork            | 5.3 K 
11 | multihead_attn                    | _InterpretableMultiHeadAttention | 2.6 K 
12 | post_attn_gan                     | _GateAddNorm                     | 2.2 K 
13 | positionwise_feedforward_grn      | _GatedResidualNetwork            | 4.3 K 
14 | pre_output_gan                    | _GateAddNorm                     | 2.2 K 
15 | output_layer                      | Linear                           | 561   
----------------------------------------------------------------------------------------
141 K     Trainable params
0         Non-trainable params
141 K     Total params
1.129     Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]

Can you still try to use max_samples_per_ts=1 in the fit() call and tell if the problem persists?

When attempting this, it returns:

<darts.models.forecasting.tft_model.TFTModel at 0x19277e910d0>

However, nothing is shown: no parameter summary, no loss, no epochs. It just returns that object.

@strakehyr
Author

I believe the problem is caused by something in my configuration, as I am running into the same thing with the Air Passengers example (https://unit8co.github.io/darts/examples/13-TFT-examples.html?highlight=tft#Air-Passenger-Example), again getting <darts.models.forecasting.tft_model.TFTModel at 0x19277807ee0>.

Any idea what the root cause might be?

@dennisbader
Collaborator

I think I found a solution to this issue. It seems to be an ipywidgets issue when using Jupyter Notebook, see here.

Following the ipywidgets docs (see here), running the following command from bash in the environment with the darts package fixed it for me.

jupyter nbextension enable --py widgetsnbextension

Could anyone test this to confirm?
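
As a quick check that notebook widgets render at all (a minimal sketch; PyTorch Lightning's default progress bar is tqdm-based, and tqdm.auto uses the ipywidgets frontend in notebooks when available):

from tqdm.auto import tqdm  # falls back to a plain text bar if widgets are unavailable

# if this bar stays frozen at 0, the widgets frontend is broken
for _ in tqdm(range(100)):
    pass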

@strakehyr
Author

I think I found a solution to this issue. It seems to be an ipywidgets issue when using Jupyter Notebook, see here.

Following the ipywidgets docs (see here), running the following command from bash in the environment with the darts package fixed it for me.

jupyter nbextension enable --py widgetsnbextension

Could anyone test this to confirm?

I use an IDE for my code, so I doubt this was the issue.

@dennisbader
Collaborator

True, in that case could you try uninstalling ipywidgets?
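
For example (assuming a pip-managed environment):

pip uninstall ipywidgets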

@strakehyr
Author

jupyter nbextension enable --py widgetsnbextension

Thank you, this completely solved it.

@dennisbader
Collaborator

Small update: the command below should work as well and show the progress bar as intended.

pip install -U ipywidgets

@dennisbader dennisbader added the q&a (Frequent question & answer) label on Sep 28, 2023
@dennisbader dennisbader changed the title from "[BUG] Training never starts on TFT" to "[BUG] Training never starts on TFT/Progress bar not working" on Sep 28, 2023