
[BUG] Training never starts on TFT/Progress bar not working #830

Closed
strakehyr opened this issue Mar 1, 2022 · 17 comments
Labels
bug (Something isn't working) · q&a (Frequent question & answer) · triage (Issue waiting for triaging)

Comments

@strakehyr

Describe the bug
I use a dataset composed of 20 features and a single target. All of the features are future covariates. I use the target's past as well as the features' history as past covariates. To the covariates, I add datetime attributes (year, month, day of week, hour) and holidays. The dataset has several years of hourly data; I also tried cutting down the samples to check if it made a difference. I am successfully using the same dataset with other (non-Darts) models and getting good results.

To Reproduce

import pandas as pd
import torch

from darts import TimeSeries
from darts.dataprocessing.transformers import Scaler
from darts.models import TFTModel
from darts.utils.likelihood_models import QuantileRegression
from darts.utils.timeseries_generation import datetime_attribute_timeseries

train_ratio = 0.90
look_back = 192
horizon = 192
n_outputs = 1

# file_path and country are user-supplied
df = pd.read_csv(file_path, index_col=0)

training_cutoff = pd.Timestamp(df['Time'].iloc[round(len(df)*train_ratio)])
series = TimeSeries.from_dataframe(df, 'Time', value_cols = df.columns[1:])
train, val = series.split_after(training_cutoff)

scaler = Scaler()
train_transformed = scaler.fit_transform(train)
val_transformed = scaler.transform(val)
series_transformed = scaler.transform(series)

trgt_scaler = Scaler()
trgt_transformed = trgt_scaler.fit_transform(series['target'])

covariates = datetime_attribute_timeseries(series, attribute='year', one_hot=False)
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute='month', one_hot=False))
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute='day_of_week', one_hot=False))
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute='hour', one_hot=False))
covariates = covariates.add_holidays(country)
f_covariates = covariates.stack(TimeSeries.from_times_and_values(times=series.time_index,
                                                                 values=df.iloc[:, 1+n_outputs:].to_numpy(),
                                                                 columns=series.columns[n_outputs:]))
p_covariates = covariates.stack(TimeSeries.from_times_and_values(times=series.time_index,
                                                                 values=df.iloc[:, 1:].to_numpy(),
                                                                 columns=series.columns))

scaler_f_covs = Scaler()
f_cov_train, f_cov_val = f_covariates.split_after(training_cutoff)
scaler_f_covs.fit(f_cov_train)
f_covariates_transformed = scaler_f_covs.transform(f_covariates)

scaler_p_covs = Scaler()
p_cov_train, p_cov_val = p_covariates.split_after(training_cutoff)
scaler_p_covs.fit(p_cov_train)
p_covariates_transformed = scaler_p_covs.transform(p_covariates)

quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
model = TFTModel(input_chunk_length=look_back,
                 output_chunk_length=horizon,
                 hidden_size=32,
                 lstm_layers=1,
                 full_attention=True,
                 dropout=0.1,
                 num_attention_heads=4,
                 batch_size=32,
                 n_epochs=250,
                 add_relative_index=False,
                 add_encoders=None,
                 # likelihood=None,
                 # loss_fn=MSELoss(),
                 likelihood=QuantileRegression(quantiles=quantiles),  # QuantileRegression is set per default
                 force_reset=True,
                 pl_trainer_kwargs={"accelerator": "gpu", "gpus": [0],
                                    "enable_progress_bar": True, "enable_model_summary": True},
                 optimizer_cls=torch.optim.SGD,
                 optimizer_kwargs={'lr': 0.01})

model.fit(train_transformed['target'],
          future_covariates=f_covariates_transformed,
          past_covariates=p_covariates_transformed)


Expected behavior
Training should start and run through the epochs. Instead, it gets stuck and never completes a single epoch.

System:

  • Python version: 3.9
  • darts version: 0.17.0
@strakehyr strakehyr added the bug (Something isn't working) and triage (Issue waiting for triaging) labels on Mar 1, 2022
@hrzn
Contributor

hrzn commented Mar 1, 2022

Hi, do you see any progress with the PyTorch Lightning progress bar, or is it not moving at all?

@strakehyr
Author

Hi, do you see any progress with the PyTorch Lightning progress bar, or is it not moving at all?

It's not moving at all.

@dennisbader
Collaborator

I could not reproduce the issue (on version 0.17.1). Could you try running it on CPU instead of GPU? With:

pl_trainer_kwargs={
    "accelerator": "cpu", 
    "gpus": None, 
    "auto_select_gpus": False,
    "enable_progress_bar" : True, 
    "enable_model_summary" : True
}

I'd also be interested to know whether you still get the issue on version 0.17.1.

@strakehyr
Author

Hi,
I have tried both options: running with CPU, and on the new version 0.17.1 with both GPU and CPU. With all the options it gets stuck at the beginning, like so:

140 K     Trainable params
0         Non-trainable params
140 K     Total params
1.126     Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]

Could this be too many parameters? Does training usually work smoothly on a model this big? Or is it perhaps too many samples?

@dennisbader
Collaborator

The model trains on batches, and the progress bar should by default get updated after every batch rather than after every epoch. This is customizable through pl_trainer_kwargs with the key "progress_bar_refresh_rate".
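
For example (a minimal sketch; "progress_bar_refresh_rate" was a Trainer argument in the PyTorch Lightning versions current at the time, and has since been removed in favor of progress-bar callbacks):

model = TFTModel(input_chunk_length=look_back,
                 output_chunk_length=horizon,
                 pl_trainer_kwargs={"enable_progress_bar": True,
                                    "progress_bar_refresh_rate": 1})  # redraw after every batch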

How large is your dataset? Do you have any memory issues?

@strakehyr
Author

How large is your dataset? Do you have any memory issues?

It has over 31k samples with 22 features and a single target, all set up for a 200-timestep lookback and a 200-timestep horizon. The dataset works perfectly and quickly with other PyTorch-based libraries, so I don't think there is a memory issue. Also, I have tried cutting down on features and samples and it still does not work.

@hrzn
Contributor

hrzn commented Mar 21, 2022

31k timestamps may represent a huge number of training samples (the input/output subslices obtained from your series) by default. Could you try limiting the number of training samples by providing, e.g., max_samples_per_ts=1 to the fit() function, and check if the problem persists? This trivially limits the number of training samples by considering only the last (input, output) slice of each series, as in the sketch below.
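
A sketch based on the fit() call from the issue above:

# keep only the last (input, output) training slice per series
model.fit(train_transformed['target'],
          future_covariates=f_covariates_transformed,
          past_covariates=p_covariates_transformed,
          max_samples_per_ts=1)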

@strakehyr
Author

It's composed of 31k timestamps, not series. As I stated, I already tried slicing the features as well as the samples, and it does not seem to make a difference.

@hrzn
Contributor

hrzn commented Mar 21, 2022

@strakehyr Do you see any training dataset length being printed when you launch the training (before it hangs)?

@hrzn
Contributor

hrzn commented Mar 21, 2022

It's composed of 31k timestamps, not series. As I stated, I already tried slicing the features as well as the samples, and it does not seem to make a difference.

Can you still try to use max_samples_per_ts=1 in the fit() call and tell if the problem persists?

@strakehyr
Author

@strakehyr Do you see any training dataset length being printed when you launch the training (before it hangs)?

The following gets printed (sliced dataset):

[2022-03-21 10:47:02,670] INFO | darts.models.forecasting.torch_forecasting_model | Train dataset contains 5259 samples.
[2022-03-21 10:47:02,670] INFO | darts.models.forecasting.torch_forecasting_model | Train dataset contains 5259 samples.
INFO:darts.models.forecasting.torch_forecasting_model:Train dataset contains 5259 samples.
[2022-03-21 10:47:02,716] INFO | darts.models.forecasting.torch_forecasting_model | Time series values are 64-bits; casting model to float64.
[2022-03-21 10:47:02,716] INFO | darts.models.forecasting.torch_forecasting_model | Time series values are 64-bits; casting model to float64.
INFO:darts.models.forecasting.torch_forecasting_model:Time series values are 64-bits; casting model to float64.
INFO:pytorch_lightning.utilities.distributed:GPU available: True, used: True
INFO:pytorch_lightning.utilities.distributed:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.distributed:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
   | Name                              | Type                             | Params
----------------------------------------------------------------------------------------
0  | static_covariates_vsn             | _VariableSelectionNetwork        | 0     
1  | encoder_vsn                       | _VariableSelectionNetwork        | 60.8 K
2  | decoder_vsn                       | _VariableSelectionNetwork        | 27.4 K
3  | static_context_grn                | _GatedResidualNetwork            | 4.3 K 
4  | static_context_hidden_encoder_grn | _GatedResidualNetwork            | 4.3 K 
5  | static_context_cell_encoder_grn   | _GatedResidualNetwork            | 4.3 K 
6  | static_context_enrichment         | _GatedResidualNetwork            | 4.3 K 
7  | lstm_encoder                      | LSTM                             | 8.4 K 
8  | lstm_decoder                      | LSTM                             | 8.4 K 
9  | post_lstm_gan                     | _GateAddNorm                     | 2.2 K 
10 | static_enrichment_grn             | _GatedResidualNetwork            | 5.3 K 
11 | multihead_attn                    | _InterpretableMultiHeadAttention | 2.6 K 
12 | post_attn_gan                     | _GateAddNorm                     | 2.2 K 
13 | positionwise_feedforward_grn      | _GatedResidualNetwork            | 4.3 K 
14 | pre_output_gan                    | _GateAddNorm                     | 2.2 K 
15 | output_layer                      | Linear                           | 561   
----------------------------------------------------------------------------------------
141 K     Trainable params
0         Non-trainable params
141 K     Total params
1.129     Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]

Can you still try to use max_samples_per_ts=1 in the fit() call and tell if the problem persists?

When attempting this, it returns:

<darts.models.forecasting.tft_model.TFTModel at 0x19277e910d0>

However, nothing is shown: no parameter summary, no loss, no epochs. It just returns that object.

@strakehyr
Author

I believe the problem is caused by something in my configuration, as I am running into the same thing with the Air Passengers example (https://unit8co.github.io/darts/examples/13-TFT-examples.html?highlight=tft#Air-Passenger-Example), again getting <darts.models.forecasting.tft_model.TFTModel at 0x19277807ee0>.

Any idea what the root cause might be?

@dennisbader
Collaborator

I think I found a solution to this issue. It seems to be an ipywidgets issue when using Jupyter Notebook, see here.

Following the ipywidgets docs (see here), running the following command from bash in the environment with the darts package fixed it for me.

jupyter nbextension enable --py widgetsnbextension

Could anyone test this to confirm?
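
As a quick check that notebook widgets render at all (a minimal sketch; PyTorch Lightning's default progress bar is tqdm-based, and tqdm.auto uses the ipywidgets frontend in notebooks when available):

from tqdm.auto import tqdm  # falls back to a plain text bar if widgets are unavailable

# if this bar stays frozen at 0, the widgets frontend is broken
for _ in tqdm(range(100)):
    pass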

@strakehyr
Author

I think I found a solution to this issue. It seems to be an ipywidgets issue when using Jupyter Notebook, see here.

Following the ipywidgets docs (see here), running the following command from bash in the environment with the darts package fixed it for me.

jupyter nbextension enable --py widgetsnbextension

Could anyone test this to confirm?

I use an IDE for my code, so I doubt this was the issue.

@dennisbader
Collaborator

True, in that case could you try uninstalling ipywidgets?
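
For example (assuming a pip-managed environment):

pip uninstall ipywidgets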

@strakehyr
Author

jupyter nbextension enable --py widgetsnbextension

Thank you, this completely solved it.

@dennisbader
Collaborator

Small update: the command below should work as well and show the progress bar as intended.

pip install -U ipywidgets

@dennisbader dennisbader added the q&a (Frequent question & answer) label on Sep 28, 2023
@dennisbader dennisbader changed the title from "[BUG] Training never starts on TFT" to "[BUG] Training never starts on TFT/Progress bar not working" on Sep 28, 2023