NOTE: Run all of the following steps from <project_dir>/multiencoder.
pip install -r ../requirements.txt
To perform the preprocessing of QMSum necessary to reproduce the experiments, follow the instructions in the preprocessing directory.
To convert the above files to a format that can be used by the Segment Encoder, run the following:
python convert_qmsum.py
The output files will be in data/qmsum/preprocessed.
See scripts/train_qmsum_*.sh
bash scripts/select_checkpoints.sh

Copies the best checkpoint for each run (based on mean validation ROUGE) to the selected_checkpoint directory.
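For reference, here is a rough, hypothetical sketch of the kind of selection this script performs, assuming the Huggingface Trainer's trainer_state.json layout and placeholder directory names (the actual logic lives in scripts/select_checkpoints.sh):

import json
import os
import shutil

# Placeholder path to one training run's output_dir; scripts/select_checkpoints.sh
# handles all runs.
run_dir = "output/qmsum_run_1"

# When --metric_for_best_model is set (eval_mean_rouge in our scripts), the most
# recent checkpoint's trainer_state.json records the best checkpoint so far.
latest = max(
    (d for d in os.listdir(run_dir) if d.startswith("checkpoint-")),
    key=lambda d: int(d.split("-")[-1]),
)
with open(os.path.join(run_dir, latest, "trainer_state.json")) as f:
    state = json.load(f)

best_checkpoint = state["best_model_checkpoint"]
shutil.copytree(best_checkpoint, os.path.join("selected_checkpoint", os.path.basename(run_dir)))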
bash scripts/predict_val.sh
Writes out val predictions for all selected checkpoints to selected_checkpoint/predictions.val.
bash scripts/predict_test.sh
Writes out test predictions for all selected checkpoints to selected_checkpoint/predictions.test.
bash scripts/report_rouge_val.sh
Reports mean rouge scores on validation set.
bash scripts/report_rouge_test.sh
Reports mean rouge scores on test set.
Note that these last scripts may prompt you to perform a small number of additional install steps.
We have provided checkpoints for our best performing QMSum-finetuned Segment Encoder model as reported in our paper (Table 5). The hyperparameters of note are:
- Input size: 16384
- Segment length: 512
- Segment overlap: 256
- Initial checkpoint: Wikisum-pretrained
We have included checkpoints for all 5 training runs of the model used in the final evaluation, along with their performance on the validation set:
Run | ROUGE-1 | ROUGE-2 | ROUGE-L | Checkpoint |
---|---|---|---|---|
1 | 38.85 | 13.00 | 34.13 | download |
2 | 38.50 | 12.87 | 33.92 | download |
3 | 38.66 | 13.01 | 34.07 | download |
4 | 38.16 | 12.90 | 33.73 | download |
5 | 38.74 | 12.81 | 34.08 | download |
To use a checkpoint, first download/untar it and then point the --model_name_or_path command-line argument in train.py to the top-level directory of the checkpoint. (See the next section for examples of using train.py to train/evaluate a model.) When using one of our provided checkpoints, also be sure to set the following arguments to be consistent with the fine-tuning hyperparameters:
--multiencoder_max_num_chunks 32 \
--multiencoder_stride \
--max_source_len 512
(For an explanation of the command-line arguments, see the next section.) The example below demonstrates how to evaluate a checkpoint against the validation set. Note that you will first need to perform Steps 1 and 2 from the previous section to populate the data/qmsum/preprocessed/ directory.
python train.py \
--do_predict \
--test_file data/qmsum/preprocessed/val.jsonl \
--model_name_or_path PATH_TO_CHECKPOINT \
--multiencoder_type bart \
--multiencoder_max_num_chunks 32 \
--multiencoder_stride \
--max_source_len 512 \
--output_dir PATH_TO_OUTPUT \
--generation_max_len 256 \
--val_max_target_length 256 \
--per_device_eval_batch_size 1 \
--predict_with_generate \
--prediction_path PATH_TO_PREDICTION_OUTPUT
Note: the ROUGE scores obtained from the above script (based on Huggingface ROUGE implementation) may differ slightly from those reported in the table above (based on SummEval ROUGE implementation, which is consistent with the paper). See discussion of these two implementations below.
The Segment Encoder data loaders expect a .jsonl file, with each line in the following format:
{"source": <full source document>, "query": <optional query>, "target": <summary>}
You will need to execute train.py with the appropriate command-line arguments. Below is a template
for executing train.py based on the hyperparameters for the best-performing model (scripts/train_qmsum_16_512_strided.sh).
You will need to set train_file
and validation_file
to point to .jsonl
files in the format described in Step 1, and output_dir
to point to the directory where the model checkpoints will be saved.
python train.py \
--do_train \
--train_file PATH_TO_TRAIN_FILE \
--do_eval \
--validation_file PATH_TO_VALIDATION_FILE \
--model_name_or_path facebook/bart-large \
--multiencoder_type bart \
--multiencoder_max_num_chunks 32 \
--multiencoder_stride \
--max_source_len 512 \
--learning_rate 0.000005 \
--save_strategy epoch \
--num_train_epochs 10 \
--gradient_checkpointing \
--output_dir PATH_TO_SAVE_MODEL \
--per_device_train_batch_size 1 \
--generation_max_len 256 \
--val_max_target_length 256 \
--evaluation_strategy epoch \
--per_device_eval_batch_size 1 \
--metric_for_best_model eval_mean_rouge \
--compute_rouge_for_train \
--predict_with_generate \
--logging_strategy epoch \
--load_best_model_at_end \
--seed 1
Argument descriptions:
- do_train: Required boolean flag
- train_file: Path to your training file (in the .jsonl format described above)
- do_eval: Boolean flag to evaluate the model on the validation set during training
- validation_file: Path to your optional validation file (in the .jsonl format described above)
- model_name_or_path: Name of or path to a Huggingface model (recommend facebook/bart-large). Currently only BART checkpoints are supported.
- multiencoder_type: Set to bart
- multiencoder_max_num_chunks: Number of segments
- multiencoder_stride: Boolean flag to use 50%-overlap strides in segmentation. If not set, segments will be disjoint, which may degrade model performance.
- max_source_len: Segment length
- learning_rate: Learning rate (recommend 0.000005 if replicating the paper experiments)
- save_strategy: Set to epoch to save a checkpoint at the end of each epoch
- num_train_epochs: Number of epochs
- gradient_checkpointing (recommended for larger models): Boolean flag to turn on gradient checkpointing, which reduces the memory footprint but increases compute. This may be necessary for some models depending on the number of segments, segment size, and available GPU memory.
- output_dir: Output directory for saved model checkpoints and logs
- per_device_train_batch_size: Batch size, typically 1 for larger models
- generation_max_len and val_max_target_length: Set to the maximum target length
- evaluation_strategy: Set to epoch if you wish to evaluate at the end of each epoch
- per_device_eval_batch_size: Evaluation batch size, typically 1 for larger models
- metric_for_best_model (see also compute_rouge_for_train and predict_with_generate below): Set to eval_mean_rouge (recommended) if you wish to use mean ROUGE as the criterion for selecting the best checkpoint. Leave off to use cross entropy.
- compute_rouge_for_train: Include if you wish to compute ROUGE as part of evaluation during training (necessary if metric_for_best_model=eval_mean_rouge)
- predict_with_generate: Required boolean flag if compute_rouge_for_train is set
- logging_strategy: Set to epoch to log results at the end of each epoch
- overwrite_output_dir: Boolean flag to overwrite the output directory across multiple runs
- load_best_model_at_end: Boolean flag to load the best checkpoint at the end of training
- seed: Optional random seed
- Optionally, other arguments for the Huggingface Seq2SeqTrainer specified in Seq2SeqTrainingArguments
See train.py for documentation on other arguments. Note that train.py is based on the standard HuggingFace training script for summarization, and uses many of the same command-line arguments.
There are two main options for evaluation, described below.

The first option uses the Huggingface ROUGE implementation, which relies on datasets.load_metric().
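For reference, a rough sketch of what that metric computation looks like when called directly; the file names are placeholders and this is not the exact code path in train.py (the metric also requires the rouge_score package):

from datasets import load_metric

rouge = load_metric("rouge")

# Placeholder files: one prediction/reference per line.
with open("predictions.txt") as f:
    predictions = [line.strip() for line in f]
with open("references.txt") as f:
    references = [line.strip() for line in f]

scores = rouge.compute(predictions=predictions, references=references)
# Each entry is an AggregateScore; report the mid F-measure, scaled to 0-100.
print({name: round(agg.mid.fmeasure * 100, 2) for name, agg in scores.items()})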
To use this option, run train.py with the appropriate arguments for testing. Below is an example template consistent with the training template from Step 2:
python train.py \
--do_predict \
--test_file PATH_TO_TEST_FILE \
--model_name_or_path PATH_TO_SAVE_MODEL \
--multiencoder_type bart \
--multiencoder_max_num_chunks 32 \
--multiencoder_stride \
--max_source_len 512 \
--output_dir PATH_TO_TEST_OUTPUT \
--generation_max_len 256 \
--val_max_target_length 256 \
--per_device_eval_batch_size 1 \
--predict_with_generate \
--prediction_path PATH_TO_PREDICTION_OUTPUT
You will need to set test_file to a test file in the .jsonl format described in Step 1. Set model_name_or_path to the top-level PATH_TO_SAVE_MODEL specified in the training script; this top-level directory holds the best-performing checkpoint according to the metric_for_best_model argument to the training script. Set output_dir to the directory where testing outputs will go and prediction_path to the file where generated predictions will go.

If you change any model parameters in the training script, be sure to update the corresponding arguments in the test script (e.g., number of segments, segment length).
The second option uses the SummEval implementation, which relies on the original Perl script for computing ROUGE. To run this, you will first need to run the test script above, and then run report_rouge.py on the generated predictions from the test script. You can see examples of this in Steps 5-6 of the Reproducing Experiments section.
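If you prefer to call the SummEval ROUGE wrapper directly rather than through report_rouge.py, the usage looks roughly like the sketch below. This is an assumption about the summ_eval API (it requires the summ_eval package and its ROUGE-1.5.5 Perl setup); the supported path is report_rouge.py.

from summ_eval.rouge_metric import RougeMetric  # assumed import path

rouge = RougeMetric()
summaries = ["generated summary text ..."]   # placeholder predictions
references = ["reference summary text ..."]  # placeholder gold targets

# evaluate_batch returns aggregated ROUGE scores over the batch.
print(rouge.evaluate_batch(summaries, references, aggregate=True))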