Segment Encoder

NOTE: Run all of the following steps from <project_dir>/multiencoder.

Install the required packages:

pip install -r ../requirements.txt

Reproducing QMSum experiments

1. Preprocess QMSum

To perform the preprocessing of QMSum necessary to reproduce the experiments, follow the instructions in the preprocessing directory.

2. Convert to Segment Encoder format

To convert the above files to a format that the Segment Encoder can use, run the following:


The output files will be in data/qmsum/preprocessed.

3. Train models

See scripts/train_qmsum_*.sh

4. Choose checkpoint for each run

bash scripts/

Copies the best checkpoint for each run (based on mean validation ROUGE) to the selected_checkpoint directory.
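The selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `select_best_checkpoint` helper, the layout of the `scores` mapping, and the checkpoint names are hypothetical, and "mean validation ROUGE" is assumed here to be the average of ROUGE-1, ROUGE-2, and ROUGE-L. The actual script may compute it differently.

```python
from statistics import mean

def select_best_checkpoint(scores):
    """Return the checkpoint name with the highest mean ROUGE.

    `scores` maps a checkpoint name to a dict of ROUGE values
    (hypothetical layout, not the repository's own data structure).
    """
    if not scores:
        raise ValueError("no checkpoints to choose from")
    return max(scores, key=lambda name: mean(scores[name].values()))

# Made-up validation scores, for illustration only
example_scores = {
    "checkpoint-100": {"rouge1": 37.0, "rouge2": 12.0, "rougeL": 33.0},
    "checkpoint-200": {"rouge1": 38.0, "rouge2": 13.0, "rougeL": 34.0},
    "checkpoint-300": {"rouge1": 37.5, "rouge2": 12.5, "rougeL": 33.5},
}
best = select_best_checkpoint(example_scores)
print(best)
```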

5. Generate predictions from selected checkpoints

bash scripts/

Writes validation predictions for all selected checkpoints to selected_checkpoint/predictions.val.

bash scripts/

Writes test predictions for all selected checkpoints to selected_checkpoint/predictions.test.

6. Report ROUGE scores of all checkpoints

bash scripts/

Reports mean ROUGE scores on the validation set.

bash scripts/

Reports mean ROUGE scores on the test set.

Note that the scripts in this step may prompt you to complete a small number of additional installation steps.

Pretrained models

We have provided checkpoints for our best-performing QMSum-finetuned Segment Encoder model, as reported in our paper (Table 5). The hyperparameters of note are:

  • Input size: 16384
  • Segment length: 512
  • Segment overlap: 256
  • Initial checkpoint: Wikisum-pretrained

Downloading checkpoints

We have included checkpoints for all 5 training runs of the model used in the final evaluation, along with their performance on the validation set:

Run  ROUGE-1  ROUGE-2  ROUGE-L  Checkpoint
1    38.85    13.00    34.13    download
2    38.50    12.87    33.92    download
3    38.66    13.01    34.07    download
4    38.16    12.90    33.73    download
5    38.74    12.81    34.08    download
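For a quick sense of the spread across runs, the per-metric averages can be computed from the table above. The snippet below is a convenience sketch only. The averages it prints are derived here from the listed validation scores and are not figures quoted from the paper.

```python
from statistics import mean

# Validation scores copied from the table above: (ROUGE-1, ROUGE-2, ROUGE-L) per run
runs = [
    (38.85, 13.00, 34.13),
    (38.50, 12.87, 33.92),
    (38.66, 13.01, 34.07),
    (38.16, 12.90, 33.73),
    (38.74, 12.81, 34.08),
]
names = ("ROUGE-1", "ROUGE-2", "ROUGE-L")

# zip(*runs) transposes the rows into one column of values per metric
averages = {name: mean(column) for name, column in zip(names, zip(*runs))}
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```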

Using checkpoints

To use a checkpoint, first download and untar it, then point the --model_name_or_path command-line argument to the top-level directory of the checkpoint. (See the next section for examples of training and evaluating a model.) When using one of our provided checkpoints, also set the following arguments to be consistent with the fine-tuning hyperparameters:

--multiencoder_max_num_chunks 32 \
--multiencoder_stride \
--max_source_len 512

(For an explanation of the command-line arguments, see next section.)
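If you prefer to unpack a downloaded checkpoint from Python rather than the shell, the untar step might look like the sketch below. The `untar_checkpoint` helper and any archive or directory names you pass to it are hypothetical. The returned top-level directory is what --model_name_or_path would point to.

```python
import tarfile
from pathlib import Path

def untar_checkpoint(archive_path, dest_dir):
    """Extract a checkpoint archive and return its top-level directories."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:  # compression is auto-detected
        # Collect the first path component of every member in the archive
        top_level = sorted(
            {Path(m.name).parts[0] for m in tar.getmembers() if Path(m.name).parts}
        )
        tar.extractall(dest)
    return [dest / name for name in top_level]

# Example call (file names are placeholders):
# checkpoint_dirs = untar_checkpoint("checkpoint.tar.gz", "checkpoints")
```

Only extract archives from a source you trust, since tar files can contain arbitrary paths.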


The example below demonstrates how to evaluate a checkpoint against the validation set. Note that you will first need to perform Steps 1 and 2 from the previous section to populate the data/qmsum/preprocessed/ directory.

python \
  --do_predict \
  --test_file data/qmsum/preprocessed/val.jsonl \
  --model_name_or_path PATH_TO_CHECKPOINT \
  --multiencoder_type bart \
  --multiencoder_max_num_chunks 32 \
  --multiencoder_stride \
  --max_source_len 512 \
  --output_dir PATH_TO_OUTPUT \
  --generation_max_len 256 \
  --val_max_target_length 256 \
  --per_device_eval_batch_size 1 \
  --predict_with_generate \
  --prediction_path PATH_TO_PREDICTION_OUTPUT

Note: the ROUGE scores obtained from the above script (based on the HuggingFace ROUGE implementation) may differ slightly from those reported in the table above (based on the SummEval ROUGE implementation, which is consistent with the paper). See the discussion of these two implementations below.

Running on your own datasets

1. Prepare data in appropriate format

The Segment Encoder data loaders expect a .jsonl file, with each line in the following format:

{"source": <full source document>, "query": <optional query>, "target": <summary>}
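As a rough illustration of producing lines in this format, the sketch below serializes examples with Python's standard json module. The `to_jsonl_line` helper and the sample texts are made up for this example, and leaving out the "query" key when there is no query is an assumption about how the loaders treat the optional field.

```python
import json

def to_jsonl_line(source, target, query=None):
    """Serialize one example as a single-line JSON object."""
    record = {"source": source}
    if query is not None:
        record["query"] = query  # optional field, omitted when absent (assumption)
    record["target"] = target
    # json.dumps escapes embedded newlines, so each record stays on one line
    return json.dumps(record, ensure_ascii=False)

# Placeholder examples: (source, target, query)
examples = [
    ("Full text of the first document ...", "Summary of the first document.", "What was decided?"),
    ("Full text of the second document ...", "Summary of the second document.", None),
]
lines = [to_jsonl_line(src, tgt, qry) for src, tgt, qry in examples]
print("\n".join(lines))
```

Writing these lines to a file, one per line, gives a .jsonl file of the shape described above.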

2. Train your model

You will need to run the training script with the appropriate command-line arguments. Below is a template based on the hyperparameters for the best-performing model (see scripts/). You will need to set train_file and validation_file to point to .jsonl files in the format described in Step 1, and output_dir to point to the directory where the model checkpoints will be saved.

python \
  --do_train \
  --train_file PATH_TO_TRAIN_FILE \
  --do_eval \
  --validation_file PATH_TO_VALIDATION_FILE \
  --model_name_or_path facebook/bart-large \
  --multiencoder_type bart \
  --multiencoder_max_num_chunks 32 \
  --multiencoder_stride \
  --max_source_len 512 \
  --learning_rate 0.000005 \
  --save_strategy epoch \
  --num_train_epochs 10 \
  --gradient_checkpointing \
  --output_dir PATH_TO_SAVE_MODEL \
  --per_device_train_batch_size 1 \
  --generation_max_len 256 \
  --val_max_target_length 256 \
  --evaluation_strategy epoch \
  --per_device_eval_batch_size 1 \
  --metric_for_best_model eval_mean_rouge \
  --compute_rouge_for_train \
  --predict_with_generate \
  --logging_strategy epoch \
  --load_best_model_at_end \
  --seed 1

Argument descriptions:

  • do_train: Required boolean flag
  • train_file: Path to your training file (in .jsonl format described above).
  • do_eval: Boolean flag to evaluate model on validation set during training
  • validation_file: Path to your optional validation file (in .jsonl format described above)
  • model_name_or_path: Name of or path to Huggingface model (recommend facebook/bart-large). Currently only supports BART checkpoints.
  • multiencoder_type: Set to bart
  • multiencoder_max_num_chunks: Number of segments
  • multiencoder_stride: Boolean flag to use 50%-overlap strides in segmentation. If not set, segments will be disjoint, which may degrade model performance.
  • max_source_len: Segment length
  • learning_rate: Learning rate (recommend 0.000005 if replicating paper experiments)
  • save_strategy: Set to epoch to save checkpoint at end of each epoch
  • num_train_epochs: Number of epochs
  • gradient_checkpointing (recommended for larger models): Boolean flag to turn on gradient checkpointing, which reduces the memory footprint at the cost of additional compute. This may be necessary for some models depending on the number of segments, the segment length, and the GPU memory available.
  • output_dir: Output directory for saved model checkpoints and logs
  • per_device_train_batch_size: Batch size, typically 1 for larger models
  • generation_max_len and val_max_target_length: Set to the maximum target length
  • evaluation_strategy: Set to epoch if you wish to evaluate at the end of each epoch
  • per_device_eval_batch_size: Evaluation batch size, typically 1 for larger models
  • metric_for_best_model (see also compute_rouge_for_train and predict_with_generate below): Set to eval_mean_rouge (recommended) if you wish to use mean ROUGE as the criterion for selecting the checkpoint. Leave off to use cross entropy.
  • compute_rouge_for_train: Include if you wish to compute ROUGE as part of the evaluation during training (necessary if metric_for_best_model is eval_mean_rouge).
  • predict_with_generate: Required boolean flag if compute_rouge_for_train is set.
  • logging_strategy: Set to epoch to log results at end of each epoch
  • overwrite_output_dir: Boolean flag to overwrite output directory with multiple runs
  • load_best_model_at_end: Boolean flag to load the best checkpoint at the end
  • seed: Optional random seed
  • Optionally, other arguments for the Huggingface Seq2SeqTrainer specified in Seq2SeqTrainingArguments

See the training script itself for documentation on other arguments. Note that it is based on the standard HuggingFace training script for summarization and uses many of the same command-line arguments.
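The dependencies among the ROUGE-related flags described in the argument list can be summarized as a small check. This is a hypothetical helper written for illustration. It is not part of the training script, and it assumes the arguments have been collected into a plain dict keyed by argument name.

```python
def check_rouge_arguments(args):
    """Return a list of problems with the ROUGE-related flags in `args` (a dict)."""
    problems = []
    uses_mean_rouge = args.get("metric_for_best_model") == "eval_mean_rouge"
    if uses_mean_rouge and not args.get("compute_rouge_for_train"):
        problems.append(
            "metric_for_best_model=eval_mean_rouge requires compute_rouge_for_train"
        )
    if args.get("compute_rouge_for_train") and not args.get("predict_with_generate"):
        problems.append("compute_rouge_for_train requires predict_with_generate")
    return problems

# Flags as in the training template: all three are set, so no problems are reported
template_args = {
    "metric_for_best_model": "eval_mean_rouge",
    "compute_rouge_for_train": True,
    "predict_with_generate": True,
}
print(check_rouge_arguments(template_args))
```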

3. Evaluate your model

There are two main options for evaluation, described below.

HuggingFace rouge metric (simpler)

This relies on datasets.load_metric().

Run the same script with the appropriate arguments for testing. The example template below is consistent with the training template from Step 2:

python \
  --do_predict \
  --test_file PATH_TO_TEST_FILE \
  --model_name_or_path PATH_TO_SAVE_MODEL \
  --multiencoder_type bart \
  --multiencoder_max_num_chunks 32 \
  --multiencoder_stride \
  --max_source_len 512 \
  --output_dir PATH_TO_TEST_OUTPUT \
  --generation_max_len 256 \
  --val_max_target_length 256 \
  --per_device_eval_batch_size 1 \
  --predict_with_generate \
  --prediction_path PATH_TO_PREDICTION_OUTPUT

You will need to set test_file to a test file in the .jsonl format described in Step 1. Set model_name_or_path to the top-level PATH_TO_SAVE_MODEL specified in the training script; this top-level directory has the best-performing checkpoint according to the metric_for_best_model argument to the training script. Set output_dir to the directory where testing outputs will go and prediction_path to the file where generated predictions will go. If you change any model parameters in the training script be sure to update corresponding arguments in the test script (e.g. number of segments, segment length).

SummEval rouge metric

The SummEval implementation uses the original Perl script for computing ROUGE.

To run this, first run the test script above, then additionally run the ROUGE-reporting script on the generated predictions. You can see examples of this in steps 5-6 of the Reproducing QMSum experiments section.