Experiment code for the paper:
Explaining Context Length Scaling and Bounds for Language Models [arXiv Link]
Author: Jingzhe Shi
Abstract:
Long Context Language Models have drawn great attention in the past few years. Prior work has discussed the impact of long context on Language Model performance: some studies find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework provides practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models.
The repository contains three major parts of code:

- Synthetic Data: generating data, training models, measuring Cross Entropy Loss, and obtaining middle-layer feature representations of our Synthetic Dataset
- Measuring Natural Language: generating text corpora, measuring CE Loss, and obtaining middle-layer feature representations
- Experiments on an OpenWebText subset: generating the training sub-dataset and training GPT-2 with nanoGPT
Please refer to `SyntheticDataset/requirements.txt` for requirements.
Run `generate_data.py`, which generates the training/validation sets according to the task defined in `task_definition.json`.
Run `train_model.sh` with different context length and dataset size settings.
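A sweep over settings can be scripted as below. This is a dry-run sketch that only prints the commands it would run; the positional argument order (context length first, then dataset size) is an assumption, so check `train_model.sh` for its actual interface before executing.

```shell
# Dry-run sweep: collect one training command per (context length,
# dataset size) pair, then print them without executing.
# NOTE: the argument order for train_model.sh is an assumption.
cmds=""
for ctx in 64 256 1024; do
  for dsize in 10000 100000; do
    cmds="${cmds}bash train_model.sh ${ctx} ${dsize}"$'\n'
  done
done
printf '%s' "$cmds"
```

Replacing `echo`-style collection with direct execution is a one-line change once the script's interface is confirmed.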
Run `train_model.py` with the `--use_bi_mlp` setting to train the Bi-MLP on different context lengths. Then, run `save_feature_tensor.py` with the corresponding model weights to obtain middle-layer features.
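The feature-extraction step might look like the following dry-run sketch; the checkpoint path pattern and the flag names are illustrative assumptions, not the script's verified interface.

```shell
# Dry-run: print one save_feature_tensor.py invocation per trained
# Bi-MLP context length. Checkpoint paths and flag names below are
# assumptions; adapt them to the actual training outputs.
feat_cmds=""
for ctx in 64 256 1024; do
  feat_cmds="${feat_cmds}python save_feature_tensor.py --ckpt checkpoints/bimlp_ctx${ctx}.pt"$'\n'
done
printf '%s' "$feat_cmds"
```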
Please refer to `NaturalLanguage/Measuring/requirements.txt` for requirements.
Please refer to `prepare_data.py` for preparing the datasets.
Run `evaluate_CE.py` with different `seq_lens` and `model_name` settings to obtain results.
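The evaluation sweep over models and sequence lengths can be sketched as a dry run; the flag spellings mirror the setting names above (`seq_lens`, `model_name`), but whether these are CLI flags or variables edited inside the script is an assumption.

```shell
# Dry-run: print evaluate_CE.py invocations for each (model, seq_len)
# pair. The exact CLI spelling of the flags is an assumption.
eval_cmds=""
for model in gpt2 gpt2-medium; do
  for len in 128 512 2048; do
    eval_cmds="${eval_cmds}python evaluate_CE.py --model_name ${model} --seq_len ${len}"$'\n'
  done
done
printf '%s' "$eval_cmds"
```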
Run `save_mid_features.py` to obtain the different feature vectors.
Please refer to the Python scripts in `NaturalLanguage/Measuring/draw_CEvsPCA` for drawing figures.
Please refer to `NaturalLanguage/ContextLengthScalingTrainingExps/nanoGPT-master/README.md` for the installation of nanoGPT.
Run `data/openwebtext/prepare.py` with `percent` and `start_from` set to appropriate values.
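As a dry-run sketch, the preparation call could look like the line below. Whether `percent` and `start_from` are command-line flags or variables edited inside `prepare.py` is an assumption, and the values are illustrative only.

```shell
# Dry-run: print the subset-preparation command. Flag names and
# values are assumptions; check prepare.py for the real interface.
prep_cmd="python data/openwebtext/prepare.py --percent 2 --start_from 0"
echo "$prep_cmd"
```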
Run `script.py` to generate config files (already generated in this repository). Then, run `sbatch train_script.sh 5120 2p0` to start training for context length 5120 and dataset percent 2 on the clusters we use. Please modify the script to run jobs on other machines.
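Queuing the full grid of jobs can be sketched as a dry run. Reading the second argument as a `<percent>p0` encoding follows the `5120 2p0` example above, but treating it as a general naming scheme is an assumption about how the generated configs are named.

```shell
# Dry-run: print one sbatch command per (context length, dataset
# percent) pair. The "<percent>p0" encoding is an assumption based
# on the 5120/2p0 example; verify against the generated config names.
sbatch_cmds=""
for ctx in 1024 5120; do
  for pct in 2p0 5p0; do
    sbatch_cmds="${sbatch_cmds}sbatch train_script.sh ${ctx} ${pct}"$'\n'
  done
done
printf '%s' "$sbatch_cmds"
```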