Reproducing Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution
Now accepted to AAAI 2024 Main track. Please see supplemental information in the appendix of the camera-ready + appendix version of the paper.
Please follow steps below to reproduce all data and plots.
If you would rather checkout currently running and more flexible demos, you can also reproduce the methods in this paper by checking out the open-source demos below, otherwise, skip to the Setup
:
-
Method 1: Spurious Correlations Dectection, Open Source Hugging Face Space:
- Gendered Pronoun Resolution Task: https://huggingface.co/spaces/emilylearning/spurious_correlation_evaluation.
- Generic NLP Task: https://huggingface.co/spaces/emilylearning/choose_your_own_spurious
-
Method 2: Underspecification Metric Measurement, Open Source Hugging Face Space: https://huggingface.co/spaces/emilylearning/llm_uncertainty.
-
More General Setting Toy SCM, Colab notebook: https://colab.research.google.com/github/2dot71mily/uspec/blob/main/Toy_DGP.ipynb.
git clone https://github.com/2dot71mily/uspec.git
cd uspec
python3 -m venv venv_uspec
source venv_uspec/bin/activate
python3 -m pip install --upgrade pip
# If without a GPU comment out `nvidia.*` and `triton` from requirements.tct
# BERT-like and OpenAI models can be tested without GPU
pip install -r requirements.txt
In config.py
, set METHOD_IDX
:
######################## Select method type #################################
METHODS = ['correlation_measurement', "specification_metric"]
METHOD_IDX = 0 # Select '0' for Method 1, '1' for for Method 2
Then run in terminal:
python spurious_plotting.py
In config.py
, set METHOD_IDX
& if WINO_G_ID
(Winogender Simplified
or not):
######################## Select method type #################################
METHODS = ['correlation_measurement', "specification_metric"]
METHOD_IDX = 1 # Select '0' for Method 1, '1' for for Method 2
...
# Set to `True` if testing with `Simplified` Winogender-gender-identified eval set
WINO_G_ID = False
Then run in terminal:
python uncertainty_plotting.py
Note: this process is a bit slow, due to the large number of files ingested.
First test the setup as shown below.
Note: The LLM weights will be downloaded / cached from Hugging Face. Throughout, it is expected that Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
In config.py
, set TESTING
, METHOD_IDX
, INFERENCE_TYPE_IDX
& MODEL_IDX
:
TESTING = True # Set to True if testing on subsets of the challenge sets
######################## Select method type #################################
METHODS = ['correlation_measurement', "specification_metric"]
METHOD_IDX = 0 # Select '0' for Method 1, '1' for for Method 2
...
########################## Select inference type ############################
INFERENCE_TYPES = ["bert_like", "open_ai", "cond_gen"]
INFERENCE_TYPE_IDX = 2
# Select '0' for BERT-like, '1' for `open-ai`, 2 for `UL2-family`
...
###################### Select model type #######################
...
CONDITIONAL_GEN_MODELS = ["UL2-20B-denoiser", "UL2-20B-gen", "Flan-UL2-20B"]
MODEL_IDX = 0
##################### If model is GPT ##################
# Required OpenAI API key only for `open_ai` type inference
OPENAI_API_KEY = '<OpenAI API key>'
Then run in terminal:
python inference.py
In config.py
, set TESTING
, METHOD_IDX
, if WINO_G_ID
(Winogender Simplified
or not), INFERENCE_TYPE_IDX
& MODEL_IDX
:
TESTING = True # Set to True if testing on subsets of the challenge sets
######################## Select method type #################################
METHODS = ['correlation_measurement', "specification_metric"]
METHOD_IDX = 1 # Select '0' for Method 1, '1' for for Method 2
...
########################## Select inference type ############################
INFERENCE_TYPES = ["bert_like", "open_ai", "cond_gen"]
INFERENCE_TYPE_IDX = 2
# Select '0' for BERT-like, '1' for `open-ai`, 2 for `UL2-family`
...
###################### Select model type #######################
...
CONDITIONAL_GEN_MODELS = ["UL2-20B-denoiser", "UL2-20B-gen", "Flan-UL2-20B"]
MODEL_IDX = 0
##################### If model is GPT ##################
# Required OpenAI API key only for `open_ai` type inference
OPENAI_API_KEY = '<OpenAI API key>'
Then run in terminal:
python inference.py
In config.py
, same as Method 1 above but set:
TESTING = False
Then run in terminal:
python inference.py
In config.py
, same as Method 2 above but set:
TESTING = False
Then run in terminal:
python inference.py