This repository contains a collection of useful retrieval tools, inspired by previous codebases such as Contriever. The main difference from previous repos like Contriever and DPR-Scale is the added support for different types of retrievers, such as BM25, SentenceTransformers, and API-based retrievers.
The goal of this repository is to provide a simple, easy-to-use, and efficient codebase for retrieval tasks.
We support the following operations:
- Building large-scale dense indices with sharding
- Retrieving across sharded dense indices on GPUs and merging the results
- Common API for different retrieval models (e.g., BM25, SentenceTransformers, API calls)
The code is especially tailored to a Slurm-based cluster, and we optimize for easy parallelization and efficient memory usage. This can be useful for large-scale retrieval tasks, where we might want to annotate a pre-training-scale corpus with retrieval results. We also provide a simple API for playing around with different retrieval models by enabling serving indices.
> **Note**
> This repository is still under development; please be patient as I work on adding more features and improving the documentation! Check the TODOs section for more information.
- **Encoder**: Fast encoding with dense models
- **Corpus**: A collection of documents with ids, text, and other metadata
- **Index**: A dense index that supports vector search with FAISS
- **Retriever**: A common interface over different retrieval models (e.g., BM25, dense, API calls)
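To illustrate how these abstractions might fit together, here is a minimal sketch. The class and method names below are hypothetical stand-ins, not the repo's actual API:

```python
# Minimal sketch of the Corpus/Retriever abstractions; names are
# hypothetical, not the actual retrievaltools API.
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """A collection of documents with ids, text, and other metadata."""
    docs: dict = field(default_factory=dict)  # id -> {"text": ..., **metadata}

    def add(self, doc_id, text, **metadata):
        self.docs[doc_id] = {"text": text, **metadata}


class Retriever:
    """Common interface: every backend (BM25, dense, API) implements retrieve()."""

    def retrieve(self, query, k=10):
        raise NotImplementedError


class KeywordRetriever(Retriever):
    """Toy backend: rank documents by query-term overlap."""

    def __init__(self, corpus):
        self.corpus = corpus

    def retrieve(self, query, k=10):
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc["text"].lower().split())), doc_id)
            for doc_id, doc in self.corpus.docs.items()
        ]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


corpus = Corpus()
corpus.add("d1", "dense retrieval with FAISS", source="wiki")
corpus.add("d2", "sparse retrieval with BM25", source="wiki")
retriever = KeywordRetriever(corpus)
print(retriever.retrieve("dense FAISS", k=1))  # prints ['d1']
```

The point of the shared `retrieve()` interface is that downstream code does not care which backend produced the results.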
You should set up a virtual environment with all the dependencies. Then, you can install the package with:

```bash
pip install -e .
```
Then, you should be able to import the package with:

```python
import retrievaltools as rt

# load the retriever
retriever = rt.load_retriever(rt.RetrieverOptions(retriever_type="web_search", cache_path="cache/serper.json"))
```
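The `cache_path` argument suggests that API retrieval results are cached to a JSON file so repeated queries do not hit the API again. A minimal sketch of such a cache (the `JSONCache` class here is a hypothetical illustration, not the repo's actual implementation):

```python
import json
import os
import tempfile


class JSONCache:
    """Toy JSON-file cache for API retrieval results (hypothetical;
    the actual caching in retrievaltools may differ)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get_or_call(self, query, fn):
        # Only hit the API (fn) when the query is not cached yet.
        if query not in self.data:
            self.data[query] = fn(query)
            os.makedirs(os.path.dirname(self.path) or ".", exist_ok=True)
            with open(self.path, "w") as f:
                json.dump(self.data, f)
        return self.data[query]


# demo with a throwaway temp directory instead of a real API call
path = os.path.join(tempfile.mkdtemp(), "serper.json")
cache = JSONCache(path)
result = cache.get_or_call("retrieval tools", lambda q: [f"hit for {q}"])
print(result)  # prints ['hit for retrieval tools']
```

Because the cache persists to disk, a new process pointed at the same file sees previous results immediately.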
We use FAISS to support different types of dense indices; however, it can be tricky to get the environment exactly right. In practice, I find it easier to use a separate conda environment specifically for running FAISS GPU.
For simple encoding, you may not need FAISS, and you can install all packages with:

```bash
pip install -r requirements.txt
```
You should install `torch` following these instructions to match your CUDA version.
FAISS is critical for the retrieval step if you are using a dense index; it is responsible for fast indexing and supports many useful functions (e.g., quantization, multi-gpu index, etc.).
To install the package, you should set up a conda environment and install PyTorch and FAISS (guide here).
Additionally, you should install `transformers` and `sentence-transformers`. You also want to install `datatools`, available here.
The stages of retrieval are as follows for single-vector dense retrievers (e.g., DPR, Contriever):

1. Embed the corpus (`generate_passage_embeddings.py`): encode the corpus into dense vectors.
2. Retrieve for queries (`passage_retrieval.py`): for each query, retrieve the top-k passages from the corpus.
3. Add text (`text_annotation.py`): optionally, add the passage text to the retrieved results if only the passage ids were stored.
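In miniature, the three stages amount to the following. This toy uses a bag-of-words "embedding" in place of a real dense model, purely to show the data flow of the pipeline:

```python
# Toy end-to-end version of the three stages; real runs use the scripts
# above with an actual dense encoder and a FAISS index.
from collections import Counter


def embed(text):
    # Stage-1 stand-in: a bag-of-words "embedding".
    return Counter(text.lower().split())


def dot(u, v):
    return sum(u[t] * v[t] for t in u)


corpus = {"p1": "paris is the capital of france",
          "p2": "berlin is the capital of germany"}

# Stage 1: embed the corpus.
index = {pid: embed(text) for pid, text in corpus.items()}

# Stage 2: retrieve top-k passage ids for a query.
q = embed("capital of france")
top_k = sorted(index, key=lambda pid: dot(q, index[pid]), reverse=True)[:1]

# Stage 3: annotate the retrieved ids with passage text.
results = [{"id": pid, "text": corpus[pid]} for pid in top_k]
print(results)  # prints [{'id': 'p1', 'text': 'paris is the capital of france'}]
```

Storing only ids in stage 2 keeps the retrieval output small, which is why stage 3 exists as a separate, optional annotation pass.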
For smaller corpora (e.g., Wikipedia which has ~20M passages), you may follow these steps, using Wikipedia as an example:
- Prepare the corpus: put the corpus that you want to encode in a file; supported file formats are `.tsv` and `.jsonl`. In this example, we use the Wikipedia dump, which can be downloaded from the DPR repository:

```bash
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gunzip psgs_w100.tsv.gz
```
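The DPR dump is tab-separated with `id`, `text`, and `title` columns (with text fields quoted). A minimal way to load such a file into memory, shown here with a small inline sample in place of the real ~20M-row dump:

```python
import csv
import io

# Small inline sample mimicking the psgs_w100.tsv layout
# (header row: id, text, title), instead of the full dump.
sample = 'id\ttext\ttitle\n1\t"Paris is the capital of France."\tParis\n'

corpus = {}
with io.StringIO(sample) as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        corpus[row["id"]] = {"text": row["text"], "title": row["title"]}

print(corpus["1"]["title"])  # prints Paris
```

For the real file, replace the `io.StringIO` wrapper with `open("psgs_w100.tsv")`; for very large corpora you would stream rows instead of holding the whole dict in memory.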
When building a large retrieval corpus (>500M documents or >100B tokens), it is often necessary to shard the corpus and parallelize the encoding process.
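Concretely, sharding splits the corpus into roughly equal pieces that are encoded independently (e.g., one Slurm job per shard), and at query time the per-shard top-k lists are merged by score. A toy sketch of both halves (helper names are illustrative, not the repo's):

```python
import heapq
from itertools import chain


def shard(items, num_shards):
    # Round-robin assignment of documents to shards.
    shards = [[] for _ in range(num_shards)]
    for i, item in enumerate(items):
        shards[i % num_shards].append(item)
    return shards


def merge_topk(per_shard_results, k):
    # Each shard returns (score, doc_id) pairs; keep the global top-k.
    return heapq.nlargest(k, chain.from_iterable(per_shard_results))


docs = [f"doc{i}" for i in range(10)]
shards = shard(docs, num_shards=3)
print([len(s) for s in shards])  # prints [4, 3, 3]

# Pretend each shard scored its own documents against a query.
per_shard = [[(0.9, "doc0"), (0.2, "doc3")],
             [(0.8, "doc1")],
             [(0.5, "doc2")]]
print(merge_topk(per_shard, k=2))  # prints [(0.9, 'doc0'), (0.8, 'doc1')]
```

Because each shard only returns its own top-k, the merge step touches k × num_shards candidates rather than the whole corpus.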
- [ ] Save passage text instead of loading from file
Please email me at [email protected] if you run into any issues or have any questions.