ELEET is an execution engine for running multi-modal queries over datasets containing texts and tables. It is the implementation described in
Matthias Urban and Carsten Binnig: "ELEET: Efficient Learned Query Execution over Text and Tables." PVLDB, 17(13): 4867-4880, 2024. PDF
- Code for pre-training (i.e., corpus construction and pre-training scripts) is located in `eleet_pretrain`.
- Code for everything else (e.g., query plans, MMOps, baselines, benchmark) is located in `eleet`.
- Some scripts are located in `scripts` and `slurm`, as described below and sketched in the layout overview that follows.
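
For orientation, a rough sketch of the repository layout (only the directories mentioned in this README; `models` and `datasets` are created or populated by the steps below):

```
eleet/            # query plans, MMOps, baselines, benchmark
eleet_pretrain/   # corpus construction and pre-training scripts
TaBERT/           # git submodule, installed during setup
scripts/          # helper scripts and Jupyter notebooks
slurm/            # slurm job files for finetuning and pre-training
models/           # pre-trained and finetuned models
datasets/         # downloaded and generated datasets
```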
⚠️ Due to outdated dependencies, we are currently updating the codebase to work with the latest versions of the libraries.
Install ELEET and all necessary requirements:

```bash
git clone [email protected]:DataManagementLab/eleet.git  # clone repo
cd eleet
git submodule update --init
conda env create -f environment.yml  # set up environment
conda activate eleet
pip install git+https://github.com/meta-llama/llama.git@llama_v2  # install LLaMA
pip install -e .
cd TaBERT/ && pip install -e . && cd ..  # install TaBERT
python -m spacy download en_core_web_sm
gdown 1JIvXC0ajRZRCENMlLD3En7SGoodjP6O3  # download pre-trained model
tar -xzvf pretrained-model.tar.gz
gdown 1hFCwdf8CIWDpE3KdHVfT8uQVrpE1Bv1c  # download datasets
tar -xzvf datasets.tar.gz
```
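
Optionally, run a quick sanity check; this assumes the editable install above registers the package under the name `eleet`:

```bash
python -c "import eleet; print('eleet OK')"
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
```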
- Run finetuning:

  ```bash
  sbatch slurm/rotowire/train-ours.slurm
  ```

  Repeat for other datasets and models; see the sketch below. This will store the finetuned model in `models/rotowire/ours/finetuned`.
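  The other training scripts presumably follow the same naming scheme; the exact file names below are hypothetical, so check the `slurm/` directory for the actual job files:

  ```bash
  ls slurm/*/train-*.slurm            # list the available training jobs
  sbatch slurm/trex/train-ours.slurm  # hypothetical: finetune our model on TREx
  ```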
- Mark the finetuned model to be used:

  ```bash
  cd models/rotowire/ours/finetuned
  ln -s <git-commit-hash_git-commit-msg>/<YYYY-MM-DD_hh-mm-ss...> current  # adapt to the output dir that was created on your machine
  cd ../../../../
  ```
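  For illustration only, with hypothetical directory names (use whatever `ls` actually shows on your machine):

  ```bash
  cd models/rotowire/ours/finetuned
  ls  # e.g. 1a2b3c4_initial-commit/
  ln -s 1a2b3c4_initial-commit/2024-05-01_12-00-00 current
  cd ../../../../
  ```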
- Run evaluation (Exp 1):

  ```bash
  python eleet/benchmark.py --slurm-mode --use-test-set --split-sizes 64
  ```
- Run evaluation (Exp 2+3):

  ```bash
  python eleet/benchmark.py --slurm-mode --use-test-set --only-flagged
  ```
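  The benchmark script accepts further options; assuming a standard command-line parser, the full list can be printed with:

  ```bash
  python eleet/benchmark.py --help
  ```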
- Visualize the results using the Jupyter notebooks located in `scripts/*.ipynb`.
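  For example (assuming Jupyter is available in the `eleet` environment; otherwise install it, e.g. with `pip install notebook`):

  ```bash
  jupyter notebook scripts/
  ```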
You can download the pre-training dataset here.
Alternatively, you can generate the pre-training dataset from Wikidata yourself:
- Run MongoDB (installation instructions: https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/) and set the environment variables MONGO_USER, MONGO_PASSWORD, MONGO_HOST, MONGO_PORT and MONGO_DB; see the sketch after this list.
- Start the data pre-processing:

  ```bash
  python scripts/load_data.py trex-wikidata
  ```

  The preprocessed data will appear in `datasets/preprocessed_data/preprocessed_trex-wikidata*`.
- Use `slurm/pretrain.slurm` for pre-training (first adjust the path in the file so that it points to the pre-training dataset). This will store the pretrained model in `models/pretrained`.
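
A minimal sketch of the MongoDB connection settings; all values are placeholders, only the variable names come from the step above:

```bash
export MONGO_USER=eleet          # placeholder
export MONGO_PASSWORD=secret     # placeholder
export MONGO_HOST=localhost      # placeholder
export MONGO_PORT=27017          # MongoDB's default port
export MONGO_DB=eleet            # placeholder
```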
- Generate the TREx dataset:

  ```bash
  python eleet/datasets/trex/generate.py
  ```
- Generate the Rotowire dataset:

  ```bash
  python eleet/datasets/rotowire/generate.py
  ```
If you use the code or the benchmarks of this repository, please cite our paper:

```bibtex
@article{eleet,
  title     = {ELEET: Efficient Learned Query Execution over Text and Tables},
  author    = {Matthias Urban and Carsten Binnig},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {17},
  number    = {13},
  pages     = {4867--4880},
  year      = {2024},
  publisher = {VLDB Endowment}
}
```