Bi-Phase IVFPQ
This tutorial explains how to reproduce the results of our paper, Bi-Phase IVFPQ, on the MSMARCO passage collection.
Reproducing From Checkpoint
Make sure you have finished the data processing steps in Data. Then download the checkpoint and the necessary index files from OneDrive. The uncompressed files should look like:
Bi-Phase-IVFPQ-MSMARCO
├── ckpts
│   ├── DistillVQ_d-RetroMAE
│   │   └── best
│   ├── TokIVF
│   │   └── best
│   └── TopIVF
│       └── best
└── index
    └── RetroMAE
        └── faiss
            ├── IVF10000,PQ64x8
            └── OPQ96,PQ96x8

Move ckpts/* to src/data/cache/MSMARCO-passage/ckpts/. Move index/* to src/data/cache/MSMARCO-passage/index/.

Bi-Phase IVFPQ is a general IVFPQ framework, so it relies on off-the-shelf embeddings to work. Here we use the distilled RetroMAE as the embedding model. We encode all documents and queries using RetroMAE and save the resulting embeddings:
# uses four GPUs
torchrun --nproc_per_node=4 run.py RetroMAE ++mode=eval ++plm=retromae_distill ++save_encode
The resulting files will be stored at src/data/cache/MSMARCO-passage/encode/RetroMAE/. By default, the evaluation uses the Flat index, which scans the entire database for each query. The metrics should be similar to:

MRR@10    Recall@10    Recall@100    Recall@1000
0.4155    0.708        0.9268        0.9876
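For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MRR@10 and Recall@k are commonly computed. It uses the standard textbook definitions; the repository's own evaluation code is not shown in this tutorial and may differ in details. The names `run`, `qrels`, and all document ids are hypothetical.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, cutoffs=(10, 100, 1000)):
    """Average the per-query metrics over all queries that have judgments."""
    queries = [q for q in qrels if qrels[q]]
    total = len(queries)
    metrics = {"MRR@10": sum(mrr_at_k(run.get(q, []), qrels[q]) for q in queries) / total}
    for k in cutoffs:
        metrics[f"Recall@{k}"] = sum(
            recall_at_k(run.get(q, []), qrels[q], k) for q in queries
        ) / total
    return metrics


# Toy data (hypothetical): ranked results per query and the judged relevant documents.
run = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d4", "d5"}}
metrics = evaluate(run, qrels, cutoffs=(2, 3))
print(metrics)
```

The toy cutoffs are small only because the toy rankings are short; the tutorial reports cutoffs of 10, 100, and 1000.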
Prepare PQ module.
python run.py DistillVQ_d-RetroMAE ++mode=eval ++save_index
This evaluates the performance of exhaustive PQ with 96 sub-vectors. The metrics should be similar to:

MRR@10    Recall@10    Recall@100    Recall@1000
0.3993    0.6846       0.9207        0.9845
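To make the idea behind exhaustive PQ concrete, here is a toy sketch of product quantization in plain Python. Each vector is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and a query is scored against every code through a precomputed lookup table. The codebooks below are hand-written and tiny (2 sub-vectors, 2 centroids each), whereas the index name PQ96x8 denotes 96 sub-vectors with 8-bit codes. The use of inner-product scoring is an assumption, and none of this reflects how DistillVQ trains its codebooks.

```python
def split(vec, num_sub):
    """Split a vector into num_sub equal-length sub-vectors."""
    step = len(vec) // num_sub
    return [vec[i * step:(i + 1) * step] for i in range(num_sub)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]


def adc_table(query, codebooks):
    """Inner product between each query sub-vector and every centroid."""
    return [
        [sum(q * c for q, c in zip(sub, centroid)) for centroid in book]
        for sub, book in zip(split(query, len(codebooks)), codebooks)
    ]


def pq_score(code, table):
    """Approximate inner product: one table lookup per sub-vector."""
    return sum(table[m][c] for m, c in enumerate(code))


# Hand-written codebooks (hypothetical): one list of centroids per sub-vector.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
docs = {"d1": [0.9, 1.1, 0.1, 0.8], "d2": [0.1, -0.1, 0.9, 0.2]}
codes = {doc_id: pq_encode(vec, codebooks) for doc_id, vec in docs.items()}

query = [1.0, 0.5, 0.0, 1.0]
table = adc_table(query, codebooks)
# "Exhaustive" means every stored code is scored, with no inverted lists.
scores = {doc_id: pq_score(code, table) for doc_id, code in codes.items()}
print(codes, scores)
```

Scoring against compressed codes instead of the original embeddings is lossy, which is consistent with the slightly lower metrics than those of the Flat index.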
Prepare Topic IVF.
python run.py TopIVF ++mode=eval
This evaluates the performance of topic-phase IVF followed by PQ verification when selecting 20 topics for each query. The metrics should be:

MRR@10    Recall@10    Recall@100    Recall@1000
0.355     0.5947       0.7917        0.8423
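The topic-phase search described above follows the general IVF pattern: documents are grouped into inverted lists, and a query only scans the lists of its closest topics. The sketch below illustrates that pattern with toy 2-dimensional vectors. It is a generic illustration, not the TopIVF model itself: the topic vectors are hypothetical, and assigning documents by inner product is an assumption.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_topic_lists(doc_vecs, topic_vecs):
    """Assign each document to its single best-matching topic."""
    lists = {t: [] for t in range(len(topic_vecs))}
    for doc_id, vec in doc_vecs.items():
        best = max(range(len(topic_vecs)), key=lambda t: dot(vec, topic_vecs[t]))
        lists[best].append(doc_id)
    return lists


def probe(query_vec, topic_vecs, lists, num_topics):
    """Collect candidates from the num_topics topics closest to the query."""
    ranked = sorted(
        range(len(topic_vecs)),
        key=lambda t: dot(query_vec, topic_vecs[t]),
        reverse=True,
    )
    candidates = []
    for t in ranked[:num_topics]:
        candidates.extend(lists[t])
    return candidates


# Toy data (hypothetical): three topics and four documents.
topic_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
doc_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.2, 0.8],
    "d3": [-0.7, 0.1],
    "d4": [0.6, 0.5],
}
topic_lists = build_topic_lists(doc_vecs, topic_vecs)

query_vec = [0.8, 0.3]
one_topic = probe(query_vec, topic_vecs, topic_lists, num_topics=1)
two_topics = probe(query_vec, topic_vecs, topic_lists, num_topics=2)
print(topic_lists, one_topic, two_topics)
```

Documents in lists that are never probed cannot be returned, which is one plausible reason Recall@1000 is lower here than with exhaustive PQ. Probing more topics widens the candidate set at the cost of more scoring work.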
Prepare Term IVF.
# use more GPUs than for TopIVF, because TokIVF involves a BERT model and is therefore heavier
torchrun --nproc_per_node=4 run.py TokIVF ++mode=eval ++save_encode
If you encounter memory issues when building the inverted index, rerun the above command with ++index_shard=64 ++index_thread=5 (that is, increase the number of shards and decrease the number of parallel processes). The default values are specified in src/data/config/index/invhit.yaml.
This evaluates the performance of term-phase IVF followed by PQ verification when selecting 3 terms for each document. The metrics should be:

MRR@10    Recall@10    Recall@100    Recall@1000
0.395     0.6741       0.8864        0.9315
Chain Them Together.
torchrun --nproc_per_node=4 run.py BIVFPQ
This evaluates Bi-phase IVFPQ under its default settings: 3 terms and 1 topic for each document to index, all included terms and 20 topics for each query to search. The results should be:

MRR@10    Recall@10    Recall@100    Recall@1000
0.3986    0.6829       0.9144        0.9742
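As a rough illustration of how the two phases can be chained, the sketch below indexes each document under its top-weighted terms and under one topic, then gathers the candidates for a query as the union of the matching term lists and the probed topic lists. Taking the union is an assumption about how the phases combine, and the term weights, topic ids, and document ids are all hypothetical. In the real pipeline the weights and topics come from the TokIVF and TopIVF checkpoints, and the candidates are then ranked by PQ verification.

```python
def top_terms(term_weights, k):
    """Keep the k highest-weighted terms of a document."""
    return sorted(term_weights, key=term_weights.get, reverse=True)[:k]


def build_term_lists(doc_term_weights, k):
    """Inverted lists mapping each kept term to the documents indexed under it."""
    lists = {}
    for doc_id, weights in doc_term_weights.items():
        for term in top_terms(weights, k):
            lists.setdefault(term, []).append(doc_id)
    return lists


def bi_phase_candidates(query_terms, query_topics, term_lists, topic_lists):
    """Union of documents reached through any query term or any probed topic."""
    candidates = set()
    for term in query_terms:
        candidates.update(term_lists.get(term, []))
    for topic in query_topics:
        candidates.update(topic_lists.get(topic, []))
    return candidates


# Toy data (hypothetical): per-document term weights and one topic per document.
doc_term_weights = {
    "d1": {"neural": 0.9, "search": 0.7, "index": 0.2},
    "d2": {"index": 0.8, "faiss": 0.6, "search": 0.1},
    "d3": {"cooking": 0.9, "pasta": 0.5},
}
term_lists = build_term_lists(doc_term_weights, k=2)
topic_lists = {0: ["d1"], 1: ["d2", "d3"]}

# The query matches the term "faiss" and probes topic 0 only.
candidates = bi_phase_candidates(["faiss"], [0], term_lists, topic_lists)
print(sorted(candidates))
```

In this toy case the term phase contributes d2 and the topic phase contributes d1, so each phase recovers a document the other would miss.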
You can easily try different configurations:
# index 5 terms for each document
torchrun --nproc_per_node=4 run.py BIVFPQ ++x_text_gate_k=5

# search 10 topics for each query
torchrun --nproc_per_node=4 run.py BIVFPQ ++y_query_gate_k=10
You can inspect src/data/config/BIVFPQ.yaml for more details.
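If you want to try several configurations in a row, a small helper can generate the command lines. The sketch below only builds and prints the commands. It assumes that the two overrides shown above can be combined in a single run, which this tutorial does not confirm, and the grid values are arbitrary examples.

```python
import shlex

# Base command taken from the tutorial; adjust the GPU count to your machine.
BASE = ["torchrun", "--nproc_per_node=4", "run.py", "BIVFPQ"]


def build_command(overrides):
    """Append ++key=value overrides to the base command."""
    return BASE + [f"++{key}={value}" for key, value in overrides.items()]


# Hypothetical grid: terms indexed per document x topics searched per query.
sweep = [
    {"x_text_gate_k": k, "y_query_gate_k": n}
    for k in (3, 5)
    for n in (10, 20)
]
commands = [shlex.join(build_command(overrides)) for overrides in sweep]
for command in commands:
    print(command)
```

To actually launch the runs, each argument list from `build_command` could be passed to `subprocess.run`, one after another.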