Bi-Phase IVFPQ

This tutorial explains how to reproduce the results of our paper, Bi-Phase IVFPQ, on the MSMARCO passage collection.

Reproducing From Checkpoint

Make sure you have finished the data processing steps in Data. Then download the checkpoints and the necessary index files from OneDrive. Once uncompressed, the files should look like this:

Bi-Phase-IVFPQ-MSMARCO
├── ckpts
│   ├── DistillVQ_d-RetroMAE
│   │   └── best
│   ├── TokIVF
│   │   └── best
│   └── TopIVF
│       └── best
└── index
    └── RetroMAE
        └── faiss
            ├── IVF10000,PQ64x8
            └── OPQ96,PQ96x8

Move ckpts/* to src/data/cache/MSMARCO-passage/ckpts/ and index/* to src/data/cache/MSMARCO-passage/index/.
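If you prefer to script the move, the following is a minimal sketch rather than part of the repository. The function name `install_downloads` and its arguments are hypothetical; it assumes the archive was uncompressed into a folder laid out like the tree above and refuses to overwrite anything that already exists in the cache.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def install_downloads(download_root: Path, cache_root: Path) -> list[Path]:
    """Move ckpts/* and index/* from the uncompressed download into the cache."""
    moved = []
    for sub in ("ckpts", "index"):
        dst_dir = cache_root / sub
        dst_dir.mkdir(parents=True, exist_ok=True)
        for entry in sorted((download_root / sub).iterdir()):
            target = dst_dir / entry.name
            if target.exists():
                # Never clobber an existing checkpoint or index.
                raise FileExistsError(f"refusing to overwrite {target}")
            shutil.move(str(entry), str(target))
            moved.append(target)
    return moved
```

For example, run from the repository root, `install_downloads(Path("Bi-Phase-IVFPQ-MSMARCO"), Path("src/data/cache/MSMARCO-passage"))` would place the files where the commands below expect them, assuming the download sits in the current directory.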

Bi-Phase IVFPQ is a general IVFPQ framework, so it relies on off-the-shelf embeddings. Here we use the distilled RetroMAE as the embedding model. We encode all documents and queries with RetroMAE and save the resulting embeddings:
```
# uses four GPUs
torchrun --nproc_per_node=4 run.py RetroMAE ++mode=eval ++plm=retromae_distill ++save_encode
```
The resulting files are stored in src/data/cache/MSMARCO-passage/encode/RetroMAE/. By default, the evaluation uses the Flat index, which scans the entire database for each query. The metrics should be similar to:

| MRR@10 | Recall@10 | Recall@100 | Recall@1000 |
|---|---|---|---|
| 0.4155 | 0.708 | 0.9268 | 0.9876 |

Prepare PQ module.
```
python run.py DistillVQ_d-RetroMAE ++mode=eval ++save_index
```
This evaluates exhaustive PQ with 96 subvectors; the metrics should be similar to:

| MRR@10 | Recall@10 | Recall@100 | Recall@1000 |
|---|---|---|---|
| 0.3993 | 0.6846 | 0.9207 | 0.9845 |

Prepare Topic IVF.
```
python run.py TopIVF ++mode=eval
```
This evaluates topic-phase IVF followed by PQ verification, selecting 20 topics for each query. The metrics should be similar to:

| MRR@10 | Recall@10 | Recall@100 | Recall@1000 |
|---|---|---|---|
| 0.355 | 0.5947 | 0.7917 | 0.8423 |

Prepare Term IVF.
```
# use more GPUs than for TopIVF: TokIVF involves a BERT model and is therefore heavier
torchrun --nproc_per_node=4 run.py TokIVF ++mode=eval ++save_encode
```
- If you encounter memory issues when building the inverted index, rerun the above command with ++index_shard=64 ++index_thread=5 (this increases the number of shards and decreases the number of parallel processes). The default values are specified in src/data/config/index/invhit.yaml.

This evaluates term-phase IVF followed by PQ verification, selecting 3 terms for each document. The metrics should be similar to:

| MRR@10 | Recall@10 | Recall@100 | Recall@1000 |
|---|---|---|---|
| 0.395 | 0.6741 | 0.8864 | 0.9315 |

Chain Them Together.
```
torchrun --nproc_per_node=4 run.py BIVFPQ
```
This evaluates Bi-Phase IVFPQ under its default settings: at indexing time, each document is assigned 3 terms and 1 topic; at search time, each query uses all terms included in the query and 20 topics. The results should be similar to:

| MRR@10 | Recall@10 | Recall@100 | Recall@1000 |
|---|---|---|---|
| 0.3986 | 0.6829 | 0.9144 | 0.9742 |

You can easily try different configurations:
```
# index 5 terms for each document
torchrun --nproc_per_node=4 run.py BIVFPQ ++x_text_gate_k=5
# search 10 topics for each query
torchrun --nproc_per_node=4 run.py BIVFPQ ++y_query_gate_k=10
```
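To try several settings in one go, a small helper can generate the commands for you. This is only a sketch: `sweep_commands` is a hypothetical helper, not part of the repository, and it assumes the two overrides shown above can be passed together in a single command, in the same way other overrides are combined elsewhere in this tutorial. It prints the commands instead of launching them.

```python
import itertools
import shlex


def sweep_commands(text_gate_ks, query_gate_ks, nproc=4):
    """Build one torchrun command (as an argument list) per setting pair."""
    commands = []
    for x_k, y_k in itertools.product(text_gate_ks, query_gate_ks):
        commands.append([
            "torchrun", f"--nproc_per_node={nproc}", "run.py", "BIVFPQ",
            f"++x_text_gate_k={x_k}", f"++y_query_gate_k={y_k}",
        ])
    return commands


for cmd in sweep_commands([3, 5], [10, 20]):
    # Print only; pass cmd to subprocess.run(cmd, check=True) to actually launch it.
    print(shlex.join(cmd))
```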
You can inspect src/data/config/BIVFPQ.yaml for more details.
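As a rough mental model of what the term and topic settings control, the toy sketch below indexes each document under a few terms and one topic, gathers candidates for a query from both kinds of inverted lists, and then reranks them. It is not the repository's implementation: the function names, the toy data, and the choice to take the union of the two candidate sets are all assumptions made for illustration, and the `rerank` step uses placeholder scores where the real pipeline performs PQ verification.

```python
from collections import defaultdict


def build_inverted_lists(doc_keys):
    """Map each key (a term or a topic id) to the set of documents indexed under it."""
    lists = defaultdict(set)
    for doc_id, keys in doc_keys.items():
        for key in keys:
            lists[key].add(doc_id)
    return lists


def biphase_candidates(term_lists, topic_lists, query_terms, query_topics):
    """Collect documents reachable through the query's terms or its topics (assumed union)."""
    candidates = set()
    for term in query_terms:
        candidates |= term_lists.get(term, set())
    for topic in query_topics:
        candidates |= topic_lists.get(topic, set())
    return candidates


def rerank(candidates, score, k):
    """Stand-in for PQ verification: score each candidate and keep the top k."""
    return sorted(candidates, key=lambda d: (-score(d), d))[:k]


# Toy data (hypothetical): three terms and one topic per document.
doc_terms = {
    0: ["ivf", "pq", "index"],
    1: ["bert", "query", "term"],
    2: ["pq", "codebook", "vector"],
}
doc_topic = {0: [7], 1: [3], 2: [7]}
term_lists = build_inverted_lists(doc_terms)
topic_lists = build_inverted_lists(doc_topic)

found = biphase_candidates(term_lists, topic_lists, query_terms=["bert"], query_topics=[7])
toy_scores = {0: 0.2, 1: 0.9, 2: 0.5}  # placeholder for approximate PQ scores
top2 = rerank(found, toy_scores.get, k=2)
print(sorted(found), top2)
```

In this toy case the term "bert" reaches document 1 and topic 7 reaches documents 0 and 2, so each phase contributes candidates the other one misses.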