preprocess.py

scripts.preprocess.init_text(collection_path: str, cache_dir: str) → dict[str, int]

Convert document ids to offsets.

scripts.preprocess.init_query_and_qrel(query_path: str, qrel_path: str, cache_dir: str, tid2index: dict[str, int])

Tokenize the query file and map each passage/document/query id in the qrel file to its index in the saved token-ids matrix.

Parameters

query_path – query file path
qrel_path – qrel file path
cache_dir – the directory to save files
tid2index – mapping from text ids to text indices
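The id-to-index translation described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper name `qrels_to_indices`, the `qid2index` mapping, and the four-column tab-separated qrel layout (query id, unused field, text id, label) are all assumptions.

```python
from typing import Iterable


def qrels_to_indices(
    qrel_lines: Iterable[str],
    qid2index: dict[str, int],
    tid2index: dict[str, int],
) -> list[tuple[int, int, int]]:
    """Translate (query id, text id, label) rows into (query index, text index, label)."""
    converted = []
    for line in qrel_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _, tid, label = fields
        # Skip judgments that reference ids missing from either mapping.
        if qid not in qid2index or tid not in tid2index:
            continue
        converted.append((qid2index[qid], tid2index[tid], int(label)))
    return converted


# Hypothetical ids, for illustration only.
qid2index = {"q7": 0, "q9": 1}
tid2index = {"d100": 0, "d205": 1, "d310": 2}
lines = ["q7\t0\td205\t1\n", "q9\t0\td100\t1\n", "q9\t0\td999\t1\n"]
print(qrels_to_indices(lines, qid2index, tid2index))
```

The third line is dropped because `d999` has no entry in `tid2index`; whether the real function skips or raises on unknown ids is not stated in this reference.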

scripts.preprocess.tokenize_to_memmap(input_path: str, cache_dir: str, num_rec: int, max_length: int, tokenizer: Any, tokenizer_type: str, tokenize_thread: int, text_col: Optional[list[int]] = None, text_col_sep: Optional[str] = None, is_query: bool = False) → dict[str, int]

Tokenize the query/passage/document text using multiple threads.

Parameters

input_path – query/passage/document file path
cache_dir – the directory in which to save the output token ids and related arrays
num_rec – the number of records
max_length – max length of tokens
tokenizer (transformers.Tokenizer) – transformer tokenizer
tokenizer_type – the actual tokenizer vocabulary used
tokenize_thread – the number of threads used for tokenization
text_col –
text_col_sep –
is_query – if the input is a query

Returns

mapping from the id to the index in the saved token-id matrix
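One plausible way to divide `num_rec` records among `tokenize_thread` workers is to hand each worker a contiguous index range, matching the `start_idx`/`end_idx` arguments of `_tokenize_to_memmap`. The sketch below assumes half-open ranges and an even split; the helper name `split_ranges` is hypothetical, and the actual partitioning may differ.

```python
def split_ranges(num_rec: int, num_workers: int) -> list[tuple[int, int]]:
    """Split [0, num_rec) into at most num_workers contiguous half-open ranges."""
    if num_rec < 0 or num_workers < 1:
        raise ValueError("num_rec must be >= 0 and num_workers >= 1")
    base, extra = divmod(num_rec, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional record each.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # more workers than records
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each `(start, end)` pair could then be passed to one worker, so that no two workers write to the same rows of the output arrays.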

scripts.preprocess._tokenize_to_memmap(input_path: str, output_dir: str, num_rec: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, text_col: Optional[list[int]] = None, text_col_sep: Optional[str] = None, is_query: bool = False)

Tokenize the input text, apply padding and truncation, then save the token ids, token lengths, and text ids.

Parameters

input_path – input text file path
output_dir – directory of output numpy arrays
num_rec – the number of records
start_idx – the beginning index to read
end_idx – the ending index
tokenizer – transformer tokenizer
max_length – max length of tokens
text_col –
text_col_sep –
is_query – if the input is a query
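The padding and truncation step can be illustrated with a small pure-Python sketch. The pad id of 0, the toy whitespace tokenizer, and the helper names `pad_and_truncate` and `encode_records` are assumptions for illustration; the real function uses the supplied transformer tokenizer and saves its results as numpy arrays under `output_dir`.

```python
from typing import Callable, Iterable


def pad_and_truncate(
    token_ids: list[int], max_length: int, pad_id: int = 0
) -> tuple[list[int], int]:
    """Return a row of exactly max_length ids and the number of real tokens kept."""
    kept = token_ids[:max_length]  # truncation
    length = len(kept)
    return kept + [pad_id] * (max_length - length), length  # padding


def encode_records(
    records: Iterable[tuple[str, str]],
    tokenize: Callable[[str], list[int]],
    max_length: int,
    pad_id: int = 0,
) -> tuple[list[str], list[list[int]], list[int]]:
    """Collect text ids, fixed-width token-id rows, and token lengths."""
    text_ids, rows, lengths = [], [], []
    for text_id, text in records:
        row, length = pad_and_truncate(tokenize(text), max_length, pad_id)
        text_ids.append(text_id)
        rows.append(row)
        lengths.append(length)
    return text_ids, rows, lengths


# Toy vocabulary standing in for a real tokenizer (assumption).
toy_vocab = {"deep": 1, "learning": 2, "for": 3, "search": 4}


def toy_tokenize(text: str) -> list[int]:
    return [toy_vocab.get(word, 5) for word in text.split()]


records = [("d1", "deep learning for search"), ("d2", "search")]
text_ids, rows, lengths = encode_records(records, toy_tokenize, max_length=3)
print(text_ids, rows, lengths)
```

Because every row has the same width, the rows form a rectangular matrix, and the stored lengths let a reader recover how many ids in each row are real tokens rather than padding.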