preprocess.py
- scripts.preprocess.init_text(collection_path: str, cache_dir: str) → dict[str, int]
Convert document ids to offsets.
- scripts.preprocess.init_query_and_qrel(query_path: str, qrel_path: str, cache_dir: str, tid2index: dict[str, int])
Tokenize the query file and map each passage/document/query id in the qrel file to its index in the saved token-id matrix.
- Parameters
query_path – query file path
qrel_path – qrel file path
cache_dir – the directory to save files
tid2index – mapping from text ids to text indices
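The id-to-index conversion described above can be sketched as follows. This is only an illustration, not the script's actual code: the helper name `remap_qrels`, the four-column TREC-style qrel layout (query id, iteration, text id, relevance), and the separate `qid2index` mapping are all assumptions.

```python
def remap_qrels(qrel_lines, qid2index, tid2index):
    """Hypothetical helper: replace raw ids in qrel lines with matrix row indices.

    Assumes whitespace-separated lines of the form: qid, iteration, text id, relevance.
    """
    remapped = []
    for line in qrel_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _, tid, rel = fields
        # Drop judgments that refer to ids missing from either mapping.
        if qid not in qid2index or tid not in tid2index:
            continue
        remapped.append((qid2index[qid], tid2index[tid], int(rel)))
    return remapped


qrels = ["q1 0 d7 1", "q2 0 d3 0", "q9 0 d7 1"]
remapped = remap_qrels(qrels, {"q1": 0, "q2": 1}, {"d3": 0, "d7": 1})
print(remapped)
```

The last judgment is dropped because `q9` has no entry in the assumed query mapping; a real pipeline might prefer to raise an error instead.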
- scripts.preprocess.tokenize_to_memmap(input_path: str, cache_dir: str, num_rec: int, max_length: int, tokenizer: Any, tokenizer_type: str, tokenize_thread: int, text_col: Optional[list[int]] = None, text_col_sep: Optional[str] = None, is_query: bool = False) → dict[str, int]
Tokenize the passage/document text in multiple threads.
- Parameters
input_path – query/passage/document file path
cache_dir – directory in which to save the output token ids and related files
num_rec – the number of records
max_length – maximum number of tokens per record
tokenizer (transformers.Tokenizer) – transformer tokenizer
tokenizer_type – the actual tokenizer vocabulary used
tokenize_thread – number of tokenization threads
text_col –
text_col_sep –
is_query – whether the input is a query
- Returns
mapping from the id to the index in the saved token-id matrix
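One way to divide the work among threads is to give each one a contiguous slice of the `num_rec` records, which matches the `start_idx`/`end_idx` arguments of `_tokenize_to_memmap`. The sketch below is an assumption about how such a split could look: `split_ranges` is a hypothetical helper, and the half-open `[start, end)` convention is chosen for illustration, since the documentation does not say whether `end_idx` is inclusive.

```python
def split_ranges(num_rec, num_workers):
    """Hypothetical helper: split [0, num_rec) into contiguous half-open ranges."""
    base, extra = divmod(num_rec, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional record each.
        end = start + base + (1 if worker < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


ranges = split_ranges(10, 3)
print(ranges)
```

Because the ranges do not overlap, each worker can write its rows into a shared output matrix without coordinating with the others.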
- scripts.preprocess._tokenize_to_memmap(input_path: str, output_dir: str, num_rec: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, text_col: Optional[list[int]] = None, text_col_sep: Optional[str] = None, is_query: bool = False)
Tokenize the input text, apply padding and truncation, then save the token ids, token lengths, and text ids.
- Parameters
input_path – input text file path
output_dir – directory of output numpy arrays
start_idx – the beginning index to read
end_idx – the ending index
tokenizer – transformer tokenizer
max_length – maximum number of tokens per record
text_col –
text_col_sep –
is_query – whether the input is a query
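The tokenize, pad/truncate, and save steps might look roughly like the sketch below. It is not the script's implementation: `toy_tokenize` stands in for a real transformer tokenizer, and the function name `write_token_memmap`, the file names, the `int32` dtype, and the pad id of 0 are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def toy_tokenize(text):
    """Stand-in for a real tokenizer: each whitespace token becomes its length."""
    return [len(word) for word in text.split()]


def write_token_memmap(records, output_dir, max_length, pad_id=0):
    """Hypothetical sketch: pad/truncate each record and store the rows in a memmap."""
    num_rec = len(records)
    token_ids = np.memmap(
        os.path.join(output_dir, "token_ids.mmp"),  # assumed file name
        dtype=np.int32, mode="w+", shape=(num_rec, max_length),
    )
    token_lengths = np.zeros(num_rec, dtype=np.int32)
    tid2index = {}
    for idx, (text_id, text) in enumerate(records):
        ids = toy_tokenize(text)[:max_length]  # truncation
        token_lengths[idx] = len(ids)
        token_ids[idx] = ids + [pad_id] * (max_length - len(ids))  # padding
        tid2index[text_id] = idx
    token_ids.flush()  # make sure the rows reach the file on disk
    np.save(os.path.join(output_dir, "token_lengths.npy"), token_lengths)
    return tid2index


with tempfile.TemporaryDirectory() as tmp:
    records = [("d1", "a bb ccc dddd"), ("d2", "ee")]
    mapping = write_token_memmap(records, tmp, max_length=3)
    stored = np.memmap(os.path.join(tmp, "token_ids.mmp"),
                       dtype=np.int32, mode="r", shape=(2, 3))
    rows = stored.tolist()
    lengths = np.load(os.path.join(tmp, "token_lengths.npy")).tolist()
    del stored  # release the file before the directory is removed
```

A memmap keeps the token-id matrix on disk and pages rows in on demand, which is useful when the collection is too large to hold in memory.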