preprocess.py

scripts.preprocess.init_text(collection_path: str, cache_dir: str) → dict[str, int]

Map each document id to its offset.
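As a rough illustration of what an id-to-offset mapping can look like, the sketch below records the byte offset of each line in a collection file. It is not the module's implementation: the tab-separated layout with the id in the first column, the choice of byte offsets, and the helper name build_id_to_offset are all assumptions made for the example.

```python
import os
import tempfile


def build_id_to_offset(collection_path: str) -> dict[str, int]:
    """Map each record id (assumed: first tab-separated field) to the
    byte offset at which its line starts in the file."""
    id2offset: dict[str, int] = {}
    with open(collection_path, "rb") as f:
        while True:
            offset = f.tell()          # position before reading the line
            line = f.readline()
            if not line:               # end of file
                break
            text_id = line.split(b"\t", 1)[0].decode("utf-8").strip()
            id2offset[text_id] = offset
    return id2offset


# Tiny demonstration on a temporary two-line collection.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write(b"d1\tfirst passage\n")
    tmp.write(b"d2\tsecond passage\n")
    demo_path = tmp.name

demo_offsets = build_id_to_offset(demo_path)
os.remove(demo_path)
```

With offsets of this kind, a record can later be fetched with a single seek instead of a scan over the whole collection.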

scripts.preprocess.init_query_and_qrel(query_path: str, qrel_path: str, cache_dir: str, tid2index: dict[str, int])

Tokenize the query file and map the passage/document/query ids in the qrel file to their indices in the saved token-ids matrix.

Parameters
  • query_path – query file path

  • qrel_path – qrel file path

  • cache_dir – the directory to save files

  • tid2index – mapping from text ids to text indices
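The id-to-index remapping described above can be sketched as follows. This is a hedged example, not the module's code: it assumes a TREC-style qrel line of the form "query_id 0 text_id relevance", and the helper name remap_qrels and the qid2index argument are hypothetical. Only tid2index appears in the documented signature.

```python
def remap_qrels(
    qrel_lines: list[str],
    qid2index: dict[str, int],
    tid2index: dict[str, int],
) -> list[tuple[int, int, int]]:
    """Replace string ids in qrel lines with row indices.

    Assumed line format: "<query_id> <unused> <text_id> <relevance>".
    Lines that are malformed or mention unknown ids are skipped.
    """
    remapped = []
    for line in qrel_lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        qid, _, tid, rel = fields
        if qid not in qid2index or tid not in tid2index:
            continue
        remapped.append((qid2index[qid], tid2index[tid], int(rel)))
    return remapped


demo_qrels = remap_qrels(
    ["q7 0 d2 1", "q9 0 d1 0", "q7 0 d404 1"],   # last line has an unknown text id
    qid2index={"q7": 0, "q9": 1},
    tid2index={"d1": 0, "d2": 1},
)
```

Storing integer indices rather than string ids lets later stages look up rows of the token-id matrix directly.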

scripts.preprocess.tokenize_to_memmap(input_path: str, cache_dir: str, num_rec: int, max_length: int, tokenizer: Any, tokenizer_type: str, tokenize_thread: int, text_col: Optional[list[int]] = None, text_col_sep: Optional[str] = None, is_query: bool = False) → dict[str, int]

Tokenize the query/passage/document text in multiple threads.

Parameters
  • input_path – query/passage/document file path

  • cache_dir – directory in which to save the output token ids and related files

  • num_rec – the number of records

  • max_length – max length of tokens

  • tokenizer (transformers.Tokenizer) – the tokenizer used to encode the text

  • tokenizer_type – the actual tokenizer vocabulary used

  • tokenize_thread – the number of threads used for tokenization

  • text_col

  • text_col_sep

  • is_query – whether the input is a query file

Returns

mapping from the id to the index in the saved token-id matrix
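One plausible way to divide the work is to split the num_rec records into contiguous index ranges, one per worker, and hand each range to a worker. The sketch below shows that idea only. The helper names split_ranges and process_range are hypothetical, and the use of ThreadPoolExecutor is an assumption; the real function may schedule its workers differently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(num_rec: int, num_workers: int) -> list[tuple[int, int]]:
    """Split [0, num_rec) into at most num_workers contiguous (start, end) ranges."""
    if num_rec <= 0 or num_workers <= 0:
        return []
    chunk = -(-num_rec // num_workers)  # ceiling division
    return [(s, min(s + chunk, num_rec)) for s in range(0, num_rec, chunk)]


def process_range(bounds: tuple[int, int]) -> int:
    """Stand-in for per-range tokenization; here it only counts records."""
    start, end = bounds
    return end - start


demo_ranges = split_ranges(10, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves the order of the input ranges.
    demo_counts = list(pool.map(process_range, demo_ranges))
```

Each (start, end) pair corresponds to the kind of start_idx and end_idx arguments that a per-range worker would receive.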

scripts.preprocess._tokenize_to_memmap(input_path: str, output_dir: str, num_rec: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, text_col: Optional[list[int]] = None, text_col_sep: Optional[str] = None, is_query: bool = False)
  1. Tokenize the input text;

  2. pad and truncate the token ids to max_length;

  3. save the token ids, token lengths, and text ids.

Parameters
  • input_path – input text file path

  • output_dir – directory of output numpy arrays

  • start_idx – the beginning index to read

  • end_idx – the ending index

  • tokenizer – transformer tokenizer

  • max_length – max length of tokens

  • text_col

  • text_col_sep

  • is_query
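The pad, truncate, and save steps can be sketched with a NumPy memory-mapped array. This is a minimal illustration under stated assumptions, not the module's implementation: the token ids are supplied as plain lists instead of coming from a real tokenizer, the padding id is assumed to be 0, the dtype is assumed to be int32, and the helper names and the file name token_ids.mmp are hypothetical.

```python
import os
import tempfile

import numpy as np


def pad_or_truncate(token_ids: list[int], max_length: int, pad_id: int = 0):
    """Clip to max_length, then right-pad with pad_id. Returns (ids, true length)."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped)), len(clipped)


def save_token_ids(batches: list[list[int]], output_path: str, max_length: int):
    """Write one fixed-width row per record into a memmap; return the true lengths."""
    mm = np.memmap(output_path, dtype=np.int32, mode="w+",
                   shape=(len(batches), max_length))
    lengths = np.zeros(len(batches), dtype=np.int32)
    for row, ids in enumerate(batches):
        padded, length = pad_or_truncate(ids, max_length)
        mm[row] = padded
        lengths[row] = length
    mm.flush()   # make sure the data reaches the file
    del mm       # release the mapping
    return lengths


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "token_ids.mmp")  # hypothetical file name
demo_lengths = save_token_ids([[5, 6, 7, 8, 9], [3, 4]], demo_path, max_length=4)

# Re-open read-only; the shape must be supplied because the file stores raw values only.
demo_matrix = np.array(np.memmap(demo_path, dtype=np.int32, mode="r", shape=(2, 4)))
```

Keeping the true lengths alongside the padded matrix makes it possible to rebuild attention masks later without storing them.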