Util

utils.util.load_pickle(path: str)

Load pickle file from path.

utils.util.save_pickle(obj, path: str)

Save pickle file.
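As a rough illustration of what these two helpers presumably do, the sketch below round-trips an object through a pickle file using only the standard library. The function bodies are an assumption made for illustration; the actual implementation in utils.util is not shown in this reference.

```python
import os
import pickle
import tempfile


def load_pickle(path: str):
    """Read one pickled object from `path` (assumed behaviour)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def save_pickle(obj, path: str) -> None:
    """Write `obj` to `path` in pickle format (assumed behaviour)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


# Round-trip a small object through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example.pkl")
    save_pickle({"ids": [1, 2, 3]}, pickle_path)
    restored = load_pickle(pickle_path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.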

utils.util.load_from_previous(model: torch.nn.Module, path: str)

Load a checkpoint saved by an older version of Uni-Retriever. Only the model parameters are loaded; the config stored in the checkpoint is overridden by the current config.

utils.util.makedirs(path: str, exist_ok: bool = True)

Shortcut for creating parent directory for a file.

Parameters
  • path – the file path whose parent directory should be created

  • exist_ok – ignore if parent folder already exists
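The point of this shortcut is that the argument is a file path, not a directory path. A minimal sketch under that reading follows; the body and the choice to return the path are assumptions for illustration, not the library's code.

```python
import os
import tempfile


def makedirs(path: str, exist_ok: bool = True) -> str:
    """Create the parent directory of the file at `path`.

    Returning `path` is an assumption made here for convenience.
    """
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=exist_ok)
    return path


with tempfile.TemporaryDirectory() as root:
    target = makedirs(os.path.join(root, "a", "b", "model.bin"))
    parent_exists = os.path.isdir(os.path.dirname(target))  # directory was created
    file_exists = os.path.exists(target)                    # the file itself was not
```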


utils.util.isempty(path: str)

Check if a folder is empty.

utils.util.update_hydra_config(config: dict)

Update the inner-layer hydra config with the values defined in the _global_ package layer.

utils.util.flatten_hydra_config(config: dict)

Flatten a two-layer hydra config dict.
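To make "two-layer" concrete, here is one plausible reading: each top-level key is a config group whose value is a dict, and flattening merges the groups into a single dict. The name flatten_two_layer, the handling of non-dict values, and the rule that later groups win on key collisions are all assumptions of this sketch.

```python
def flatten_two_layer(config: dict) -> dict:
    """Merge the inner dicts of a two-layer config into one flat dict."""
    flat = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Inner layer: lift its entries to the top level.
            flat.update(value)
        else:
            # Already a top-level scalar: keep it unchanged.
            flat[key] = value
    return flat


nested = {
    "base": {"device": 0, "seed": 42},
    "model": {"plm": "bert"},
    "debug": False,
}
flat = flatten_two_layer(nested)
```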

utils.util.synchronize(func: Optional[callable] = None)

A function or a decorator to synchronize all processes on entering and exiting the function.
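The dual "function or decorator" usage can be sketched as follows. To keep the example runnable without a distributed setup, the barrier here is a hypothetical stand-in that only records calls; in a real multi-process job it would be something like torch.distributed.barrier(). The body is an assumption, not the library's implementation.

```python
import functools
from typing import Callable, Optional

calls = []


def barrier() -> None:
    """Stand-in for a real barrier such as torch.distributed.barrier()."""
    calls.append("barrier")


def synchronize(func: Optional[Callable] = None):
    if func is None:
        # Called as a plain function: synchronize once, right now.
        barrier()
        return None

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        barrier()          # wait for all processes before entering
        try:
            return func(*args, **kwargs)
        finally:
            barrier()      # wait again before leaving

    return wrapper


@synchronize
def build_index():
    calls.append("work")
    return "done"


result = build_index()  # decorator usage
synchronize()           # function usage
```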

utils.util.all_logging_disabled(highest_level=50)

A context manager that prevents any logging messages triggered during the body from being processed.

Parameters
  • highest_level – the maximum logging level in use. This only needs to be changed if a custom level greater than CRITICAL is defined.
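A context manager with this behaviour can be written with the standard library's logging.disable. The sketch below is one common way to do it and is not necessarily identical to the code in utils.util.

```python
import logging
from contextlib import contextmanager


@contextmanager
def all_logging_disabled(highest_level=logging.CRITICAL):
    # Remember the current global disable threshold so it can be restored.
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        logging.disable(previous_level)


logger = logging.getLogger("demo")
logger.setLevel(logging.DEBUG)

with all_logging_disabled():
    enabled_inside = logger.isEnabledFor(logging.ERROR)
enabled_outside = logger.isEnabledFor(logging.ERROR)
```

Inside the block every message at or below highest_level is dropped; on exit the previous threshold is restored even if the body raises.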

utils.util.compute_metrics(retrieval_result: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], ground_truth: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], cutoffs: list[int] = [10, 100, 1000], metrics: list[str] = ['mrr', 'recall'], return_each_query: bool = False)

Compute metrics given a retrieval_result and the ground_truth dict.

Parameters
  • retrieval_result – mapping query_id to its retrieved document ids

  • ground_truth – mapping query_id to its ground truth document ids

  • cutoffs – the rank cutoffs at which to compute each metric

  • metrics – the metrics to compute

  • return_each_query – if true, return each query’s metric as np.array
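For readers unfamiliar with the two default metrics, the following sketch computes MRR@k and Recall@k from the two dicts. It handles only the plain list-of-ids form of the inputs, averages over the queries in ground_truth, and uses the key names "MRR@k" and "Recall@k"; all of these are assumptions of the sketch rather than documented behaviour of compute_metrics.

```python
def mrr_and_recall(retrieval_result, ground_truth, cutoffs=(10, 100, 1000)):
    """Return {"MRR@k": ..., "Recall@k": ...} averaged over all queries."""
    scores = {}
    num_queries = max(len(ground_truth), 1)
    for cutoff in cutoffs:
        rr_sum = 0.0
        recall_sum = 0.0
        for query_id, relevant_ids in ground_truth.items():
            relevant = set(relevant_ids)
            ranked = retrieval_result.get(query_id, [])[:cutoff]
            # Reciprocal rank of the first relevant document within the cutoff.
            for rank, doc_id in enumerate(ranked, start=1):
                if doc_id in relevant:
                    rr_sum += 1.0 / rank
                    break
            # Fraction of the relevant documents found within the cutoff.
            if relevant:
                recall_sum += len(relevant & set(ranked)) / len(relevant)
        scores[f"MRR@{cutoff}"] = rr_sum / num_queries
        scores[f"Recall@{cutoff}"] = recall_sum / num_queries
    return scores


retrieval = {0: [3, 1, 2], 1: [5, 4]}
truth = {0: [1, 9], 1: [4]}
scores = mrr_and_recall(retrieval, truth, cutoffs=(1, 2))
```

With these toy inputs neither query has a relevant document at rank 1, and both have their first hit at rank 2.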

utils.util.compute_metrics_nq(retrieval_result: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], query_answer_path: str, collection_path: str)

Compute recall on the NQ-open dataset. Since there is no ground-truth file, any passage that contains the answer is treated as relevant.

Parameters
  • retrieval_result – mapping query_id to its retrieved document ids

  • query_answer_path – the file containing the answers

  • collection_path – the collection file path
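The "passage contains the answer" criterion can be sketched like this. Because the formats of the answer and collection files are not described here, the sketch takes already-loaded in-memory dicts instead of paths; the name answer_recall, the case-insensitive substring match, and the cutoff argument are assumptions for illustration.

```python
def answer_recall(retrieval_result, answers, collection, cutoff):
    """Fraction of queries with an answer string in any of the top-`cutoff` passages."""
    hits = 0
    for query_id, doc_ids in retrieval_result.items():
        passages = [collection[doc_id].lower() for doc_id in doc_ids[:cutoff]]
        # A query counts as a hit if any answer appears in any retrieved passage.
        if any(answer.lower() in passage
               for passage in passages
               for answer in answers[query_id]):
            hits += 1
    return hits / max(len(retrieval_result), 1)


collection = {
    0: "Paris is the capital of France.",
    1: "Berlin is in Germany.",
    2: "The Nile flows through Egypt.",
}
answers = {0: ["paris"], 1: ["Amazon"]}
retrieval = {0: [1, 0], 1: [2, 1]}

recall_at_1 = answer_recall(retrieval, answers, collection, cutoff=1)
recall_at_2 = answer_recall(retrieval, answers, collection, cutoff=2)
```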

utils.util._get_title_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, title_col: list, stop_words: set, separator: str = ' ', dedup=False, stem=False, filter_num=False, filter_unit=False)

Generate code based on titles of the NQ dataset. Add a padding token at the head.

Parameters
  • input_path – the collection file path

  • output_path – the np.memmap file path to save the codes

  • all_line_count – the total number of records in the collection

  • start_idx – the starting offset

  • end_idx – the ending offset

  • tokenizer (transformers.AutoTokenizer) –

  • max_length – the maximum length of tokens

  • title_col – the columns for title and others

  • stop_words – the words to exclude

  • separator

Returns

the populated memmap file

utils.util._get_token_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, init_order: str, post_order: str, stop_words: set, separator: str = ' ', stem=False, filter_num=False, filter_unit=False, weight_path=None)

Generate code based on json files produced by models.BaseModel.BaseModel.anserini_index(). First reorder the words according to the specified order (init_order, post_order), then tokenize the resulting word sequence with tokenizer.

Parameters
  • input_path – the collection file path

  • output_path – the np.memmap file path to save the codes

  • all_line_count – the total number of records in the collection

  • start_idx – the starting idx

  • end_idx – the ending idx

  • tokenizer (transformers.AutoTokenizer) –

  • max_length – the maximum length of tokens

  • init_order – how to order the keywords: {weight, first, random, sample}

  • post_order – how to order the keywords that are among top K from the init_order: {random}

  • stop_words – some words to exclude

  • separator – used to separate semantic units

  • stem – use pyserini to stem words?

  • filter_num – filter out all numbers?

  • filter_unit – filter out all tokens with length equal to 1?

  • weight_path – if not None, store the weight of sorted semantic units

utils.util._get_chatgpt_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int)

Generate code from chatgpt keywords.

Parameters
  • input_path – the collection file path

  • output_path – the np.memmap file path to save the codes

  • all_line_count – the total number of records in the collection

  • start_idx – the starting idx

  • end_idx – the ending idx

  • tokenizer (transformers.AutoTokenizer) –

  • max_length – the maximum length of tokens

  • order – the word order {weight, lexical, orginal}

  • stop_words – some words to exclude

  • separator – used to separate words from one another

class utils.util.Cluster(device: Union[int, Literal['cpu']] = 'cpu')

Mixin for performing a variety of clustering tasks based on faiss.

__init__(device: Union[int, Literal['cpu']] = 'cpu')
Parameters

device – the gpu id or cpu

cluster

the cluster object

Type

faiss.Clustering

index

the index to compute distance when clustering

Type

faiss.Index

_kmeans(embeddings: numpy.ndarray, k: int, metric: str = 'l2', niter: int = 25)

Perform KMeans over embeddings.

Parameters
  • embeddings

  • k – the number of clusters

  • metric – the metric to compute distance

  • niter – number of iterations

kmeans(embeddings: numpy.ndarray, k: int, metric: str = 'l2', num_replicas: int = 1, **kargs) numpy.ndarray

Fit and predict by kmeans.

Parameters
  • embeddings

  • k – the number of clusters

  • metric – the metric to compute distance

  • num_replicas – how many nearest neighbors to record in the final assignments

  • niter – number of iterations

Returns

assignments array of [num_samples, num_replicas]
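The shape of the returned array is easiest to see in a small example. The sketch below is a plain NumPy version of Lloyd's k-means, swapped in for faiss so it runs without a GPU or the faiss package. It initialises the centroids from the first k points (an assumption made for determinism) and returns, for every sample, the ids of its num_replicas nearest centroids.

```python
import numpy as np


def kmeans_assign(embeddings: np.ndarray, k: int, num_replicas: int = 1, niter: int = 25):
    """Fit k-means with L2 distance and return assignments of shape [num_samples, num_replicas]."""
    centroids = embeddings[:k].astype(np.float64).copy()
    for _ in range(niter):
        # Squared L2 distance from every sample to every centroid: [num_samples, k].
        dist = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        nearest = dist.argmin(axis=1)
        for c in range(k):
            members = embeddings[nearest == c]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[c] = members.mean(axis=0)
    dist = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # Keep the num_replicas closest centroids per sample, nearest first.
    return np.argsort(dist, axis=1)[:, :num_replicas]


points = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]])
assignments = kmeans_assign(points, k=2, num_replicas=2)
```

Column 0 holds the usual single-cluster assignment; further columns hold the next-nearest clusters.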

hierarchical_kmeans(embeddings: numpy.ndarray, k: int, nleaf: int = 10, **kargs) numpy.ndarray

Fit and predict by hierarchical kmeans.

Parameters
  • embeddings

  • k – the number of clusters

  • nleaf – the maximum number of nodes in the leaf

Returns

assignments array of [num_samples, num_replicas]

class utils.util.MasterLogger(name: str)

The logger only outputs on the master node.

__init__(name: str) None
class utils.util.DotDict
class utils.util.Config(*args, **kwargs)

Config object. A dot access OrderedDict.
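"Dot access" means that keys can be read and written as attributes. A minimal sketch of such a dict built on OrderedDict is shown below; it illustrates the idea only and leaves out everything else the real Config does (distributed setup, PLM loading). Routing underscore-prefixed names to normal attributes is an assumption of this sketch.

```python
from collections import OrderedDict


class DotDict(OrderedDict):
    """An OrderedDict whose keys are also reachable as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name.startswith("_"):
            # Keep private/internal attributes out of the mapping.
            super().__setattr__(name, value)
        else:
            self[name] = value


config = DotDict(device=0)
config.batch_size = 32          # attribute write becomes a key
keys_in_order = list(config)
```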

__init__(*args, **kwargs)

Set up the parameters needed for distributed execution.

items() a set-like object providing a view on D's items
_set_distributed()

Set up distributed nccl backend.

_set_plm(plm: Optional[str] = None, already_on_main_proc=False)

Load a huggingface plm; download it if it doesn’t exist locally. One may add a new plm to the PLM_MAP object so that Manager knows how to download it (load_name) and where to store its cache files (tokenizer).

special_token_ids

stores the token and token_id of each special token

Type

Dict[Tuple]

class utils.util.BaseOutput(token_ids: Optional[numpy.ndarray] = None, embeddings: Optional[numpy.ndarray] = None, codes: Optional[numpy.ndarray] = None, index: Optional[Any] = None)

Basic output for models.BaseModel.BaseModel

__init__(token_ids: Optional[numpy.ndarray] = None, embeddings: Optional[numpy.ndarray] = None, codes: Optional[numpy.ndarray] = None, index: Optional[Any] = None) None