Util

utils.util.load_pickle(path: str): Load pickle file from path.

utils.util.save_pickle(obj, path: str): Save pickle file.
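The round trip these two helpers describe can be sketched as follows. This assumes they are thin wrappers around the standard pickle module, since the reference does not show their bodies. The functions below reuse the documented names as local stand-ins and are not the library code.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path: str) -> None:
    # Assumption: a plain pickle.dump in binary mode.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path: str):
    # Assumption: a plain pickle.load in binary mode.
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.pkl")
    save_pickle({"qid": [1, 2, 3]}, target)
    restored = load_pickle(target)
```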

utils.util.load_from_previous(model: torch.nn.Module, path: str): Load a checkpoint saved by an older version of Uni-Retriever. Only the model parameters are loaded; the config stored in the checkpoint is overridden by the current config.

utils.util.makedirs(path: str, exist_ok: bool = True)

Shortcut for creating parent directory for a file.

Parameters

path – the file path whose parent directory should be created
exist_ok – ignore if parent folder already exists
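A minimal sketch of the "parent directory for a file" behaviour, assuming the helper delegates to os.makedirs on the directory part of path. The body is a guess at the described behaviour, not the library source, and the file name is made up for illustration.

```python
import os
import tempfile


def makedirs(path: str, exist_ok: bool = True) -> None:
    # Assumption: `path` names a file, so only its parent directory is created.
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=exist_ok)


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "a", "b", "codes.mmp")  # hypothetical file
    makedirs(file_path)
    parent_created = os.path.isdir(os.path.join(tmp, "a", "b"))
    file_created = os.path.exists(file_path)
    makedirs(file_path)  # no error on the second call because exist_ok=True
```

Note that the file itself is never created; only the folders leading to it are.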

utils.util.readlink(path: str): Read permalink.

utils.util.isempty(path: str): Check if a folder is empty.

utils.util.update_hydra_config(config: dict): Update the inner-layer hydra config with the values defined in the _global_ package layer.

utils.util.flatten_hydra_config(config: dict): Flatten a two-layer hydra config dict
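One plausible reading of "flatten a two-layer config dict" is sketched below. The group names, the keys, and the rule that later groups overwrite earlier ones on a key collision are all assumptions made for illustration; the actual function may behave differently.

```python
def flatten_two_layer(config: dict) -> dict:
    """Lift the entries of every second-layer dict to the top level."""
    flat = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Second-layer entries replace their group key.
            # Assumption: on a key collision the later group wins.
            flat.update(value)
        else:
            flat[key] = value
    return flat


nested = {
    "base": {"device": 0, "seed": 42},
    "model": {"lr": 0.001},
    "mode": "train",
}
flat = flatten_two_layer(nested)
```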

utils.util.synchronize(func: Optional[callable] = None): A function or a decorator to synchronize all processes on entering and exiting the function.
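The dual "function or decorator" usage can be sketched with a stand-in barrier. The barrier() function here is hypothetical and only records that it was called. In a real distributed run it would be a collective such as torch.distributed.barrier(), and the library's implementation may differ.

```python
import functools
from typing import Callable, Optional

barrier_calls = []


def barrier() -> None:
    # Hypothetical stand-in for a real collective barrier;
    # it only records that a synchronization point was reached.
    barrier_calls.append("barrier")


def synchronize(func: Optional[Callable] = None):
    if func is None:
        # Used as a plain function: synchronize once and return.
        barrier()
        return None

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        barrier()  # wait for all processes on entering
        result = func(*args, **kwargs)
        barrier()  # wait for all processes on exiting
        return result

    return wrapper


@synchronize
def build_index(n: int) -> int:
    return n * 2


doubled = build_index(3)  # triggers two barrier calls
synchronize()             # triggers one more
```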

utils.util.all_logging_disabled(highest_level=50)

A context manager that will prevent any logging messages triggered during the body from being processed.

Parameters

highest_level – the maximum logging level in use. This would only need to be changed if a custom level greater than CRITICAL is defined.
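A context manager with this behaviour is commonly built on logging.disable. The sketch below follows that pattern and may not match the library's body line for line. The logger name and the ListHandler class are illustrative only.

```python
import logging
from contextlib import contextmanager


@contextmanager
def all_logging_disabled(highest_level: int = logging.CRITICAL):
    # Remember the current global threshold so it can be restored afterwards.
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        logging.disable(previous_level)


records = []


class ListHandler(logging.Handler):
    """Collects emitted messages so the effect can be inspected."""

    def emit(self, record):
        records.append(record.getMessage())


logger = logging.getLogger("all_logging_disabled_demo")
logger.addHandler(ListHandler())
logger.setLevel(logging.INFO)
logger.propagate = False

with all_logging_disabled():
    logger.error("suppressed")  # dropped: ERROR <= CRITICAL
logger.error("visible")         # processed again after the block
```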

utils.util.compute_metrics(retrieval_result: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], ground_truth: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], cutoffs: list[int] = [10, 100, 1000], metrics: list[str] = ['mrr', 'recall'], return_each_query: bool = False)

Compute metrics given a retrieval_result and the ground_truth dict.

Parameters

retrieval_result – mapping query_id to its retrieved document ids
ground_truth – mapping query_id to its ground truth document ids
cutoffs – the rank cutoffs at which to compute the metrics
metrics – the metrics to compute
return_each_query – if true, return each query’s metric as np.array
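As an illustration of the two default metrics, the sketch below computes MRR and recall at each cutoff from the two dict shapes described above. The function name, the "MRR@k" / "Recall@k" output keys, and the choice to average over all queries in ground_truth are assumptions; the library's exact conventions are not shown in this reference.

```python
def _ids(items):
    # Accept either [doc_id, ...] or [(doc_id, score), ...].
    return [x[0] if isinstance(x, tuple) else x for x in items]


def mrr_and_recall(retrieval_result, ground_truth, cutoffs=(10, 100, 1000)):
    metrics = {}
    num_queries = len(ground_truth)
    for k in cutoffs:
        mrr_sum, recall_sum = 0.0, 0.0
        for qid, truth in ground_truth.items():
            relevant = set(_ids(truth))
            top_k = _ids(retrieval_result.get(qid, []))[:k]
            # Reciprocal rank of the first relevant document, 0 if none.
            for rank, doc_id in enumerate(top_k, start=1):
                if doc_id in relevant:
                    mrr_sum += 1.0 / rank
                    break
            # Fraction of relevant documents found in the top k.
            # Assumes every query has at least one relevant document.
            recall_sum += len(relevant.intersection(top_k)) / len(relevant)
        metrics[f"MRR@{k}"] = mrr_sum / num_queries
        metrics[f"Recall@{k}"] = recall_sum / num_queries
    return metrics


retrieval_result = {0: [5, 3, 9], 1: [(7, 0.9), (2, 0.4)]}
ground_truth = {0: [3], 1: [2, 8]}
scores = mrr_and_recall(retrieval_result, ground_truth, cutoffs=(1, 2))
```

With these toy inputs, nothing relevant is ranked first, so both metrics are 0 at cutoff 1. At cutoff 2 each query finds its first relevant document at rank 2 (MRR 0.5), and the average recall is 0.75.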

utils.util.compute_metrics_nq(retrieval_result: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], query_answer_path: str, collection_path: str)

Compute recall on the NQ-open dataset. Since there is no ground-truth file, any passage containing the answer is treated as relevant.

Parameters

retrieval_result – mapping query_id to its retrieved document ids
query_answer_path – the file containing the answers
collection_path – the collection file path
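The answer-containment rule can be sketched with in-memory dicts standing in for the two files, whose formats this reference does not specify. Case-insensitive substring matching and counting a query as a hit when any of its top-k passages contains any answer are assumptions, as are the function name and the toy data.

```python
def answer_recall(retrieval_result, answers, collection, cutoff):
    """Fraction of queries whose top-`cutoff` passages contain an answer."""
    hits = 0
    for qid, retrieved in retrieval_result.items():
        # Accept either [doc_id, ...] or [(doc_id, score), ...].
        doc_ids = [x[0] if isinstance(x, tuple) else x for x in retrieved]
        texts = [collection[d].lower() for d in doc_ids[:cutoff]]
        # Assumption: simple case-insensitive substring match.
        if any(ans.lower() in text for ans in answers[qid] for text in texts):
            hits += 1
    return hits / len(retrieval_result)


collection = {
    0: "Paris is the capital of France.",
    1: "Berlin is the capital of Germany.",
    2: "The Nile flows through Egypt.",
}
answers = {0: ["Paris"], 1: ["Amazon"]}
retrieval_result = {0: [1, 0], 1: [2, 1]}

recall_at_1 = answer_recall(retrieval_result, answers, collection, cutoff=1)
recall_at_2 = answer_recall(retrieval_result, answers, collection, cutoff=2)
```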

utils.util._get_title_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, title_col: list, stop_words: set, separator: str = ' ', dedup=False, stem=False, filter_num=False, filter_unit=False)

Generate code based on titles of the NQ dataset. Add a padding token at the head.

Parameters

input_path – the collection file path
output_path – the np.memmap file path to save the codes
all_line_count – the total number of records in the collection
start_idx – the starting offset
end_idx – the ending offset
tokenizer (transformers.AutoTokenizer) –
max_length – the maximum length of tokens
title_col – the columns for title and others
stop_words – the words to exclude
separator –

Returns

the populated memmap file

utils.util._get_token_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, init_order: str, post_order: str, stop_words: set, separator: str = ' ', stem=False, filter_num=False, filter_unit=False, weight_path=None)

Generate code based on json files produced by models.BaseModel.BaseModel.anserini_index(). First reorder the words according to init_order and post_order, then tokenize the resulting word sequence with tokenizer.

Parameters

input_path – the collection file path
output_path – the np.memmap file path to save the codes
all_line_count – the total number of records in the collection
start_idx – the starting idx
end_idx – the ending idx
tokenizer (transformers.AutoTokenizer) –
max_length – the maximum length of tokens
init_order – how to order the keywords: {weight, first, random, sample}
post_order – how to order the keywords that are among top K from the init_order: {random}
stop_words – some words to exclude
separator – used to separate semantic units
stem – use pyserini to stem words?
filter_num – filter out all numbers?
filter_unit – filter out all tokens with length equal to 1?
weight_path – if not None, store the weight of sorted semantic units

utils.util._get_chatgpt_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int)

Generate code from chatgpt keywords.

Parameters

input_path – the collection file path
output_path – the np.memmap file path to save the codes
all_line_count – the total number of records in the collection
start_idx – the starting idx
end_idx – the ending idx
tokenizer (transformers.AutoTokenizer) –
max_length – the maximum length of tokens
order – the word order {weight, lexical, original}
stop_words – some words to exclude
separator – used to separate words from words

class utils.util.Cluster(device: Union[int, Literal['cpu']] = 'cpu')

Mixin for performing a variety of clustering tasks based on faiss.

__init__(device: Union[int, Literal['cpu']] = 'cpu')

Parameters: device – the gpu id or cpu

cluster

the cluster object

Type: faiss.Clustering

index

the index to compute distance when clustering

Type: faiss.Index

_kmeans(embeddings: numpy.ndarray, k: int, metric: str = 'l2', niter: int = 25)

Perform KMeans over embeddings.

Parameters

embeddings –
k – the number of clusters
metric – the metric to compute distance
niter – number of iterations

kmeans(embeddings: numpy.ndarray, k: int, metric: str = 'l2', num_replicas: int = 1, **kargs) → numpy.ndarray

Fit and predict by kmeans.

Parameters

embeddings –
k – the number of clusters
metric – the metric to compute distance
num_replicas – how many nearest neighbors to record in the final assignments
niter – number of iterations

Returns

assignments array of [num_samples, num_replicas]
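To make the [num_samples, num_replicas] return shape concrete, the sketch below shows only the assignment step. It uses plain Python lists and squared L2 distance in place of numpy arrays and a faiss index. It assumes num_replicas means that each sample is recorded with its num_replicas closest centroids, which matches the parameter description but is not confirmed by the source. The function name and the fixed centroids are illustrative.

```python
def assign_to_nearest(embeddings, centroids, num_replicas=1):
    """Return, per sample, the ids of its `num_replicas` closest centroids."""
    assignments = []
    for vector in embeddings:
        # Squared L2 distance from this vector to every centroid.
        distances = [
            sum((v - c) ** 2 for v, c in zip(vector, centroid))
            for centroid in centroids
        ]
        ranked = sorted(range(len(centroids)), key=lambda i: distances[i])
        assignments.append(ranked[:num_replicas])
    return assignments


# Centroids are fixed by hand here; in practice they come from the fit step.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
embeddings = [[1.0, 0.0], [9.0, 1.0], [1.0, 8.0]]
assignments = assign_to_nearest(embeddings, centroids, num_replicas=2)
```

Each row holds the closest centroid first, followed by the next closest, so the first column alone corresponds to the usual single-cluster assignment.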

hierarchical_kmeans(embeddings: numpy.ndarray, k: int, nleaf: int = 10, **kargs) → numpy.ndarray

Fit and predict by hierarchical kmeans.

Parameters

embeddings –
k – the number of clusters
nleaf – the maximum number of nodes in the leaf

Returns

assignments array of [num_samples, num_replicas]

class utils.util.MasterLogger(name: str)

The logger only outputs on the master node.

__init__(name: str) → None

class utils.util.DotDict

class utils.util.Config(*args, **kwargs)

Config object. A dot access OrderedDict.

__init__(*args, **kwargs): Set up the parameters required for distributed execution.

items() → a set-like object providing a view on D's items

_set_distributed(): Set up distributed nccl backend.

_set_plm(plm: Optional[str] = None, already_on_main_proc=False)

Load a Hugging Face PLM, downloading it if it does not exist locally. One may add a new PLM to the PLM_MAP object so that Manager knows how to download it (load_name) and where to store its cache files (tokenizer).

special_token_ids

stores the token and token_id of each special token

Type: Dict[Tuple]
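The "dot access OrderedDict" idea behind Config can be sketched as below. The class name DotOrderedDict and the rule that underscore-prefixed names stay ordinary attributes are assumptions for illustration; the real Config also sets up distributed state and PLM loading, which this sketch leaves out.

```python
from collections import OrderedDict


class DotOrderedDict(OrderedDict):
    """An OrderedDict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name.startswith("_"):
            # Keep private/internal names as ordinary attributes.
            super().__setattr__(name, value)
        else:
            self[name] = value


config = DotOrderedDict(device="cpu", batch_size=32)
config.batch_size = 64   # same as config["batch_size"] = 64
config.plm = "bert"      # new keys keep insertion order
keys_in_order = list(config.keys())
```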

class utils.util.BaseOutput(token_ids: Optional[numpy.ndarray] = None, embeddings: Optional[numpy.ndarray] = None, codes: Optional[numpy.ndarray] = None, index: Optional[Any] = None)

Basic output for models.BaseModel.BaseModel

__init__(token_ids: Optional[numpy.ndarray] = None, embeddings: Optional[numpy.ndarray] = None, codes: Optional[numpy.ndarray] = None, index: Optional[Any] = None) → None