Util
- utils.util.load_from_previous(model: torch.nn.Module, path: str)
Load a checkpoint saved by an older version of Uni-Retriever. Only the model parameters are loaded; the checkpoint's config is overridden by the current config.
- utils.util.makedirs(path: str, exist_ok: bool = True)
Shortcut for creating the parent directory of a file.
- Parameters
path – the file path whose parent directory should be created
exist_ok – if true, do nothing when the parent folder already exists
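The behaviour described above can be sketched with the standard library. This is a minimal illustration, not the Uni-Retriever source: returning `path` unchanged is an assumption made here for convenience.

```python
import os


def makedirs(path: str, exist_ok: bool = True) -> str:
    """Create the parent directory of ``path``; the file itself is not created."""
    # Resolve to an absolute path so a bare file name still has a parent directory.
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=exist_ok)
    # Assumption: hand the path back so the call can be used inline, e.g. open(makedirs(p), "w").
    return path
```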
- utils.util.update_hydra_config(config: dict)
Update the inner-layer hydra config with the values defined in the _global_ package layer.
- utils.util.synchronize(func: Optional[callable] = None)
A function or a decorator that synchronizes all processes on entering and exiting the function.
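One way to get this dual "function or decorator" behaviour is sketched below. It assumes the synchronization point is `torch.distributed.barrier()`; the `_barrier` helper is hypothetical and degrades to a no-op when PyTorch is missing or no process group is initialized. The actual implementation may differ.

```python
import functools
from typing import Callable, Optional


def _barrier() -> None:
    """Hypothetical helper: wait for all processes, or do nothing in a single-process run."""
    try:
        import torch.distributed as dist
    except ImportError:
        return
    if dist.is_available() and dist.is_initialized():
        dist.barrier()


def synchronize(func: Optional[Callable] = None):
    """Called bare, act as a barrier; given a function, wrap it with barriers."""
    if func is None:
        _barrier()
        return None

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        _barrier()  # synchronize on entering
        result = func(*args, **kwargs)
        _barrier()  # synchronize on exiting
        return result

    return wrapper
```

Under these assumptions, `synchronize()` can be placed inline between two stages, while `@synchronize` wraps a whole function.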
- utils.util.all_logging_disabled(highest_level=50)
A context manager that prevents any logging messages triggered during the body from being processed.
- Parameters
highest_level – the maximum logging level in use. This only needs to be changed if a custom level greater than CRITICAL is defined.
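A context manager with this behaviour can be built on `logging.disable`. The sketch below is a common standard-library recipe and is assumed, not confirmed, to match what the utility does internally.

```python
import logging
from contextlib import contextmanager


@contextmanager
def all_logging_disabled(highest_level: int = logging.CRITICAL):
    """Suppress every log record at or below ``highest_level`` inside the block."""
    # Remember the current global threshold so it can be restored afterwards.
    previous_level = logging.root.manager.disable
    logging.disable(highest_level)
    try:
        yield
    finally:
        logging.disable(previous_level)
```

Restoring the previous threshold in `finally` keeps logging intact even if the body raises.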
- utils.util.compute_metrics(retrieval_result: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], ground_truth: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], cutoffs: list[int] = [10, 100, 1000], metrics: list[str] = ['mrr', 'recall'], return_each_query: bool = False)
Compute metrics given a retrieval_result and the ground_truth dict.
- Parameters
retrieval_result – mapping query_id to its retrieved document ids
ground_truth – mapping query_id to its ground truth document ids
cutoffs – the rank cutoffs at which to compute each metric
metrics – the metrics to compute
return_each_query – if true, return each query’s metric as np.array
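To make the two default metrics concrete, here is a small pure-Python sketch of MRR and recall at several cutoffs. The function name `mrr_and_recall`, the output key format, and the exact metric definitions are assumptions for illustration. It handles only plain document-id lists, not the `(id, score)` tuple form that the signature also allows.

```python
from typing import Dict, List, Sequence


def mrr_and_recall(
    retrieval_result: Dict[int, List[int]],
    ground_truth: Dict[int, List[int]],
    cutoffs: Sequence[int] = (10, 100, 1000),
) -> Dict[str, float]:
    """Average MRR@k and Recall@k over all queries in ``ground_truth``."""
    num_queries = max(len(ground_truth), 1)
    metrics: Dict[str, float] = {}
    for cutoff in cutoffs:
        mrr_total = 0.0
        recall_total = 0.0
        for query_id, relevant in ground_truth.items():
            relevant_set = set(relevant)
            top_k = retrieval_result.get(query_id, [])[:cutoff]
            # MRR: reciprocal rank of the first relevant document within the cutoff.
            for rank, doc_id in enumerate(top_k, start=1):
                if doc_id in relevant_set:
                    mrr_total += 1.0 / rank
                    break
            # Recall: fraction of relevant documents found within the cutoff.
            if relevant_set:
                recall_total += len(relevant_set.intersection(top_k)) / len(relevant_set)
        metrics[f"MRR@{cutoff}"] = mrr_total / num_queries
        metrics[f"Recall@{cutoff}"] = recall_total / num_queries
    return metrics
```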
- utils.util.compute_metrics_nq(retrieval_result: Union[dict[int, list[int]], dict[int, list[tuple[int, float]]]], query_answer_path: str, collection_path: str)
Compute recall on the NQ-open dataset. Since there is no ground-truth file, any passage containing the answer is treated as relevant.
- Parameters
retrieval_result – mapping query_id to its retrieved document ids
query_answer_path – the file containing the answers
collection_path – the collection file path
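The answer-containment idea can be sketched as follows. To stay self-contained, this illustration takes in-memory dicts instead of the two file paths and uses a case-insensitive substring match. The name `answer_recall`, the `cutoff` argument, and the matching rule are all assumptions, so the real function may normalize or tokenize text differently.

```python
from typing import Dict, List


def answer_recall(
    retrieval_result: Dict[int, List[int]],
    answers: Dict[int, List[str]],
    passages: Dict[int, str],
    cutoff: int = 10,
) -> float:
    """Fraction of queries whose top-``cutoff`` passages contain at least one answer."""
    if not answers:
        return 0.0
    hits = 0
    for query_id, query_answers in answers.items():
        lowered_answers = [answer.lower() for answer in query_answers]
        top_k = retrieval_result.get(query_id, [])[:cutoff]
        # A query counts as a hit if any retrieved passage contains any answer string.
        if any(
            answer in passages[doc_id].lower()
            for doc_id in top_k
            for answer in lowered_answers
        ):
            hits += 1
    return hits / len(answers)
```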
- utils.util._get_title_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, title_col: list, stop_words: set, separator: str = ' ', dedup=False, stem=False, filter_num=False, filter_unit=False)
Generate code based on titles of the NQ dataset. Add a padding token at the head.
- Parameters
input_path – the collection file path
output_path – the np.memmap file path to save the codes
all_line_count – the total number of records in the collection
start_idx – the starting offset
end_idx – the ending offset
tokenizer (transformers.AutoTokenizer) –
max_length – the maximum length of tokens
title_col – the columns for title and others
stop_words – the words to exclude
separator –
- Returns
the populated memmap file
- utils.util._get_token_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int, init_order: str, post_order: str, stop_words: set, separator: str = ' ', stem=False, filter_num=False, filter_unit=False, weight_path=None)
Generate code based on json files produced by models.BaseModel.BaseModel.anserini_index(). First reorder the words by order, then tokenize the word sequence with tokenizer.
- Parameters
input_path – the collection file path
output_path – the np.memmap file path to save the codes
all_line_count – the total number of records in the collection
start_idx – the starting idx
end_idx – the ending idx
tokenizer (transformers.AutoTokenizer) –
max_length – the maximum length of tokens
init_order – how to order the keywords: {weight, first, random, sample}
post_order – how to order the keywords that are among top K from the init_order: {random}
stop_words – some words to exclude
separator – used to separate semantic units
stem – use pyserini to stem words?
filter_num – filter out all numbers?
filter_unit – filter out all tokens with length equal to 1?
weight_path – if not None, store the weight of sorted semantic units
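The filtering and reordering step described by these parameters might look like the sketch below. It is hypothetical: it assumes each record is a word-to-weight mapping, that "first" means order of first appearance, and that `str.isdigit` is an adequate test for numbers. Only the `weight` and `first` orders are shown, and the tokenization and memmap writing are omitted.

```python
from typing import Dict, Set


def order_keywords(
    weights: Dict[str, float],
    stop_words: Set[str],
    init_order: str = "weight",
    separator: str = " ",
    filter_num: bool = False,
    filter_unit: bool = False,
) -> str:
    """Filter the keywords of one record, order them, and join them into one string."""
    words = []
    for word in weights:  # dicts preserve insertion order, i.e. first appearance
        if word in stop_words:
            continue
        if filter_num and word.isdigit():
            continue
        if filter_unit and len(word) == 1:
            continue
        words.append(word)

    if init_order == "weight":
        # Stable sort: ties keep their order of first appearance.
        words.sort(key=lambda word: weights[word], reverse=True)
    elif init_order != "first":
        raise ValueError(f"unsupported init_order in this sketch: {init_order}")
    return separator.join(words)
```

The resulting string would then be passed to the tokenizer and truncated to `max_length`.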
- utils.util._get_chatgpt_code(input_path: str, output_path: str, all_line_count: int, start_idx: int, end_idx: int, tokenizer: Any, max_length: int)
Generate code from chatgpt keywords.
- Parameters
input_path – the collection file path
output_path – the np.memmap file path to save the codes
all_line_count – the total number of records in the collection
start_idx – the starting idx
end_idx – the ending idx
tokenizer (transformers.AutoTokenizer) –
max_length – the maximum length of tokens
order – the word order {weight, lexical, original}
stop_words – some words to exclude
separator – used to separate words from each other
- class utils.util.Cluster(device: Union[int, Literal['cpu']] = 'cpu')
Mixin for performing a variety of clustering tasks based on faiss.
- __init__(device: Union[int, Literal['cpu']] = 'cpu')
- Parameters
device – the gpu id or cpu
- cluster
the cluster object
- Type
faiss.Clustering
- index
the index to compute distance when clustering
- Type
faiss.Index
- _kmeans(embeddings: numpy.ndarray, k: int, metric: str = 'l2', niter: int = 25)
Perform KMeans over embeddings.
- Parameters
embeddings –
k – the number of clusters
metric – the metric to compute distance
niter – number of iterations
- kmeans(embeddings: numpy.ndarray, k: int, metric: str = 'l2', num_replicas: int = 1, **kargs) → numpy.ndarray
Fit and predict by kmeans.
- Parameters
embeddings –
k – the number of clusters
metric – the metric to compute distance
num_replicas – how many nearest neighbors to record in the final assignments
niter – number of iterations
- Returns
the assignments array of shape [num_samples, num_replicas]
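The role of num_replicas is easiest to see in the prediction step: each sample keeps its num_replicas nearest centroids instead of only one. The sketch below illustrates this in plain Python with squared L2 distance. It is not the faiss-based implementation: the function name `assign_to_centroids` is hypothetical, and the centroids are taken as given, not fitted.

```python
from typing import List, Sequence


def assign_to_centroids(
    embeddings: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    num_replicas: int = 1,
) -> List[List[int]]:
    """Return, per sample, the indices of its ``num_replicas`` nearest centroids."""
    assignments = []
    for vector in embeddings:
        # Squared L2 distance from this sample to every centroid.
        distances = [
            sum((v - c) ** 2 for v, c in zip(vector, centroid))
            for centroid in centroids
        ]
        nearest = sorted(range(len(centroids)), key=distances.__getitem__)
        assignments.append(nearest[:num_replicas])
    return assignments
```

The result has one row per sample and num_replicas columns, matching the documented [num_samples, num_replicas] shape.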
- class utils.util.DotDict
- class utils.util.Config(*args, **kwargs)
Config object. A dot access OrderedDict.
- __init__(*args, **kwargs)
Initialize the parameters required for distributed execution.
- items() → a set-like object providing a view on D's items
- _set_distributed()
Set up the distributed nccl backend.
- _set_plm(plm: Optional[str] = None, already_on_main_proc=False)
Load a huggingface plm; download it if it doesn’t exist. One may add a new plm into the PLM_MAP object so that Manager knows how to download it (load_name) and where to store its cache files (tokenizer).
- special_token_ids
stores the token and token_id of each special token
- Type
Dict[Tuple]
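Config is described above as a dot-access OrderedDict. A minimal sketch of that pattern follows, assuming attribute access simply falls through to item access. The real DotDict and Config classes may add behaviour not shown here, such as the distributed setup.

```python
from collections import OrderedDict


class DotDict(OrderedDict):
    """An OrderedDict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Keep private names as real attributes; store everything else as items.
        if name.startswith("_"):
            super().__setattr__(name, value)
        else:
            self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name) from None
```

With this sketch, `cfg.batch_size` and `cfg["batch_size"]` refer to the same entry, and a missing key raises `AttributeError` so that `hasattr` and `copy` behave as expected.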