autofaiss.external package

Submodules

autofaiss.external.build module

Gathers the functions necessary to build an index.

autofaiss.external.build.create_index(embedding_reader, index_key, metric_type, current_memory_available, embedding_ids_df_handler=None, use_gpu=False, make_direct_map=False, distributed=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', nb_indices_to_keep=1, index_optimizer=None)[source]

Function that returns an index built from the NumPy arrays stored on disk in the embeddings_path path.

Return type

Tuple[Optional[Index], Dict[str, str]]

autofaiss.external.build.estimate_memory_required_for_index_creation(nb_vectors, vec_dim, index_key=None, max_index_memory_usage=None, make_direct_map=False, nb_indices_to_keep=1)[source]

Estimates the RAM necessary to create the index. The returned value is in bytes.

Return type

Tuple[int, str]

autofaiss.external.build.get_estimated_construction_time_infos(nb_vectors, vec_dim, indent=0)[source]

Gives a rough estimate of the construction time of the index.

Return type

str

autofaiss.external.descriptions module

File that contains the descriptions of the different indices features.

class autofaiss.external.descriptions.IndexBlock(value)[source]

Bases: enum.Enum

An enumeration.

FLAT = 0
HNSW = 3
IVF = 1
IVF_HNSW = 2
OPQ = 5
PAD = 6
PQ = 4

class autofaiss.external.descriptions.TunableParam(value)[source]

Bases: enum.Enum

An enumeration.

EFSEARCH = 0
HT = 2
NPROBE = 1

autofaiss.external.metadata module

Index metadata for Faiss indices.

class autofaiss.external.metadata.IndexMetadata(index_key, nb_vectors, dim_vector, make_direct_map=False)[source]

Bases: object

Class to compute index metadata given the index_key, the number of vectors and their dimension.

Note: We don’t create classes for each index type in order to keep the code simple.

compute_memory_necessary_for_ivf_flat(nb_training_vectors)[source]

Compute the memory estimation for index type IVF_FLAT.

compute_memory_necessary_for_opq_ivf_hnsw_pq(nb_training_vectors)[source]

Compute the memory estimation for index type OPQ_IVF_HNSW_PQ.

Return type

float

compute_memory_necessary_for_opq_ivf_pq(nb_training_vectors)[source]

Compute the memory estimation for index type OPQ_IVF_PQ.

Return type

float

compute_memory_necessary_for_pad_ivf_hnsw_pq(nb_training_vectors)[source]

Compute the memory estimation for index type PAD_IVF_HNSW_PQ.

compute_memory_necessary_for_training(nb_training_vectors)[source]

Function that computes the memory necessary to train an index with nb_training_vectors vectors

Return type

float

estimated_index_size_in_bytes()[source]

Compute the estimated size of the index in bytes.

Return type

int
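
As a rough illustration of the kind of arithmetic behind such an estimate, the sketch below computes two well-known storage costs: a flat float32 index stores 4 bytes per dimension per vector, and product-quantization codes take one byte per sub-quantizer at 8 bits each. The helper names are hypothetical, and this is not the formula used by IndexMetadata, which may also account for identifiers, centroids and graph links.

```python
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(nb_vectors: int, dim_vector: int) -> int:
    """Raw storage of uncompressed float32 vectors (hypothetical helper)."""
    return nb_vectors * dim_vector * BYTES_PER_FLOAT32


def pq_codes_bytes(nb_vectors: int, nb_subquantizers: int, bits_per_code: int = 8) -> int:
    """Storage of the PQ codes only, ignoring ids and codebooks (hypothetical helper)."""
    return nb_vectors * nb_subquantizers * bits_per_code // 8


flat_size = flat_index_bytes(1_000_000, 128)  # 512_000_000 bytes
pq_size = pq_codes_bytes(1_000_000, 64)       # 64_000_000 bytes
```

In this example the PQ codes are eight times smaller than the raw vectors, which is the kind of trade-off the quantized index types are meant to exploit.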

get_index_description(tunable_parameters_infos=False)[source]

Gives a generic description of the index.

Return type

str

get_index_type()[source]

Returns the index type.

Return type

IndexType

class autofaiss.external.metadata.IndexType(value)[source]

Bases: enum.Enum

An enumeration.

FLAT = 0
HNSW = 1
IVF_FLAT = 6
NOT_SUPPORTED = 5
OPQ_IVF_HNSW_PQ = 3
OPQ_IVF_PQ = 2
PAD_IVF_HNSW_PQ = 4

autofaiss.external.metadata.compute_memory_necessary_for_training_wrapper(nb_training_vectors, index_key, dim_vector)[source]

autofaiss.external.optimize module

Functions to find optimal index parameters

autofaiss.external.optimize.binary_search_on_param(index, parameter_range, max_speed_ms, hyperparameter_str_from_param, timeout_boost_for_precision_search=6.0, use_gpu=False, max_timeout_per_iteration_s=1.0)[source]

Apply a binary search on a given hyperparameter to maximize the recall given a query speed constraint in milliseconds/query.

Parameters
  • index (faiss.Index) – Index to search on.

  • parameter_range (List[T]) – List of possible values for the hyperparameter.

  • max_speed_ms (float) – Maximum query speed in milliseconds/query.

  • hyperparameter_str_from_param (Callable[[T], str]) – Function to generate a hyperparameter string from the hyperparameter value on which we do a binary search.

  • timeout_boost_for_precision_search (float) – Timeout boost for the precision search phase.

  • use_gpu (bool) – Whether the index is on the GPU.

  • max_timeout_per_iteration_s (float) – Maximum timeout per iteration in seconds.
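
The idea of a binary search under a speed constraint can be sketched in plain Python. This is a minimal sketch, not the library's implementation: it assumes the candidate values are sorted so that query time grows along the list, and it replaces the real index benchmark with a hypothetical `query_time_ms` callable.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def largest_param_within_budget(
    parameter_range: Sequence[T],
    query_time_ms: Callable[[T], float],
    max_speed_ms: float,
) -> T:
    """Return the last value in parameter_range whose query time fits the budget.

    Falls back to the first value when no candidate fits.
    """
    lo, hi = 0, len(parameter_range) - 1
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if query_time_ms(parameter_range[mid]) <= max_speed_ms:
            best = mid      # fits the budget: try a more expensive value
            lo = mid + 1
        else:
            hi = mid - 1    # too slow: try a cheaper value
    return parameter_range[best]


# Stand-in for a real benchmark: pretend each probe costs 0.3 ms.
def fake_timer(nprobe: int) -> float:
    return 0.3 * nprobe


nprobe_values = [1, 2, 4, 8, 16, 32, 64, 128]
chosen = largest_param_within_budget(nprobe_values, fake_timer, max_speed_ms=10.0)
```

Here `chosen` is 32, the largest value whose simulated query time (9.6 ms) stays under the 10 ms budget. Picking the largest admissible value relies on the assumption that recall does not decrease as the parameter grows.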

autofaiss.external.optimize.check_if_index_needs_training(index_key)[source]

Function that checks if the index needs to be trained

Return type

bool

autofaiss.external.optimize.get_optimal_batch_size(vec_dim, current_memory_available)[source]

Computes the optimal batch size to make full use of the available RAM when adding vectors.

Return type

int

autofaiss.external.optimize.get_optimal_hyperparameters(index, index_key, max_speed_ms, use_gpu=False, max_timeout_per_iteration_s=1.0, min_ef_search=32)[source]

Finds the optimal hyperparameters to maximize the recall given a query speed constraint in milliseconds/query.

Return type

str

autofaiss.external.optimize.get_optimal_index_keys_v2(nb_vectors, dim_vector, max_index_memory_usage, flat_threshold=1000, quantization_threshold=10000, force_pq=None, make_direct_map=False, should_be_memory_mappable=False, ivf_flat_threshold=1000000, use_gpu=False)[source]

Gives a list of interesting indices to try; the first one in the list is the most promising.

See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index for detailed explanations.

Return type

List[str]

autofaiss.external.optimize.get_optimal_ivf(nb_vectors)[source]

Function that returns a list of relevant index_keys to create non-quantized IVF indices.

Parameters

nb_vectors (int) – Number of vectors in the dataset.

Return type

List[str]

autofaiss.external.optimize.get_optimal_nb_clusters(nb_vectors)[source]

Returns a list with the recommended number of clusters for an index containing nb_vectors vectors. The first value is the most recommended one. See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

Return type

List[int]
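
For orientation, the rules of thumb from the Faiss guidelines linked above can be written down as follows. This is a sketch of those guidelines only; `suggested_nb_clusters` is a hypothetical helper, and the values actually returned by get_optimal_nb_clusters may differ.

```python
import math
from typing import List


def suggested_nb_clusters(nb_vectors: int) -> List[int]:
    """Cluster counts paraphrased from the Faiss index selection guidelines."""
    if nb_vectors < 1_000_000:
        # Guideline: between 4*sqrt(N) and 16*sqrt(N) clusters.
        root = math.sqrt(nb_vectors)
        return [max(1, round(factor * root)) for factor in (4, 8, 16)]
    if nb_vectors < 10_000_000:
        return [65_536]
    if nb_vectors < 100_000_000:
        return [262_144]
    return [1_048_576]


small = suggested_nb_clusters(10_000)      # [400, 800, 1600]
medium = suggested_nb_clusters(5_000_000)  # [65536]
```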

autofaiss.external.optimize.get_optimal_quantization(nb_vectors, dim_vector, force_quantization_value=None, force_max_index_memory_usage=None)[source]

Function that returns a list of relevant index_keys to create quantized indices.

Parameters
  • nb_vectors (int) – Number of vectors in the dataset.

  • dim_vector (int) – Dimension of the vectors in the dataset.

  • force_quantization_value (Optional[int]) – Force this value to be used as the size of the quantized vectors (PQx). It can be combined with the force_max_index_memory_usage parameter, but the result might be empty.

  • force_max_index_memory_usage (Optional[str]) – Add a memory constraint on the index. It can be combined with the force_quantization_value parameter, but the result might be empty.

Returns

index_keys (List[str]) – List of index_keys that would be good choices for quantization. The list can be empty if the given constraints are too strict.

Return type

List[str]

autofaiss.external.optimize.get_optimal_train_size(nb_vectors, index_key, current_memory_available, vec_dim)[source]

Function that determines the number of training points necessary to train the index, based on faiss heuristics for k-means clustering.

Return type

int
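
A sketch of how such a heuristic could look is given below. It relies on the default k-means setting in Faiss of at most 256 training points per centroid, and caps the result by the number of float32 vectors that fit in memory. The function name and the way the three bounds are combined are assumptions made for illustration, not the logic of get_optimal_train_size.

```python
MAX_POINTS_PER_CENTROID = 256  # Faiss k-means default
BYTES_PER_FLOAT32 = 4


def suggested_train_size(nb_vectors: int, nb_clusters: int, vec_dim: int, memory_bytes: int) -> int:
    """Hypothetical helper: training points bounded by data, k-means needs and RAM."""
    wanted = nb_clusters * MAX_POINTS_PER_CENTROID
    fits_in_memory = memory_bytes // (vec_dim * BYTES_PER_FLOAT32)
    return max(1, min(nb_vectors, wanted, fits_in_memory))


train_size = suggested_train_size(
    nb_vectors=10_000_000, nb_clusters=4096, vec_dim=128, memory_bytes=2**30
)
```

With these inputs, the k-means bound (4096 * 256 = 1,048,576 points) is the tightest of the three, so it determines the result.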

autofaiss.external.optimize.index_key_to_nb_cluster(index_key)[source]

Function that takes an index key and returns the number of clusters

Return type

int
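
To illustrate what extracting the cluster count from an index key involves, the sketch below pulls the number that follows IVF out of a factory-style string. The helper name and the choice to return None for keys without an IVF component are assumptions of this sketch; the behaviour of index_key_to_nb_cluster on such keys is not documented here.

```python
import re
from typing import Optional


def nb_clusters_from_key(index_key: str) -> Optional[int]:
    """Hypothetical helper: read the cluster count from an 'IVF<n>' component."""
    match = re.search(r"IVF(\d+)", index_key)
    return int(match.group(1)) if match else None


with_ivf = nb_clusters_from_key("OPQ64_128,IVF4096_HNSW32,PQ64")  # 4096
without_ivf = nb_clusters_from_key("HNSW32")                      # None
```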

autofaiss.external.optimize.optimize_and_measure_index(embedding_reader, index, index_infos_path, index_key, index_param, index_path, max_index_query_time_ms, save_on_disk, use_gpu)[source]

Optimize one index by selecting the best hyperparameters and calculate its metrics

autofaiss.external.quantize module

Main file to create an index from the beginning.

autofaiss.external.quantize.build_index(embeddings, index_path='knn.index', index_infos_path='index_infos.json', ids_path=None, save_on_disk=True, file_format='npy', embedding_column_name='embedding', id_columns=None, index_key=None, index_param=None, max_index_query_time_ms=10.0, max_index_memory_usage='16G', current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, should_be_memory_mappable=False, distributed=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', verbose=20, nb_indices_to_keep=1)[source]

Reads embeddings and creates a quantized index from them. The index is stored on the current machine at the given output path.

Parameters
  • embeddings (Union[str, np.ndarray, List[str]]) – Local path containing all preprocessed vectors and cached files; this can be a single directory or multiple directories. Files will be added if empty. Alternatively, the NumPy array of embeddings can be passed directly.

  • index_path (Optional(str)) – Destination path of the quantized model.

  • index_infos_path (Optional(str)) – Destination path of the metadata file.

  • ids_path (Optional(str)) – Only useful when id_columns is not None and file_format=`parquet`. This will be the path (in any filesystem) where the files mapping IDs to vector indices will be stored in parquet format.

  • save_on_disk (bool) – Whether to save the index on disk, default to True.

  • file_format (Optional(str)) – npy or parquet; default npy.

  • embedding_column_name (Optional(str)) – Embeddings column name for parquet; default embedding.

  • id_columns (Optional(List[str])) – Can only be used when file_format=`parquet`. In this case these are the names of the columns containing the IDs of the vectors, and separate files will be generated to map these IDs to indices in the KNN index; default None.

  • index_key (Optional(str)) – Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.

  • index_param (Optional(str)) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.

  • max_index_query_time_ms (float) – Bound on the query time for KNN search; this bound is approximate.

  • max_index_memory_usage (str) – Maximum size allowed for the index; this bound is strict.

  • current_memory_available (str) – Memory available on the machine creating the index. More memory helps because it reduces swapping between RAM and disk.

  • use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.

  • metric_type (str) –

    Similarity function used for query:
    • “ip” for inner product

    • “l2” for Euclidean distance

  • nb_cores (Optional[int]) – Number of cores to use. Will try to guess the right number if not provided

  • make_direct_map (bool) – Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase the RAM usage (approximately 8 GB for 1 billion embeddings).

  • should_be_memory_mappable (bool) – If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50 GB indices on a machine with only 1 GB of RAM. Defaults to False.

  • distributed (Optional[str]) – If “pyspark”, create the indices using pyspark. Only “parquet” file format is supported.

  • temporary_indices_folder (str) – Folder to save the temporary small indices that are generated by each spark executor. Only used when distributed = “pyspark”.

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

  • nb_indices_to_keep (int) –

    Number of indices to keep at most when distributed is “pyspark”. It allows you to build an index larger than current_memory_available. If it is not equal to 1:

    • You are expected to have at most nb_indices_to_keep indices with the following names:

      “{index_path}i” where i ranges from 1 to nb_indices_to_keep

    • build_index returns a mapping from index path to metrics

    Defaults to 1.

Return type

Tuple[Optional[Any], Optional[Dict[str, str]]]
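
A minimal usage sketch based on the signature above: it builds an index from an in-memory array without writing anything to disk. The memory limits and array shape are arbitrary values chosen for a small experiment, and `build_demo_index` is a hypothetical wrapper; the imports are done lazily inside it so that the keyword arguments can be inspected without autofaiss or NumPy installed.

```python
# Keyword arguments taken from the documented signature; the values are
# illustrative choices for a small in-memory experiment.
BUILD_KWARGS = {
    "save_on_disk": False,
    "metric_type": "ip",
    "max_index_query_time_ms": 10.0,
    "max_index_memory_usage": "1G",
    "current_memory_available": "2G",
}


def build_demo_index(nb_vectors: int = 1000, dim: int = 64):
    """Hypothetical wrapper: build an index from random float32 embeddings."""
    import numpy as np
    from autofaiss.external.quantize import build_index

    embeddings = np.random.rand(nb_vectors, dim).astype("float32")
    # With nb_indices_to_keep left at 1, the call returns a pair:
    # the index (or None) and a dictionary of index information.
    index, index_infos = build_index(embeddings, **BUILD_KWARGS)
    return index, index_infos
```

Calling `build_demo_index()` requires autofaiss and NumPy to be installed. Leaving index_key and index_param unset lets the heuristics described above pick them.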

autofaiss.external.quantize.main()[source]

Main entry point

autofaiss.external.quantize.score_index(index_path, embeddings, save_on_disk=True, output_index_info_path='infos.json', current_memory_available='32G', verbose=20)[source]

Computes metrics on a given index and uses cached ground truth to speed up scoring on subsequent runs.

Parameters
  • index_path (Union[str, Any]) – Path to the .index file, or an in-memory index.

  • embeddings (Union[str, np.ndarray]) – Path containing all preprocessed vectors and cached files. Can also be an in-memory array.

  • save_on_disk (bool) – Whether to save on disk

  • output_index_info_path (str) – Path to index infos .json

  • current_memory_available (str) – Memory available on the current machine. More memory helps because it reduces swapping between RAM and disk.

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

Return type

Optional[Dict[str, Union[str, float, int]]]

autofaiss.external.quantize.setup_logging(logging_level)[source]

Setup the logging.

autofaiss.external.quantize.tune_index(index_path, index_key, index_param=None, output_index_path=None, save_on_disk=True, max_index_query_time_ms=10.0, use_gpu=False, verbose=20)[source]

Set hyperparameters to the given index.

If an index_param is given, these hyperparameters are set on the index; otherwise a greedy heuristic is used to make the best of the max_index_query_time_ms constraint.

Parameters
  • index_path (Union[str, Any]) – Path to the .index file. Can also be an index.

  • index_key (str) – String to give to the index factory in order to create the index.

  • index_param (Optional(str)) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.

  • output_index_path (str) – Path to the newly created .index file

  • save_on_disk (bool) – Whether to save the index on disk, default to True.

  • max_index_query_time_ms (float) – Query speed constraint for the index to create.

  • use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

Returns

The faiss index

Return type

index
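
The index_param string can be assembled from individual hyperparameters. The sketch below assumes the comma-separated name=value convention used by the Faiss parameter space (for example nprobe, efSearch and ht, which match the TunableParam members listed earlier); `format_index_param` is a hypothetical helper, not part of autofaiss.

```python
from typing import Mapping


def format_index_param(params: Mapping[str, int]) -> str:
    """Hypothetical helper: join hyperparameters into a 'name=value,...' string."""
    return ",".join(f"{name}={value}" for name, value in params.items())


index_param = format_index_param({"nprobe": 16, "efSearch": 32, "ht": 2048})
# index_param == "nprobe=16,efSearch=32,ht=2048"
```

A string built this way could then be passed as the index_param argument of tune_index, in which case the heuristic search is skipped.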

autofaiss.external.scores module

Functions to compute metrics on an index

autofaiss.external.scores.compute_fast_metrics(embedding_reader, index, omp_threads=None, query_max=1000)[source]

Computes the query speed, size and reconstruction of an index.

Return type

Dict

autofaiss.external.scores.compute_medium_metrics(embedding_reader, index, memory_available, ground_truth=None, eval_item_ids=None)[source]

Compute recall@R and intersection recall@R of an index

Return type

Dict[str, float]
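
The two metrics can be illustrated with plain lists of neighbour ids. The definitions below follow a common convention and are an assumption of this sketch: recall@R counts the queries whose true nearest neighbour appears among the first R results, and intersection recall@R measures the overlap between the true and returned top-R sets. The exact formulas used by compute_medium_metrics are not spelled out in this reference.

```python
from typing import List, Sequence


def recall_at_r(ground_truth: Sequence[List[int]], results: Sequence[List[int]], r: int) -> float:
    """Fraction of queries whose true nearest neighbour is in the top-r results."""
    hits = sum(1 for truth, found in zip(ground_truth, results) if truth[0] in found[:r])
    return hits / len(ground_truth)


def intersection_recall_at_r(ground_truth: Sequence[List[int]], results: Sequence[List[int]], r: int) -> float:
    """Average share of the true top-r neighbours found in the returned top-r."""
    overlap = sum(len(set(t[:r]) & set(f[:r])) for t, f in zip(ground_truth, results))
    return overlap / (r * len(ground_truth))


# Two queries, each with its true and its returned neighbour ids (best first).
ground_truth = [[1, 2, 3], [4, 5, 6]]
results = [[2, 1, 9], [7, 5, 4]]

r2 = recall_at_r(ground_truth, results, 2)                # 0.5
i2 = intersection_recall_at_r(ground_truth, results, 2)   # 0.75
```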

autofaiss.external.scores.get_ground_truth(faiss_metric_type, embedding_reader, query_embeddings, memory_available)[source]

Computes the ground truth (the result a perfect index would return) of the query on the embeddings.

Module contents