Autofaiss getting started

Information

This demo notebook automatically creates a Faiss KNN index with optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recall given memory and query speed constraints.

Github: https://github.com/criteo/autofaiss

Parameters

[1]:
#@title Index parameters

max_index_query_time_ms = 10 #@param {type: "number"}
max_index_memory_usage = "10MB" #@param
metric_type = "l2" #@param ['ip', 'l2']

Embeddings creation (add your own embeddings here)

[2]:
import numpy as np

# Create embeddings
embeddings = np.float32(np.random.rand(4000, 100))
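If you later choose metric_type = "ip" to rank by cosine similarity, keep in mind that inner product only equals cosine similarity on unit-norm vectors. A minimal sketch of L2-normalizing the rows (the variable names here are illustrative, not part of the autofaiss API):

```python
import numpy as np

# Hypothetical embeddings; replace with your own
embeddings = np.float32(np.random.rand(4000, 100))

# Divide each row by its L2 norm so that inner product == cosine similarity
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms

# Every row of `normalized` now has unit norm
```

For the "l2" metric used in this notebook, normalization is not required.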

Save your embeddings on the disk

[3]:
# Create a new folder
import os
import shutil
embeddings_dir = "embeddings_folder"
if os.path.exists(embeddings_dir):
  shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)

# Save your embeddings
# You can split your embeddings into several parts if they are too big
# The data will be read in the lexicographical order of the filenames
np.save(f"{embeddings_dir}/part1.npy", embeddings[:2000])
np.save(f"{embeddings_dir}/part2.npy", embeddings[2000:])
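Since the parts are read in lexicographic order of the filenames, note that lexicographic order is not numeric order: with ten or more parts, "part10.npy" sorts before "part2.npy". Zero-padding the part numbers keeps the files in the intended order, as this quick illustration shows:

```python
# Unpadded names: "part10" sorts before "part2" lexicographically
print(sorted(["part1.npy", "part2.npy", "part10.npy"]))
# -> ['part1.npy', 'part10.npy', 'part2.npy']

# Zero-padded names sort in the intended numeric order
print(sorted(["part01.npy", "part02.npy", "part10.npy"]))
# -> ['part01.npy', 'part02.npy', 'part10.npy']
```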

Build the KNN index with Autofaiss

[4]:
os.makedirs("my_index_folder", exist_ok=True)
[5]:
# Install autofaiss
!pip install autofaiss &> /dev/null

# Build a KNN index
!autofaiss build_index --embeddings={embeddings_dir} \
                       --index_path="knn.index" \
                       --index_infos_path="infos.json" \
                       --metric_type={metric_type} \
                       --max_index_query_time_ms={max_index_query_time_ms} \
                       --max_index_memory_usage={max_index_memory_usage}
Launching the whole pipeline 08/02/2021, 13:25:58
        Compute estimated construction time of the index 08/02/2021, 13:25:58
                -> Train: 16.7 minutes
                -> Add: 0.0 seconds
                Total: 16.7 minutes
        >>> Finished "Compute estimated construction time of the index" in 0.0001 secs
        Checking that your have enough memory available to create the index 08/02/2021, 13:25:58
        >>> Finished "Checking that your have enough memory available to create the index" in 0.0006 secs
        Selecting most promising index types given data characteristics 08/02/2021, 13:25:58
        >>> Finished "Selecting most promising index types given data characteristics" in 0.0012 secs
        Creating the index 08/02/2021, 13:25:58
                -> Instanciate the index HNSW32 08/02/2021, 13:25:58
                >>> Finished "-> Instanciate the index HNSW32" in 0.0013 secs
                -> Extract training vectors 08/02/2021, 13:25:58
100% 2/2 [00:00<00:00, 1055.97it/s]
                >>> Finished "-> Extract training vectors" in 0.0138 secs
                -> Training the index with 4000 vectors of dim 100 08/02/2021, 13:25:58
                >>> Finished "-> Training the index with 4000 vectors of dim 100" in 0.0001 secs
                -> Adding the vectors to the index 08/02/2021, 13:25:58
100% 2/2 [00:00<00:00,  4.91it/s]
                >>> Finished "-> Adding the vectors to the index" in 1.7210 secs
        >>> Finished "Creating the index" in 1.7372 secs
        Computing best hyperparameters 08/02/2021, 13:26:00
        >>> Finished "Computing best hyperparameters" in 1.6057 secs
The best hyperparameters are: efSearch=1319
        Saving the index on local disk 08/02/2021, 13:26:01
        >>> Finished "Saving the index on local disk" in 0.0027 secs
        Compute fast metrics 08/02/2021, 13:26:01
2000
        >>> Finished "Compute fast metrics" in 9.8355 secs
Recap:
{'99p_search_speed_ms': 7.556187009999177,
 'avg_search_speed_ms': 4.902101082999792,
 'compression ratio': 0.5956986092671344,
 'nb vectors': 4000,
 'reconstruction error %': 0.0,
 'size in bytes': 2685922,
 'vectors dimension': 100}
>>> Finished "Launching the whole pipeline" in 13.1962 secs
Done

Load the index and play with it

[6]:
import faiss
import glob
import numpy as np

my_index = faiss.read_index("knn.index")

query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)

print(f"Top {k} nearest neighbors for {metric_type} search:")
for i, (dist, index) in enumerate(zip(distances[0], indices[0])):
  print(f"{i+1}: Vector number {index:4} with distance {dist}")
Top 5 nearest neighbors for l2 search:
1: Vector number 2933 with distance 10.404068946838379
2: Vector number  168 with distance 10.53512191772461
3: Vector number 2475 with distance 10.688979148864746
4: Vector number 2525 with distance 10.713528633117676
5: Vector number 3463 with distance 10.774477005004883
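Since HNSW is an approximate method, you can sanity-check its results against an exact brute-force search. A numpy-only sketch of exact squared-L2 nearest neighbors (the variable names are illustrative, and using a stored vector as the query makes the expected top result obvious):

```python
import numpy as np

embeddings = np.float32(np.random.rand(4000, 100))
query = embeddings[42:43]  # use a known stored vector as the query

# Squared L2 distance from the query to every stored vector
d2 = ((embeddings - query) ** 2).sum(axis=1)

k = 5
top_k = np.argsort(d2)[:k]  # indices of the k closest vectors
# top_k[0] is 42: the query's own vector comes first, at distance 0
```

Comparing `top_k` with the indices returned by the Faiss index gives a quick recall check on small datasets.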

(Bonus) Python version of the CLI

[7]:
from autofaiss import build_index

build_index(embeddings="embeddings_folder",
            index_path="knn.index",
            index_infos_path="infos.json",
            max_index_query_time_ms=max_index_query_time_ms,
            max_index_memory_usage=max_index_memory_usage,
            metric_type=metric_type)
Launching the whole pipeline 08/02/2021, 13:26:11
        Compute estimated construction time of the index 08/02/2021, 13:26:11
                -> Train: 16.7 minutes
                -> Add: 0.0 seconds
                Total: 16.7 minutes
        >>> Finished "Compute estimated construction time of the index" in 0.0007 secs
        Checking that your have enough memory available to create the index 08/02/2021, 13:26:11
        >>> Finished "Checking that your have enough memory available to create the index" in 0.0012 secs
        Selecting most promising index types given data characteristics 08/02/2021, 13:26:11
        >>> Finished "Selecting most promising index types given data characteristics" in 0.0043 secs
        Creating the index 08/02/2021, 13:26:11
                -> Instanciate the index HNSW32 08/02/2021, 13:26:11
                >>> Finished "-> Instanciate the index HNSW32" in 0.0021 secs
                -> Extract training vectors 08/02/2021, 13:26:11
100%|██████████| 2/2 [00:00<00:00, 421.77it/s]
                >>> Finished "-> Extract training vectors" in 0.0238 secs
                -> Training the index with 4000 vectors of dim 100 08/02/2021, 13:26:11
                >>> Finished "-> Training the index with 4000 vectors of dim 100" in 0.0000 secs
                -> Adding the vectors to the index 08/02/2021, 13:26:11

100%|██████████| 2/2 [00:00<00:00,  4.55it/s]
                >>> Finished "-> Adding the vectors to the index" in 1.7814 secs
        >>> Finished "Creating the index" in 1.8182 secs
        Computing best hyperparameters 08/02/2021, 13:26:13
        >>> Finished "Computing best hyperparameters" in 3.2071 secs
The best hyperparameters are: efSearch=2077
        Saving the index on local disk 08/02/2021, 13:26:16
        >>> Finished "Saving the index on local disk" in 0.0064 secs
        Compute fast metrics 08/02/2021, 13:26:16
1025
        >>> Finished "Compute fast metrics" in 10.0180 secs
Recap:
{'99p_search_speed_ms': 13.157404919996907,
 'avg_search_speed_ms': 9.750819220487383,
 'compression ratio': 0.5956986092671344,
 'nb vectors': 4000,
 'reconstruction error %': 0.0,
 'size in bytes': 2685922,
 'vectors dimension': 100}
>>> Finished "Launching the whole pipeline" in 15.0867 secs
[7]:
'Done'