Autofaiss getting started
Information
This demo notebook uses Autofaiss to automatically create a Faiss KNN index with well-tuned similarity search parameters.
It selects the indexing parameters that achieve the highest recall under the given memory and query speed constraints.
Parameters
[1]:
#@title Index parameters
max_index_query_time_ms = 10 #@param {type: "number"}
max_index_memory_usage = "10MB" #@param
metric_type = "l2" #@param ['ip', 'l2']
Embeddings creation (add your own embeddings here)
[2]:
import numpy as np
# Create embeddings
embeddings = np.float32(np.random.rand(4000, 100))
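If you replace the random vectors with your own embeddings and switch `metric_type` to `"ip"`, you may want inner product scores to behave like cosine similarity. A minimal sketch of that preprocessing step follows, assuming your embeddings fit in memory as a 2-D array. The helper name `l2_normalize` is hypothetical and is not part of Autofaiss.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return a float32 copy of `vectors` with every row scaled to unit L2 norm."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return vectors / np.maximum(norms, eps)


# Stand-in data with the same shape and dtype as the notebook's embeddings
rng = np.random.default_rng(0)
demo_embeddings = rng.random((4000, 100), dtype=np.float32)
normalized = l2_normalize(demo_embeddings)

# On unit-norm rows, the inner product of two vectors equals their cosine similarity
a, b = demo_embeddings[0], demo_embeddings[1]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
inner_product = float(normalized[0] @ normalized[1])
```

With the `"l2"` metric used in this notebook, this step is optional.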
Save your embeddings to disk
[3]:
# Create a new, empty folder (any previous content is deleted)
import os
import shutil
embeddings_dir = "embeddings_folder"
if os.path.exists(embeddings_dir):
    shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)
# Save your embeddings
# You can split your embeddings into several parts if they are too big
# The data will be read in the lexicographical order of the filenames
np.save(f"{embeddings_dir}/part1.npy", embeddings[:2000])
np.save(f"{embeddings_dir}/part2.npy", embeddings[2000:])
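Because the parts are read in lexicographical filename order, the vector numbers returned by a search only line up with your original array if the filenames sort the way you expect. The sketch below reloads parts in sorted order and checks that the rows come back in the original order. It uses a temporary folder and a small stand-in array rather than the notebook's data, and `load_parts_in_filename_order` is a hypothetical helper, not an Autofaiss function.

```python
import os
import tempfile

import numpy as np


def load_parts_in_filename_order(folder: str) -> np.ndarray:
    """Concatenate every .npy file in `folder`, sorted lexicographically by name."""
    filenames = sorted(f for f in os.listdir(folder) if f.endswith(".npy"))
    return np.concatenate([np.load(os.path.join(folder, f)) for f in filenames])


# Small stand-in array so that row order is easy to verify
demo = np.arange(12, dtype=np.float32).reshape(6, 2)
with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "part1.npy"), demo[:3])
    np.save(os.path.join(tmp, "part2.npy"), demo[3:])
    reloaded = load_parts_in_filename_order(tmp)

order_preserved = np.array_equal(reloaded, demo)

# Lexicographic order is not numeric order: "part10" sorts before "part2"
unpadded = sorted(["part1.npy", "part2.npy", "part10.npy"])
# Zero-padding the part number makes both orders agree
padded = sorted(f"part{i:03d}.npy" for i in (1, 2, 10))
```

With only two parts the names `part1.npy` and `part2.npy` are safe. Zero-padded names such as `part001.npy` avoid surprises once there are ten or more parts.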
Build the KNN index with Autofaiss
[4]:
os.makedirs("my_index_folder", exist_ok=True)
[5]:
# Install autofaiss
!pip install autofaiss &> /dev/null
# Build a KNN index
!autofaiss build_index --embeddings={embeddings_dir} \
--index_path="knn.index" \
--index_infos_path="infos.json" \
--metric_type={metric_type} \
--max_index_query_time_ms=5 \
--max_index_memory_usage={max_index_memory_usage}
Launching the whole pipeline 08/02/2021, 13:25:58
Compute estimated construction time of the index 08/02/2021, 13:25:58
-> Train: 16.7 minutes
-> Add: 0.0 seconds
Total: 16.7 minutes
>>> Finished "Compute estimated construction time of the index" in 0.0001 secs
Checking that your have enough memory available to create the index 08/02/2021, 13:25:58
>>> Finished "Checking that your have enough memory available to create the index" in 0.0006 secs
Selecting most promising index types given data characteristics 08/02/2021, 13:25:58
>>> Finished "Selecting most promising index types given data characteristics" in 0.0012 secs
Creating the index 08/02/2021, 13:25:58
-> Instanciate the index HNSW32 08/02/2021, 13:25:58
>>> Finished "-> Instanciate the index HNSW32" in 0.0013 secs
-> Extract training vectors 08/02/2021, 13:25:58
100% 2/2 [00:00<00:00, 1055.97it/s]
>>> Finished "-> Extract training vectors" in 0.0138 secs
-> Training the index with 4000 vectors of dim 100 08/02/2021, 13:25:58
>>> Finished "-> Training the index with 4000 vectors of dim 100" in 0.0001 secs
-> Adding the vectors to the index 08/02/2021, 13:25:58
100% 2/2 [00:00<00:00, 4.91it/s]
>>> Finished "-> Adding the vectors to the index" in 1.7210 secs
>>> Finished "Creating the index" in 1.7372 secs
Computing best hyperparameters 08/02/2021, 13:26:00
>>> Finished "Computing best hyperparameters" in 1.6057 secs
The best hyperparameters are: efSearch=1319
Saving the index on local disk 08/02/2021, 13:26:01
>>> Finished "Saving the index on local disk" in 0.0027 secs
Compute fast metrics 08/02/2021, 13:26:01
2000
>>> Finished "Compute fast metrics" in 9.8355 secs
Recap:
{'99p_search_speed_ms': 7.556187009999177,
'avg_search_speed_ms': 4.902101082999792,
'compression ratio': 0.5956986092671344,
'nb vectors': 4000,
'reconstruction error %': 0.0,
'size in bytes': 2685922,
'vectors dimension': 100}
>>> Finished "Launching the whole pipeline" in 13.1962 secs
Done
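The `compression ratio` in the recap can be reproduced from the other reported values. The arithmetic below assumes the ratio is the raw float32 size of the embeddings divided by the index size on disk. That definition is inferred from the numbers in this log, not taken from the Autofaiss documentation.

```python
# Values copied from the recap above
nb_vectors = 4000
dim = 100
index_size_bytes = 2685922

# Raw size of the embeddings: 4 bytes per float32 component
bytes_per_float32 = 4
raw_size_bytes = nb_vectors * dim * bytes_per_float32

# Assumed definition: raw embedding size divided by index size
compression_ratio = raw_size_bytes / index_size_bytes
print(raw_size_bytes, round(compression_ratio, 4))  # prints: 1600000 0.5957
```

A ratio below 1 means the index is larger than the raw vectors. That is consistent with an index that keeps the full vectors (hence the reported reconstruction error of 0.0) and stores the neighbor graph on top of them.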
Load the index and play with it
[6]:
import faiss
import numpy as np
# Load the index written by the build step
my_index = faiss.read_index("knn.index")
# Query with one random vector of the same dimension as the indexed embeddings
query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)
print(f"Top {k} elements in the dataset for L2 nearest neighbor search:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"{i+1}: Vector number {idx:4} with distance {dist}")
Top 5 elements in the dataset for L2 nearest neighbor search:
1: Vector number 2933 with distance 10.404068946838379
2: Vector number 168 with distance 10.53512191772461
3: Vector number 2475 with distance 10.688979148864746
4: Vector number 2525 with distance 10.713528633117676
5: Vector number 3463 with distance 10.774477005004883
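The index returns approximate neighbors, so it can be useful to compare its answers with an exact brute-force search on a few queries. The sketch below is NumPy only and uses stand-in random data. The helpers `exact_l2_topk` and `recall_at_k` are hypothetical names, not Faiss or Autofaiss functions. It ranks by squared L2 distance, which is what Faiss reports for L2 indices, so the values are comparable to the distances printed above.

```python
import numpy as np


def exact_l2_topk(dataset: np.ndarray, query: np.ndarray, k: int):
    """Brute-force search: squared L2 distances and ids of the k closest rows."""
    diffs = dataset - query  # broadcasts a (dim,) or (1, dim) query over all rows
    squared_distances = np.einsum("ij,ij->i", diffs, diffs)
    top_ids = np.argsort(squared_distances, kind="stable")[:k]
    return squared_distances[top_ids], top_ids


def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact neighbors that also appear in the approximate result."""
    approx = set(np.asarray(approx_ids).tolist())
    exact = set(np.asarray(exact_ids).tolist())
    return len(approx & exact) / len(exact)


# Stand-in data with the same shape as the notebook's embeddings
rng = np.random.default_rng(0)
dataset = rng.random((4000, 100), dtype=np.float32)

# Using a dataset row as the query: its nearest neighbor is itself, at distance 0
exact_distances, exact_ids = exact_l2_topk(dataset, dataset[42], k=5)
```

In the notebook, you could pass `indices[0]` from `my_index.search` as `approx_ids` and the ids returned by `exact_l2_topk(embeddings, query_vector, k)` as `exact_ids`.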
(Bonus) Python version of the CLI
[7]:
from autofaiss import build_index
build_index(embeddings="embeddings_folder",
            index_path="knn.index",
            index_infos_path="infos.json",
            max_index_query_time_ms=max_index_query_time_ms,
            max_index_memory_usage=max_index_memory_usage,
            metric_type=metric_type)
Launching the whole pipeline 08/02/2021, 13:26:11
Compute estimated construction time of the index 08/02/2021, 13:26:11
-> Train: 16.7 minutes
-> Add: 0.0 seconds
Total: 16.7 minutes
>>> Finished "Compute estimated construction time of the index" in 0.0007 secs
Checking that your have enough memory available to create the index 08/02/2021, 13:26:11
>>> Finished "Checking that your have enough memory available to create the index" in 0.0012 secs
Selecting most promising index types given data characteristics 08/02/2021, 13:26:11
>>> Finished "Selecting most promising index types given data characteristics" in 0.0043 secs
Creating the index 08/02/2021, 13:26:11
-> Instanciate the index HNSW32 08/02/2021, 13:26:11
>>> Finished "-> Instanciate the index HNSW32" in 0.0021 secs
-> Extract training vectors 08/02/2021, 13:26:11
100%|██████████| 2/2 [00:00<00:00, 421.77it/s]
>>> Finished "-> Extract training vectors" in 0.0238 secs
-> Training the index with 4000 vectors of dim 100 08/02/2021, 13:26:11
>>> Finished "-> Training the index with 4000 vectors of dim 100" in 0.0000 secs
-> Adding the vectors to the index 08/02/2021, 13:26:11
100%|██████████| 2/2 [00:00<00:00, 4.55it/s]
>>> Finished "-> Adding the vectors to the index" in 1.7814 secs
>>> Finished "Creating the index" in 1.8182 secs
Computing best hyperparameters 08/02/2021, 13:26:13
>>> Finished "Computing best hyperparameters" in 3.2071 secs
The best hyperparameters are: efSearch=2077
Saving the index on local disk 08/02/2021, 13:26:16
>>> Finished "Saving the index on local disk" in 0.0064 secs
Compute fast metrics 08/02/2021, 13:26:16
1025
>>> Finished "Compute fast metrics" in 10.0180 secs
Recap:
{'99p_search_speed_ms': 13.157404919996907,
'avg_search_speed_ms': 9.750819220487383,
'compression ratio': 0.5956986092671344,
'nb vectors': 4000,
'reconstruction error %': 0.0,
'size in bytes': 2685922,
'vectors dimension': 100}
>>> Finished "Launching the whole pipeline" in 15.0867 secs
[7]:
'Done'