Create an Archil disk
First, follow the Archil Getting Started Guide to create an Archil disk that you want to use for storing your FAISS indices and embeddings.
# Mount your Archil disk
sudo mkdir -p /mnt/archil
sudo archil mount <disk-name> /mnt/archil --region aws-us-east-1 --auth-token TOKEN --shared
# Create the directory for FAISS data
sudo mkdir -p /mnt/archil/faiss
sudo chown -R $USER:$USER /mnt/archil/faiss
The --shared flag enables multiple servers to access the same vector indices simultaneously.
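Because the disk is mounted with --shared, additional servers can mount the same disk and see the same indices. A minimal sketch for a second server, assuming the same disk name, region, and token:
# On a second server
sudo mkdir -p /mnt/archil
sudo archil mount <disk-name> /mnt/archil --region aws-us-east-1 --auth-token TOKEN --shared
ls /mnt/archil/faiss  # the same FAISS directory is visible here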
Install Dependencies
Create a Python virtual environment and install FAISS with dependencies:
cd /mnt/archil
mkdir -p faiss
cd faiss
python -m venv faiss-env
source faiss-env/bin/activate
# Install FAISS and related packages
pip install faiss-cpu # or faiss-gpu for GPU support
pip install numpy sentence-transformers
pip install datasets pandas scikit-learn
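A quick sanity check that the install worked (assuming a recent faiss-cpu wheel, which exposes faiss.__version__):
# Verify the installation
python -c "import faiss; print(faiss.__version__)"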
Set Up Directory Structure
Create directories for organizing your vector search data:
mkdir -p /mnt/archil/faiss/{indices,embeddings,data,models}
Understanding FAISS Index Types
FAISS offers several index types, each optimized for different use cases. Choosing the right index depends on your dataset size, memory constraints, accuracy requirements, and search speed needs. The sections below cover flat, IVF, HNSW, and compressed indexes, then close with a decision guide and further resources.
Flat Indexes
IndexFlatL2 and IndexFlatIP provide exact search results by comparing the query against every vector in the database.
When to use: Small to medium datasets (< 1M vectors), when you need 100% accuracy.
Characteristics:
- Memory usage: 4 bytes × dimension × number of vectors
- Search time: Linear with dataset size
- Accuracy: Perfect (exact results)
- Training required: No
Pros: Perfect accuracy, simple to use, no training required
Cons: Slow for large datasets, high memory usage
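As a quick back-of-envelope check of that memory formula, take a hypothetical corpus of 1M vectors at 384 dimensions (the output size of the all-MiniLM-L6-v2 model used later in this guide):
# Flat float32 index: 4 bytes × dimension × number of vectors
n, dim = 1_000_000, 384
print(f"{4 * dim * n / 2**30:.2f} GiB")  # ~1.43 GiB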
# Best for: High accuracy requirements, smaller datasets
index = faiss.IndexFlatL2(dimension) # L2 distance
# or
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine similarity on normalized vectors)
IVF Indexes
IndexIVFFlat uses k-means clustering to partition the vector space, then searches only the most relevant partitions.
When to use: Medium to large datasets (1M-100M vectors), when you can trade some accuracy for speed.
Characteristics:
- Memory usage: 4 bytes × dimension + 8 bytes per vector
- Search time: Depends on the nprobe parameter (number of clusters to search)
- Accuracy: High (90-99%, depending on nprobe)
- Training required: Yes (needs a representative sample to learn clusters)
# Best for: Balanced speed/accuracy, most common choice for large datasets
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, 100)  # nlist = 100 clusters
index.nprobe = 10 # Search 10 clusters (higher = more accurate but slower)
Rule of thumb for nlist: use sqrt(n) clusters, where n is your dataset size. For 1M vectors, use ~1000 clusters (see the sketch below).
Pros: Good speed/accuracy balance, scalable to large datasets
Cons: Requires training, parameter tuning needed
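A minimal sketch applying that rule of thumb; the corpus size and embedding dimension here are hypothetical placeholders:
import math
import faiss

n, dimension = 1_000_000, 384        # hypothetical corpus size and embedding dimension
nlist = int(math.sqrt(n))            # ~1000 clusters for 1M vectors
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index.nprobe = 10                    # starting point; raise for better recall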
HNSW Indexes
IndexHNSWFlat builds a graph structure that enables very fast approximate search with high accuracy.
When to use: When you need the fastest possible search with high accuracy.
Characteristics:
- Memory usage: 4 bytes × dimension + additional graph overhead (~50-100 bytes per vector)
- Search time: Logarithmic with dataset size
- Accuracy: Very high (95-99%)
- Training required: No
# Best for: Real-time applications requiring fast search
index = faiss.IndexHNSWFlat(dimension, 32)  # M = 32 connections per node
index.hnsw.efConstruction = 200 # Build-time search depth
index.hnsw.efSearch = 100 # Query-time search depth
HNSW Parameters:
- M: Number of graph connections (16-64; higher = more accurate but more memory)
- efConstruction: Build-time exploration depth (100-800)
- efSearch: Query-time exploration depth (adjust for the speed/accuracy trade-off)
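To see the efSearch trade-off concretely, here is a small sketch, on random data with hypothetical sizes, that measures recall@1 against an exact flat index:
import numpy as np
import faiss

dim, n, nq = 64, 50_000, 100          # hypothetical sizes, kept small for a quick run
rng = np.random.default_rng(0)
xb = rng.random((n, dim), dtype=np.float32)
xq = rng.random((nq, dim), dtype=np.float32)

exact = faiss.IndexFlatL2(dim)        # exact search provides the ground truth
exact.add(xb)
_, gt = exact.search(xq, 1)

hnsw = faiss.IndexHNSWFlat(dim, 32)
hnsw.add(xb)
for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef           # can be changed at query time, no rebuild needed
    _, approx = hnsw.search(xq, 1)
    print(f"efSearch={ef}: recall@1={(approx == gt).mean():.3f}")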
Compressed Indexes
For very large datasets where memory is a constraint:
- IndexIVFPQ (Product Quantization): compresses vectors to ~8-32 bytes each
- IndexScalarQuantizer: reduces precision to 4-8 bits per dimension
When to use: Very large datasets (>10M vectors) with memory constraints.
Pros: Significant memory savings, still reasonably fast
Cons: Reduced accuracy, more complex setup
# Best for: Very large datasets with memory constraints
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, 1000, 8, 8)  # nlist=1000, m=8 subquantizers, 8 bits each
# Best for: Moderate compression with good accuracy
index = faiss.IndexScalarQuantizer(dimension, faiss.ScalarQuantizer.QT_8bit)
Decision Matrix
| Dataset Size | Memory Priority | Speed Priority | Accuracy Priority | Recommended Index |
|---|---|---|---|---|
| < 100K | Any | Any | Exact | IndexFlatL2 |
| 100K - 1M | Low | High | High | IndexHNSWFlat |
| 100K - 1M | High | Medium | High | IndexIVFFlat |
| 1M - 10M | Low | High | High | IndexHNSWFlat |
| 1M - 10M | High | Medium | Medium | IndexIVFPQ |
| > 10M | High | Medium | Medium | IndexIVFPQ |
| > 100M | Very High | Low | Low | IndexIVFPQ with aggressive compression |
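The same logic as a helper function; the thresholds simply mirror the table above and should be treated as a starting point, not a rule:
def recommend_index(n_vectors: int, memory_constrained: bool) -> str:
    """Rough recommendation following the decision matrix above."""
    if n_vectors < 100_000:
        return "IndexFlatL2"                 # exact search is affordable here
    if n_vectors <= 1_000_000:
        return "IndexIVFFlat" if memory_constrained else "IndexHNSWFlat"
    if n_vectors <= 10_000_000:
        return "IndexIVFPQ" if memory_constrained else "IndexHNSWFlat"
    return "IndexIVFPQ"                      # use aggressive compression beyond ~100M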
Performance Comparison
Here’s what you can expect for a 1M vector dataset (128D):
| Index Type | Build Time | Memory Usage | Search Time (1-NN) | Accuracy |
|---|---|---|---|---|
| IndexFlatL2 | 1s | 512MB | 50ms | 100% |
| IndexIVFFlat | 30s | 520MB | 2ms | 95-99% |
| IndexHNSWFlat | 120s | 600MB | 0.5ms | 95-99% |
| IndexIVFPQ | 45s | 80MB | 3ms | 85-95% |
Resources
For a deeper understanding of FAISS index types, see the official FAISS documentation and these research papers:
- Billion-scale similarity search with GPUs - Original FAISS paper
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs - HNSW algorithm
Build a Vector Search System
Create a Python script for your FAISS-based vector search:
# vector_search.py
import os
import numpy as np
import faiss
import pickle
from sentence_transformers import SentenceTransformer
from typing import List, Tuple, Dict
# Configure paths on Archil disk
INDICES_PATH = "/mnt/archil/faiss/indices"
EMBEDDINGS_PATH = "/mnt/archil/faiss/embeddings"
DATA_PATH = "/mnt/archil/faiss/data"
MODELS_PATH = "/mnt/archil/faiss/models"
class ArchilVectorSearch:
def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
# Initialize embedding model (cached on Archil disk)
self.model = SentenceTransformer(
model_name,
cache_folder=MODELS_PATH
)
self.dimension = self.model.get_sentence_embedding_dimension()
# Initialize FAISS index
self.index = None
self.documents = []
self.metadata = []
def create_index(self, index_type="flat"):
"""Create a new FAISS index"""
if index_type == "flat":
# Exact search (L2 distance)
self.index = faiss.IndexFlatL2(self.dimension)
elif index_type == "ivf":
# Approximate search with IVF (Inverted File)
quantizer = faiss.IndexFlatL2(self.dimension)
self.index = faiss.IndexIVFFlat(quantizer, self.dimension, 100)
elif index_type == "hnsw":
# Hierarchical Navigable Small World
self.index = faiss.IndexHNSWFlat(self.dimension, 32)
else:
raise ValueError(f"Unknown index type: {index_type}")
print(f"Created {index_type} index with dimension {self.dimension}")
def add_documents(self, documents: List[str], metadata: List[Dict] = None):
"""Add documents to the vector index"""
print(f"Encoding {len(documents)} documents...")
# Generate embeddings
embeddings = self.model.encode(documents, show_progress_bar=True)
embeddings = embeddings.astype('float32')
# Add to index
        # Train if the index type requires it (e.g. IVF); flat and HNSW skip this
        if not self.index.is_trained:
print("Training index...")
self.index.train(embeddings)
self.index.add(embeddings)
# Store documents and metadata
self.documents.extend(documents)
if metadata:
self.metadata.extend(metadata)
else:
            # Offset IDs so they stay unique across multiple add_documents calls
            start = len(self.documents) - len(documents)
            self.metadata.extend([{"id": start + i} for i in range(len(documents))])
print(f"Added {len(documents)} documents. Total: {self.index.ntotal}")
def search(self, query: str, k: int = 5) -> List[Tuple[str, float, Dict]]:
"""Search for similar documents"""
if self.index is None or self.index.ntotal == 0:
return []
# Encode query
query_embedding = self.model.encode([query]).astype('float32')
# Search
distances, indices = self.index.search(query_embedding, k)
# Format results
results = []
        for distance, idx in zip(distances[0], indices[0]):
            # FAISS pads results with -1 when fewer than k neighbors exist
            if idx != -1 and idx < len(self.documents):
results.append((
self.documents[idx],
float(distance),
self.metadata[idx]
))
return results
def save_index(self, index_name: str):
"""Save index and metadata to Archil disk"""
index_path = os.path.join(INDICES_PATH, f"{index_name}.index")
metadata_path = os.path.join(INDICES_PATH, f"{index_name}_metadata.pkl")
documents_path = os.path.join(INDICES_PATH, f"{index_name}_documents.pkl")
# Save FAISS index
faiss.write_index(self.index, index_path)
# Save metadata and documents
with open(metadata_path, 'wb') as f:
pickle.dump(self.metadata, f)
with open(documents_path, 'wb') as f:
pickle.dump(self.documents, f)
print(f"Index saved to {index_path}")
def load_index(self, index_name: str):
"""Load index and metadata from Archil disk"""
index_path = os.path.join(INDICES_PATH, f"{index_name}.index")
metadata_path = os.path.join(INDICES_PATH, f"{index_name}_metadata.pkl")
documents_path = os.path.join(INDICES_PATH, f"{index_name}_documents.pkl")
if not os.path.exists(index_path):
raise FileNotFoundError(f"Index not found: {index_path}")
# Load FAISS index
self.index = faiss.read_index(index_path)
# Load metadata and documents
with open(metadata_path, 'rb') as f:
self.metadata = pickle.load(f)
with open(documents_path, 'rb') as f:
self.documents = pickle.load(f)
print(f"Loaded index with {self.index.ntotal} vectors")
def create_sample_dataset():
"""Create a sample dataset for demonstration"""
documents = [
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks with multiple layers",
"Natural language processing helps computers understand text",
"Computer vision enables machines to interpret visual information",
"Reinforcement learning trains agents through rewards and penalties",
"Supervised learning uses labeled data for training",
"Unsupervised learning finds patterns in unlabeled data",
"Transfer learning adapts pre-trained models to new tasks",
"Feature engineering improves model performance",
"Cross-validation helps evaluate model generalization"
]
metadata = [
{"category": "AI", "topic": "ML Basics", "id": i}
for i in range(len(documents))
]
return documents, metadata
if __name__ == "__main__":
# Initialize vector search system
vs = ArchilVectorSearch()
# Create sample dataset
documents, metadata = create_sample_dataset()
# Create and populate index
vs.create_index("flat")
vs.add_documents(documents, metadata)
# Save to Archil disk
vs.save_index("ml_concepts")
# Demonstrate search
queries = [
"neural networks and deep learning",
"training with labeled examples",
"computer understanding of images"
]
for query in queries:
print(f"\nQuery: {query}")
results = vs.search(query, k=3)
for i, (doc, distance, meta) in enumerate(results, 1):
print(f"{i}. Distance: {distance:.4f}")
print(f" Document: {doc}")
print(f" Metadata: {meta}")
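Run the script from the activated virtual environment (this assumes you saved it in /mnt/archil/faiss); the index files land on the shared disk:
cd /mnt/archil/faiss
source faiss-env/bin/activate
python vector_search.py
ls indices/  # ml_concepts.index, ml_concepts_metadata.pkl, ml_concepts_documents.pkl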
Load Sample Data
Create a script to load and index a larger dataset:
# load_dataset.py
from datasets import load_dataset
from vector_search import ArchilVectorSearch
def load_wikipedia_data():
"""Load Wikipedia dataset for indexing"""
# Load a subset of Wikipedia articles
try:
dataset = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")
except Exception as e:
print(f"Error loading dataset: {e}")
print("Try reducing the dataset size or check your internet connection")
return [], []
documents = []
metadata = []
    for article in dataset:
# Use article text (truncated for demo)
text = article['text'][:1000] # First 1000 characters
documents.append(text)
metadata.append({
"title": article['title'],
"id": article['id'],
"url": article['url']
})
return documents, metadata
if __name__ == "__main__":
# Initialize vector search
vs = ArchilVectorSearch()
# Load Wikipedia data
print("Loading Wikipedia dataset...")
documents, metadata = load_wikipedia_data()
# Create IVF index for better performance with large datasets
vs.create_index("ivf")
vs.add_documents(documents, metadata)
# Save to shared storage
vs.save_index("wikipedia_1k")
print("Wikipedia index created and saved to Archil disk")
Multi-Server Search Service
Create a simple search service that multiple servers can run:
# search_service.py
from vector_search import ArchilVectorSearch
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
import urllib.parse
class SearchHandler(BaseHTTPRequestHandler):
def __init__(self, *args, vector_search=None, **kwargs):
self.vector_search = vector_search
super().__init__(*args, **kwargs)
def do_GET(self):
if self.path.startswith('/search'):
# Parse query parameters
parsed = urllib.parse.urlparse(self.path)
params = urllib.parse.parse_qs(parsed.query)
query = params.get('q', [''])[0]
k = int(params.get('k', ['5'])[0])
if query:
results = self.vector_search.search(query, k)
response = {
"query": query,
"results": [
{
"document": doc,
"distance": dist,
"metadata": meta
}
for doc, dist, meta in results
]
}
else:
response = {"error": "No query provided"}
self.send_response(200)
self.send_header('Content-type', 'application/json')
self.end_headers()
self.wfile.write(json.dumps(response, indent=2).encode())
else:
self.send_response(404)
self.end_headers()
def run_search_service(index_name="ml_concepts", port=8000):
# Load index from Archil disk
vs = ArchilVectorSearch()
vs.load_index(index_name)
# Create handler with vector search instance
handler = lambda *args, **kwargs: SearchHandler(*args, vector_search=vs, **kwargs)
# Start server
server = HTTPServer(('0.0.0.0', port), handler)
print(f"Search service running on port {port}")
print(f"Try: curl 'http://localhost:{port}/search?q=neural%20networks&k=3'")
server.serve_forever()
if __name__ == "__main__":
run_search_service()
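Because the index lives on the shared Archil disk, the same service can run unchanged on every server that mounts it. A sketch, assuming the script is saved in /mnt/archil/faiss and the placeholder <server-ip> is filled in:
# On each server with the Archil disk mounted
source /mnt/archil/faiss/faiss-env/bin/activate
python /mnt/archil/faiss/search_service.py

# From any client that can reach a server
curl 'http://<server-ip>:8000/search?q=neural%20networks&k=3'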
Performance Optimization
Index Types Comparison
# benchmark_indices.py
import time
from vector_search import ArchilVectorSearch, create_sample_dataset
def benchmark_index_types():
documents, metadata = create_sample_dataset()
index_types = ["flat", "ivf", "hnsw"]
for index_type in index_types:
print(f"\nBenchmarking {index_type} index:")
vs = ArchilVectorSearch()
vs.create_index(index_type)
# Time index creation
start = time.time()
vs.add_documents(documents, metadata)
build_time = time.time() - start
# Time search
query = "machine learning algorithms"
start = time.time()
results = vs.search(query, k=5)
search_time = time.time() - start
print(f" Build time: {build_time:.4f}s")
print(f" Search time: {search_time:.4f}s")
print(f" Results: {len(results)}")
if __name__ == "__main__":
benchmark_index_types()
Monitoring and Maintenance
Monitor your FAISS indices on Archil:
# Check index files
ls -la /mnt/archil/faiss/indices/
# Monitor disk usage
du -sh /mnt/archil/faiss/*
# Check Archil status
archil status /mnt/archil
Advanced Features
Batch Processing
def batch_add_documents(vs, documents, batch_size=1000):
"""Add documents in batches for memory efficiency"""
for i in range(0, len(documents), batch_size):
batch = documents[i:i+batch_size]
vs.add_documents(batch)
print(f"Processed batch {i//batch_size + 1}")
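Hypothetical usage, pairing the helper with the Wikipedia loader from earlier (metadata is omitted because the helper adds documents only):
from vector_search import ArchilVectorSearch
from load_dataset import load_wikipedia_data

vs = ArchilVectorSearch()
vs.create_index("ivf")
documents, _ = load_wikipedia_data()
batch_add_documents(vs, documents, batch_size=500)
vs.save_index("wikipedia_batched")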
Index Merging
def merge_indices(index_names, output_name):
"""Merge multiple FAISS indices"""
vs = ArchilVectorSearch()
vs.create_index("flat")
for index_name in index_names:
temp_vs = ArchilVectorSearch()
temp_vs.load_index(index_name)
        # Re-encode and add the loaded documents (simple, but re-embeds everything)
        vs.add_documents(temp_vs.documents, temp_vs.metadata)
vs.save_index(output_name)
print(f"Merged {len(index_names)} indices into {output_name}")
Benefits of FAISS with Archil
- Shared Indices: Multiple servers access the same vector indices without duplication
- Fast Loading: Indices load quickly from Archil’s high-speed cache
- Scalable Storage: Store large vector databases that exceed local disk capacity
- Multi-Model Support: Share different embedding models across your infrastructure
- Cost Efficient: Eliminate the need to replicate large indices across servers
Next Steps
- Explore GPU acceleration with faiss-gpu
- Implement real-time index updates
- Add index compression techniques
- Integrate with production search systems
- Scale to billion-vector datasets