FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. This guide demonstrates how to use Archil disks to store and share FAISS indices across multiple servers, enabling scalable vector search for AI applications like recommendation systems, semantic search, and similarity matching.

Create an Archil disk

First, follow the Archil Getting Started Guide to create an Archil disk that you want to use for storing your FAISS indices and embeddings.
# Mount your Archil disk
sudo mkdir -p /mnt/archil
sudo archil mount dsk-DISKID /mnt/archil --region aws-us-east-1 --auth-token TOKEN --shared

# Create the directory for FAISS data
sudo mkdir -p /mnt/archil/faiss
sudo chown -R $USER:$USER /mnt/archil/faiss
The --shared flag enables multiple servers to access the same vector indices simultaneously.

Install Dependencies

Create a Python virtual environment and install FAISS with dependencies:
cd /mnt/archil/faiss
python -m venv faiss-env
source faiss-env/bin/activate

# Install FAISS and related packages
pip install faiss-cpu  # or faiss-gpu for GPU support
pip install numpy sentence-transformers
pip install datasets pandas scikit-learn
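
Verify that FAISS imports cleanly before continuing:
python -c "import faiss; print(faiss.__version__)"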

Set Up Directory Structure

Create directories for organizing your vector search data:
mkdir -p /mnt/archil/faiss/{indices,embeddings,data,models}

Understanding FAISS Index Types

FAISS offers several index types, each optimized for different use cases. Choosing the right index depends on your dataset size, memory constraints, accuracy requirements, and search speed needs.
IndexFlatL2 and IndexFlatIP provide exact search results by comparing the query against every vector in the database.

When to use: Small to medium datasets (< 1M vectors), when you need 100% accuracy

Characteristics:
  • Memory usage: 4 bytes × dimension × number of vectors
  • Search time: Linear with dataset size
  • Accuracy: Perfect (exact results)
  • Training required: No
# Best for: High accuracy requirements, smaller datasets
index = faiss.IndexFlatL2(dimension)  # L2 distance
# or
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine similarity on normalized vectors)
Pros: Perfect accuracy, simple to use, no training required
Cons: Slow for large datasets, high memory usage
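
The approximate index types used later in this guide, IVF and HNSW, trade exact results for speed and memory. A minimal construction sketch (the nlist and M values are illustrative, and training_vectors stands in for a float32 array of sample embeddings):
# IVF partitions vectors into nlist clusters; it must be trained before adding vectors
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, 100)  # nlist=100
ivf_index.train(training_vectors)  # needs at least nlist training vectors

# HNSW builds a layered proximity graph; no training step required
hnsw_index = faiss.IndexHNSWFlat(dimension, 32)  # M=32 graph links per node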

Build a Vector Search System

Create a Python script for your FAISS-based vector search:
# vector_search.py
import os
import numpy as np
import faiss
import pickle
from sentence_transformers import SentenceTransformer
from typing import List, Tuple, Dict
import json

# Configure paths on Archil disk
INDICES_PATH = "/mnt/archil/faiss/indices"
EMBEDDINGS_PATH = "/mnt/archil/faiss/embeddings"
DATA_PATH = "/mnt/archil/faiss/data"
MODELS_PATH = "/mnt/archil/faiss/models"

class ArchilVectorSearch:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        # Initialize embedding model (cached on Archil disk)
        self.model = SentenceTransformer(
            model_name, 
            cache_folder=MODELS_PATH
        )
        self.dimension = self.model.get_sentence_embedding_dimension()
        
        # Initialize FAISS index
        self.index = None
        self.documents = []
        self.metadata = []
        
    def create_index(self, index_type="flat", nlist=100):
        """Create a new FAISS index"""
        if index_type == "flat":
            # Exact search (L2 distance)
            self.index = faiss.IndexFlatL2(self.dimension)
        elif index_type == "ivf":
            # Approximate search with IVF (Inverted File); must be trained
            # on at least `nlist` vectors before anything can be added
            quantizer = faiss.IndexFlatL2(self.dimension)
            self.index = faiss.IndexIVFFlat(quantizer, self.dimension, nlist)
        elif index_type == "hnsw":
            # Hierarchical Navigable Small World
            self.index = faiss.IndexHNSWFlat(self.dimension, 32)
        else:
            raise ValueError(f"Unknown index type: {index_type}")
            
        print(f"Created {index_type} index with dimension {self.dimension}")
    
    def add_documents(self, documents: List[str], metadata: List[Dict] = None):
        """Add documents to the vector index"""
        print(f"Encoding {len(documents)} documents...")
        
        # Generate embeddings
        embeddings = self.model.encode(documents, show_progress_bar=True)
        embeddings = embeddings.astype('float32')
        
        # Train index types that require it (e.g. IVF) before adding vectors
        if not self.index.is_trained:
            print("Training index...")
            self.index.train(embeddings)
        
        self.index.add(embeddings)
        
        # Store documents and metadata
        self.documents.extend(documents)
        if metadata:
            self.metadata.extend(metadata)
        else:
            # Auto-generate ids that stay unique across multiple add calls
            start = len(self.metadata)
            self.metadata.extend([{"id": start + i} for i in range(len(documents))])
        
        print(f"Added {len(documents)} documents. Total: {self.index.ntotal}")
    
    def search(self, query: str, k: int = 5) -> List[Tuple[str, float, Dict]]:
        """Search for similar documents"""
        if self.index is None or self.index.ntotal == 0:
            return []
        
        # Encode query
        query_embedding = self.model.encode([query]).astype('float32')
        
        # Search
        distances, indices = self.index.search(query_embedding, k)
        
        # Format results
        results = []
        for distance, idx in zip(distances[0], indices[0]):
            # FAISS pads missing results with -1, so guard both bounds
            if 0 <= idx < len(self.documents):
                results.append((
                    self.documents[idx],
                    float(distance),
                    self.metadata[idx]
                ))
        
        return results
    
    def save_index(self, index_name: str):
        """Save index and metadata to Archil disk"""
        index_path = os.path.join(INDICES_PATH, f"{index_name}.index")
        metadata_path = os.path.join(INDICES_PATH, f"{index_name}_metadata.pkl")
        documents_path = os.path.join(INDICES_PATH, f"{index_name}_documents.pkl")
        
        # Save FAISS index
        faiss.write_index(self.index, index_path)
        
        # Save metadata and documents
        with open(metadata_path, 'wb') as f:
            pickle.dump(self.metadata, f)
        
        with open(documents_path, 'wb') as f:
            pickle.dump(self.documents, f)
        
        print(f"Index saved to {index_path}")
    
    def load_index(self, index_name: str):
        """Load index and metadata from Archil disk"""
        index_path = os.path.join(INDICES_PATH, f"{index_name}.index")
        metadata_path = os.path.join(INDICES_PATH, f"{index_name}_metadata.pkl")
        documents_path = os.path.join(INDICES_PATH, f"{index_name}_documents.pkl")
        
        if not os.path.exists(index_path):
            raise FileNotFoundError(f"Index not found: {index_path}")
        
        # Load FAISS index
        self.index = faiss.read_index(index_path)
        
        # Load metadata and documents
        with open(metadata_path, 'rb') as f:
            self.metadata = pickle.load(f)
        
        with open(documents_path, 'rb') as f:
            self.documents = pickle.load(f)
        
        print(f"Loaded index with {self.index.ntotal} vectors")

def create_sample_dataset():
    """Create a sample dataset for demonstration"""
    documents = [
        "Machine learning is a subset of artificial intelligence",
        "Deep learning uses neural networks with multiple layers",
        "Natural language processing helps computers understand text",
        "Computer vision enables machines to interpret visual information",
        "Reinforcement learning trains agents through rewards and penalties",
        "Supervised learning uses labeled data for training",
        "Unsupervised learning finds patterns in unlabeled data",
        "Transfer learning adapts pre-trained models to new tasks",
        "Feature engineering improves model performance",
        "Cross-validation helps evaluate model generalization"
    ]
    
    metadata = [
        {"category": "AI", "topic": "ML Basics", "id": i} 
        for i in range(len(documents))
    ]
    
    return documents, metadata

if __name__ == "__main__":
    # Initialize vector search system
    vs = ArchilVectorSearch()
    
    # Create sample dataset
    documents, metadata = create_sample_dataset()
    
    # Create and populate index
    vs.create_index("flat")
    vs.add_documents(documents, metadata)
    
    # Save to Archil disk
    vs.save_index("ml_concepts")
    
    # Demonstrate search
    queries = [
        "neural networks and deep learning",
        "training with labeled examples",
        "computer understanding of images"
    ]
    
    for query in queries:
        print(f"\nQuery: {query}")
        results = vs.search(query, k=3)
        
        for i, (doc, distance, meta) in enumerate(results, 1):
            print(f"{i}. Distance: {distance:.4f}")
            print(f"   Document: {doc}")
            print(f"   Metadata: {meta}")

Load Sample Data

Create a script to load and index a larger dataset:
# load_dataset.py
from datasets import load_dataset
from vector_search import ArchilVectorSearch

def load_wikipedia_data():
    """Load Wikipedia dataset for indexing"""
    # Load a subset of Wikipedia articles
    try:
        dataset = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")
    except Exception as e:
        print(f"Error loading dataset: {e}")
        print("Try reducing the dataset size or check your internet connection")
        return [], []
    
    documents = []
    metadata = []
    
    for article in dataset:
        # Use article text (truncated for demo)
        text = article['text'][:1000]  # First 1000 characters
        documents.append(text)
        
        metadata.append({
            "title": article['title'],
            "id": article['id'],
            "url": article['url']
        })
    
    return documents, metadata

if __name__ == "__main__":
    # Initialize vector search
    vs = ArchilVectorSearch()
    
    # Load Wikipedia data
    print("Loading Wikipedia dataset...")
    documents, metadata = load_wikipedia_data()
    if not documents:
        raise SystemExit("No documents loaded; aborting")
    
    # Create IVF index for better performance with large datasets
    vs.create_index("ivf")
    vs.add_documents(documents, metadata)
    
    # Save to shared storage
    vs.save_index("wikipedia_1k")
    print("Wikipedia index created and saved to Archil disk")

Multi-Server Search Service

Create a simple search service that multiple servers can run:
# search_service.py
from vector_search import ArchilVectorSearch
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
import urllib.parse

class SearchHandler(BaseHTTPRequestHandler):
    def __init__(self, *args, vector_search=None, **kwargs):
        self.vector_search = vector_search
        super().__init__(*args, **kwargs)
    
    def do_GET(self):
        if self.path.startswith('/search'):
            # Parse query parameters
            parsed = urllib.parse.urlparse(self.path)
            params = urllib.parse.parse_qs(parsed.query)
            
            query = params.get('q', [''])[0]
            k = int(params.get('k', ['5'])[0])
            
            if query:
                results = self.vector_search.search(query, k)
                response = {
                    "query": query,
                    "results": [
                        {
                            "document": doc,
                            "distance": dist,
                            "metadata": meta
                        }
                        for doc, dist, meta in results
                    ]
                }
                status = 200
            else:
                response = {"error": "No query provided"}
                status = 400
            
            self.send_response(status)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps(response, indent=2).encode())
        else:
            self.send_response(404)
            self.end_headers()

def run_search_service(index_name="ml_concepts", port=8000):
    # Load index from Archil disk
    vs = ArchilVectorSearch()
    vs.load_index(index_name)
    
    # Create handler with vector search instance
    handler = lambda *args, **kwargs: SearchHandler(*args, vector_search=vs, **kwargs)
    
    # Start server
    server = HTTPServer(('0.0.0.0', port), handler)
    print(f"Search service running on port {port}")
    print(f"Try: curl 'http://localhost:{port}/search?q=neural%20networks&k=3'")
    server.serve_forever()

if __name__ == "__main__":
    run_search_service()
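
Because the index lives on the shared Archil disk, the same command works on every server that has the disk mounted:
# On each server with /mnt/archil mounted
python search_service.py

# Query any of them
curl 'http://localhost:8000/search?q=neural%20networks&k=3'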

Performance Optimization

Index Types Comparison

# benchmark_indices.py
import time
from vector_search import ArchilVectorSearch, create_sample_dataset

def benchmark_index_types():
    documents, metadata = create_sample_dataset()
    
    index_types = ["flat", "ivf", "hnsw"]
    
    for index_type in index_types:
        print(f"\nBenchmarking {index_type} index:")
        
        vs = ArchilVectorSearch()
        # IVF training needs at least `nlist` vectors; shrink nlist for this tiny sample set
        if index_type == "ivf":
            vs.create_index(index_type, nlist=4)
        else:
            vs.create_index(index_type)
        
        # Time index creation
        start = time.time()
        vs.add_documents(documents, metadata)
        build_time = time.time() - start
        
        # Time search
        query = "machine learning algorithms"
        start = time.time()
        results = vs.search(query, k=5)
        search_time = time.time() - start
        
        print(f"  Build time: {build_time:.4f}s")
        print(f"  Search time: {search_time:.4f}s")
        print(f"  Results: {len(results)}")

if __name__ == "__main__":
    benchmark_index_types()
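
Run the comparison:
python benchmark_indices.py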

Monitoring and Maintenance

Monitor your FAISS indices on Archil:
# Check index files
ls -la /mnt/archil/faiss/indices/

# Monitor disk usage
du -sh /mnt/archil/faiss/*

# Check Archil status
archil status /mnt/archil
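
You can also inspect a saved index directly from Python, without loading the documents or the embedding model (shown here for the ml_concepts index created earlier):
import faiss

index = faiss.read_index("/mnt/archil/faiss/indices/ml_concepts.index")
print(f"vectors: {index.ntotal}, dimension: {index.d}, trained: {index.is_trained}")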

Advanced Features

Batch Processing

def batch_add_documents(vs, documents, metadata=None, batch_size=1000):
    """Add documents in batches for memory efficiency"""
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        batch_meta = metadata[i:i+batch_size] if metadata else None
        vs.add_documents(batch, batch_meta)
        print(f"Processed batch {i//batch_size + 1}")

Index Merging

def merge_indices(index_names, output_name):
    """Merge multiple FAISS indices"""
    vs = ArchilVectorSearch()
    vs.create_index("flat")
    
    for index_name in index_names:
        temp_vs = ArchilVectorSearch()
        temp_vs.load_index(index_name)
        
        # Re-embed the loaded documents into the merged index (simple, but slow for large indices)
        vs.add_documents(temp_vs.documents, temp_vs.metadata)
    
    vs.save_index(output_name)
    print(f"Merged {len(index_names)} indices into {output_name}")

Benefits of FAISS with Archil

  • Shared Indices: Multiple servers access the same vector indices without duplication
  • Fast Loading: Indices load quickly from Archil’s high-speed cache
  • Scalable Storage: Store large vector databases that exceed local disk capacity
  • Multi-Model Support: Share different embedding models across your infrastructure
  • Cost Efficient: Eliminate the need to replicate large indices across servers

Next Steps

  • Explore GPU acceleration with faiss-gpu
  • Implement real-time index updates
  • Add index compression techniques
  • Integrate with production search systems
  • Scale to billion-vector datasets

This setup provides a robust foundation for building production vector search systems with FAISS and Archil’s shared storage capabilities.