# Vector Database

A vector database built with pure Python (no NumPy), FastAPI, and Cohere embeddings.

## Features

- **Two Indexing Strategies**: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World)
- **Cohere Integration**: Vector embeddings come from Cohere's embed-v4.0 model, with caching
- **Flexible Chunking**: Configurable text chunking with overlap
- **RESTful API**: Complete FastAPI-based REST API with automatic documentation
- **Interactive UI**: Marimo-based web applications for easy database management
- **Persistent Storage**: JSON-based file storage with configurable paths

## Architecture

### Core Concepts

- **Libraries**: Collections of documents with configurable embedding and chunking settings
- **Documents**: Text content that gets chunked and embedded for search
- **Chunks**: Individual text segments with their vector embeddings
- **Index Strategies**: Two available algorithms for efficient similarity search

### Indexing Algorithms

1. **Blob Index**: Hierarchical clustering with automatic splitting when clusters exceed size limits
2. **HNSW Index**: Graph-based approximate nearest neighbor search for high-performance queries

## Installation

Set the Cohere API key in the environment variable file `.env`:

```bash
export CO_API_KEY="your-cohere-api-key"
```

To install the dependencies, run `uv sync`. To start the API, run `just api` (the project uses a Justfile). To run or modify the marimo notebooks, run `just marimo src/notebook/.py`.

## Quick Usage Example

Using the `requests` library, start a script with:

```python
import requests

API_URL = "http://localhost:8000"
library_name = "my-documents"
```

Then, depending on what you want to do, follow the steps below.
#### Create A Library

```python
# Create a new library
library_config = {
    "slug": library_name,
    "embedding_size": 512,
    "chunk_size": 256,
    "chunk_overlap": 64,
    "index": "blob",
}
response = requests.post(f"{API_URL}/library/create", json=library_config)
```

#### Inspect the Library

```python
# The endpoint list below documents this route as GET
response = requests.get(f"{API_URL}/library/{library_name}")
```

#### Add a Document

```python
document = {
    "slug": "sample-doc",
    "content": "Your document content goes here...",
}
requests.post(f"{API_URL}/library/{library_name}/documents", json=document)
```

#### Do a Search

```python
# Search the library
query = {
    "query": "search terms",
    "results": 5,
}
response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
results = response.json()
```

## Configuration

### Library Settings

- **embedding_size**: Vector dimensions (256, 512, 1024, 1536)
- **chunk_size**: Maximum characters per text chunk
- **chunk_overlap**: Overlap between consecutive chunks
- **index**: Indexing algorithm ("blob" or "hnsw")
- **index_blob_limit**: Maximum items per blob before splitting

### Environment Variables

- **CO_API_KEY**: Your Cohere API key (required)
- **COHERE_CACHE**: Enable/disable embedding caching (default: true)
- **STORAGE_PATH**: Directory for data storage (default: "./data/")

## API Endpoints

### Libraries

- `GET /library/list` - List all libraries
- `POST /library/create` - Create a new library
- `GET /library/{slug}` - Get library details

### Documents

- `POST /library/{slug}/documents` - Add a document
- `POST /library/{slug}/search` - Search documents

### Web Interface

- `/` - Main dashboard with links to all UI components
- `/docs` - Interactive API documentation
- `/app/create` - Library creation interface
- `/app/insert` - Document insertion interface
- `/app/search` - Search interface
- `/app/seed` - Database seeding with sample data

### Embedding Caching

The system automatically caches Cohere embeddings to reduce API calls and improve performance.
Cached embeddings are stored in the `.cohere_embedding` directory within your storage path.

## Development

### Project Structure

```
├── api.py          # FastAPI application and routes
├── embedding.py    # Cohere integration and caching
├── index.py        # Indexing strategies (Blob and HNSW)
├── model.py        # Core data models and business logic
├── settings.py     # Configuration management
├── utils.py        # Utility functions and vector operations
└── notebook/       # Marimo web applications
    ├── app-create.py
    ├── app-insert.py
    ├── app-search.py
    └── app-seed.py
```

### Key Classes

- **Library**: Main container for documents and configuration
- **Document**: Represents a text document with metadata
- **Chunk**: Individual text segment with embedding vector
- **IndexStrategy**: Abstract interface for search algorithms
- **BlobIndexStrategy**: Clustering-based search implementation
- **HNSWIndexStrategy**: Graph-based search implementation
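
To illustrate how an abstract index interface and a concrete strategy might fit together, here is a minimal pure-Python sketch. It is not the project's actual code. The method names `add` and `search`, their signatures, the `cosine_similarity` helper, and the `BruteForceIndexStrategy` class are all assumptions made for illustration. The brute-force strategy is an exact-search baseline and is not one of the two strategies shipped with the project.

```python
import math
from abc import ABC, abstractmethod


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors, without NumPy."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


class IndexStrategy(ABC):
    """Hypothetical interface; the real class in index.py may differ."""

    @abstractmethod
    def add(self, chunk_id: str, vector: list[float]) -> None:
        """Store a chunk's embedding under its identifier."""

    @abstractmethod
    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        """Return up to k (chunk_id, score) pairs, best match first."""


class BruteForceIndexStrategy(IndexStrategy):
    """Illustrative baseline: scores every stored vector against the query."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._vectors[chunk_id] = vector

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        scored = [
            (chunk_id, cosine_similarity(query, vector))
            for chunk_id, vector in self._vectors.items()
        ]
        # Highest similarity first, then keep the top k
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = BruteForceIndexStrategy()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])

top = index.search([1.0, 0.1], k=2)
print([chunk_id for chunk_id, _ in top])  # prints ['a', 'c']
```

A clustering-based or graph-based strategy would presumably implement the same two methods while avoiding the full scan in `search`, which is what makes the strategies interchangeable behind one interface.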