
# Vector Database

A vector database built in pure Python (no NumPy), with FastAPI for the API and Cohere for embeddings.

## Features

- **Two Indexing Strategies**: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World)
- **Cohere Integration**: Vector embeddings from Cohere's embed-v4.0 model, with caching
- **Flexible Chunking**: Configurable text chunking with overlap
- **RESTful API**: Complete FastAPI-based REST API with automatic documentation
- **Interactive UI**: Marimo-based web applications for easy database management
- **Persistent Storage**: JSON-based file storage with configurable paths

## Architecture

### Core Concepts

- **Libraries**: Collections of documents with configurable embedding and chunking settings
- **Documents**: Text content that gets chunked and embedded for search
- **Chunks**: Individual text segments with their vector embeddings
- **Index Strategies**: Two available algorithms for efficient similarity search

### Indexing Algorithms

1. **Blob Index**: Hierarchical clustering with automatic splitting when clusters exceed size limits
2. **HNSW Index**: Graph-based approximate nearest neighbor search for high-performance queries
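
To make the blob idea concrete, here is a minimal sketch of a cluster index that splits a cluster once it grows past a size limit. It is a flat, single-level simplification and not the project's implementation: the class name `FlatBlobIndex`, the use of cosine similarity, and the seed-based split rule are all assumptions made for illustration.

```python
import math

Vector = list[float]
Item = tuple[str, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(items: list[Item]) -> Vector:
    """Component-wise mean of the vectors in a non-empty blob."""
    size = len(items)
    return [sum(column) / size for column in zip(*(vec for _, vec in items))]


class FlatBlobIndex:
    """Hypothetical flat cluster index with split-on-overflow."""

    def __init__(self, limit: int) -> None:
        self.limit = limit  # plays the role of index_blob_limit
        self.blobs: list[list[Item]] = []

    def add(self, key: str, vector: Vector) -> None:
        if not self.blobs:
            self.blobs.append([(key, vector)])
            return
        # Put the vector in the blob whose centroid is most similar to it.
        blob = max(self.blobs, key=lambda b: cosine(centroid(b), vector))
        blob.append((key, vector))
        if len(blob) > self.limit:
            self._split(blob)

    def _split(self, blob: list[Item]) -> None:
        # Seeds: the first item and the item least similar to it.
        seed_a = blob[0][1]
        seed_b = min(blob, key=lambda item: cosine(seed_a, item[1]))[1]
        left: list[Item] = []
        right: list[Item] = []
        for item in blob:
            closer_to_a = cosine(seed_a, item[1]) >= cosine(seed_b, item[1])
            (left if closer_to_a else right).append(item)
        self.blobs = [b for b in self.blobs if b is not blob]
        # Identical vectors can leave one half empty; keep only non-empty halves.
        self.blobs.extend(half for half in (left, right) if half)

    def search(self, query: Vector, k: int) -> list[str]:
        if not self.blobs:
            return []
        # Approximate search: only the most promising blob is scanned.
        best = max(self.blobs, key=lambda b: cosine(centroid(b), query))
        ranked = sorted(best, key=lambda item: cosine(query, item[1]), reverse=True)
        return [key for key, _ in ranked[:k]]
```

Scanning a single blob keeps queries cheap but can miss neighbours that sit just across a cluster boundary, which is the usual trade-off of approximate search.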

## Installation


Set your Cohere API key in the `.env` environment file:

```bash
export CO_API_KEY="your-cohere-api-key"
```

Install the dependencies with `uv sync`, then start the API with `just api` (the project uses a Justfile).

To run or edit a marimo notebook, use `just marimo src/notebook/<notebook-name>.py`.


## Quick Usage Example

With the `requests` library, start a script with:

```python
import requests
API_URL = "http://localhost:8000"
library_name = "my-documents"
```

Then follow whichever of the steps below you need.


#### Create a Library

```python
"""
Create a new library
"""
library_config = {
    "slug": library_name,
    "embedding_size": 512,
    "chunk_size": 256,
    "chunk_overlap": 64,
    "index": "blob"
}

response = requests.post(f"{API_URL}/library/create", json=library_config)
```


#### Inspect the Library

```python
response = requests.get(f"{API_URL}/library/{library_name}")
```

#### Add a Document

```python
document = {
    "slug": "sample-doc",
    "content": "Your document content goes here..."
}

requests.post(f"{API_URL}/library/{library_name}/documents", json=document)
```

#### Do a Search

```python
# Search the library
query = {
    "query": "search terms",
    "results": 5
}

response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
results = response.json()
```

## Configuration

### Library Settings

- **embedding_size**: Vector dimensions (one of 256, 512, 1024, or 1536)
- **chunk_size**: Maximum characters per text chunk
- **chunk_overlap**: Overlap between consecutive chunks
- **index**: Indexing algorithm ("blob" or "hnsw")
- **index_blob_limit**: Maximum items per blob before splitting
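
As an illustration of how `chunk_size` and `chunk_overlap` interact, the sketch below splits text into fixed-size character windows. The function name `chunk_text` is hypothetical, and it assumes the overlap is also measured in characters; the project may split on word or sentence boundaries instead.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values from the quick usage example (`chunk_size=256`, `chunk_overlap=64`), each window would advance by 192 characters under this scheme.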

### Environment Variables

- **CO_API_KEY**: Your Cohere API key (required)
- **COHERE_CACHE**: Enable/disable embedding caching (default: true)
- **STORAGE_PATH**: Directory for data storage (default: "./data/")
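
For reference, these variables could be read along the following lines. This is a sketch only: the `Settings` dataclass, the `load_settings` function, and the accepted spellings of a true value are assumptions, and the project's `settings.py` may work differently.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    co_api_key: str
    cohere_cache: bool
    storage_path: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env

    api_key = env.get("CO_API_KEY")
    if not api_key:
        raise RuntimeError("CO_API_KEY is required")

    # Assumed truthy spellings; anything else disables the cache.
    cache_raw = env.get("COHERE_CACHE", "true")
    cache_enabled = cache_raw.strip().lower() in {"1", "true", "yes", "on"}

    return Settings(
        co_api_key=api_key,
        cohere_cache=cache_enabled,
        storage_path=env.get("STORAGE_PATH", "./data/"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in a test.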

## API Endpoints

### Libraries
- `GET /library/list` - List all libraries
- `POST /library/create` - Create a new library
- `GET /library/{slug}` - Get library details

### Documents
- `POST /library/{slug}/documents` - Add a document
- `POST /library/{slug}/search` - Search documents
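
If `requests` is not available, the same endpoints can be reached with the standard library. The helpers `build_request` and `send` below are hypothetical names, and `send` assumes the API returns JSON bodies, as the search example suggests.

```python
import json
import urllib.request
from typing import Any, Optional

API_URL = "http://localhost:8000"


def build_request(
    method: str, path: str, payload: Optional[dict] = None
) -> urllib.request.Request:
    """Prepare a request for one of the endpoints listed above."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{API_URL}{path}", data=data, headers=headers, method=method
    )


def send(request: urllib.request.Request) -> Any:
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building requests does not touch the network; call send() to execute them.
list_request = build_request("GET", "/library/list")
search_request = build_request(
    "POST",
    "/library/my-documents/search",
    {"query": "search terms", "results": 5},
)
```

With the API running, `send(list_request)` would perform the call.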

### Web Interface
- `/` - Main dashboard with links to all UI components
- `/docs` - Interactive API documentation
- `/app/create` - Library creation interface
- `/app/insert` - Document insertion interface
- `/app/search` - Search interface
- `/app/seed` - Database seeding with sample data

### Embedding Caching

The system automatically caches Cohere embeddings to reduce API calls and improve performance. Cached embeddings are stored in the `.cohere_embedding` directory within your storage path.
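
One way such a cache can work is sketched below: each text is hashed, and its embedding is stored as a JSON file named after the hash. The function `cached_embed`, the SHA-256 key, and the one-file-per-text layout are assumptions, not a description of the project's `embedding.py`. A real cache key would likely also need to include the model name and embedding size.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Union


def cached_embed(
    text: str,
    embed_fn: Callable[[str], list[float]],
    cache_dir: Union[str, Path],
) -> list[float]:
    """Return the embedding for text, calling embed_fn only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical key scheme: SHA-256 of the UTF-8 text.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    vector = embed_fn(text)  # e.g. a wrapper around the Cohere client
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector
```

Because `embed_fn` is passed in, the cache can be exercised with a stub function and no API key.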

## Development

### Project Structure

```
├── api.py             # FastAPI application and routes
├── embedding.py       # Cohere integration and caching
├── index.py           # Indexing strategies (Blob and HNSW)
├── model.py           # Core data models and business logic
├── settings.py        # Configuration management
├── utils.py           # Utility functions and vector operations
└── notebook/          # Marimo web applications
    ├── app-create.py
    ├── app-insert.py
    ├── app-search.py
    └── app-seed.py
```

### Key Classes

- **Library**: Main container for documents and configuration
- **Document**: Represents a text document with metadata
- **Chunk**: Individual text segment with embedding vector
- **IndexStrategy**: Abstract interface for search algorithms
- **BlobIndexStrategy**: Clustering-based search implementation
- **HNSWIndexStrategy**: Graph-based search implementation