# Vector Database
A vector database built with pure Python (no NumPy), FastAPI, and Cohere embeddings.
## Features
- **Two Indexing Strategies**: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World)
- **Cohere Integration**: Vector embeddings from Cohere's embed-v4.0 model, with caching
- **Flexible Chunking**: Configurable text chunking with overlap
- **RESTful API**: Complete FastAPI-based REST API with automatic documentation
- **Interactive UI**: Marimo-based web applications for easy database management
- **Persistent Storage**: JSON-based file storage with configurable paths
## Architecture
### Core Concepts
- **Libraries**: Collections of documents with configurable embedding and chunking settings
- **Documents**: Text content that gets chunked and embedded for search
- **Chunks**: Individual text segments with their vector embeddings
- **Index Strategies**: Two available algorithms for efficient similarity search
### Indexing Algorithms
1. **Blob Index**: Hierarchical clustering with automatic splitting when clusters exceed size limits
2. **HNSW Index**: Graph-based approximate nearest neighbor search for high-performance queries
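To make the blob idea concrete, here is a minimal pure-Python sketch of clustering with automatic splitting. It is an illustration under stated assumptions, not the project's implementation: the function names, the cosine metric, the seed-based split rule, and the `probes` parameter are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, in pure Python."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def split_blob(vectors):
    """Assumed split rule: partition around two mutually dissimilar seeds."""
    seed_b = min(vectors, key=lambda v: cosine(vectors[0], v))
    seed_a = min(vectors, key=lambda v: cosine(seed_b, v))
    left, right = [], []
    for v in vectors:
        (left if cosine(v, seed_a) >= cosine(v, seed_b) else right).append(v)
    if not left or not right:  # degenerate case (e.g. identical vectors)
        merged = left + right
        half = len(merged) // 2
        left, right = merged[:half], merged[half:]
    return [left, right]


def insert(blobs, vector, limit):
    """Add a vector to the most similar blob; split it once it exceeds `limit`."""
    if not blobs:
        blobs.append([vector])
        return
    i = max(range(len(blobs)), key=lambda j: cosine(centroid(blobs[j]), vector))
    blobs[i].append(vector)
    if len(blobs[i]) > limit:
        blobs[i:i + 1] = split_blob(blobs[i])


def search(blobs, query, k=3, probes=1):
    """Scan only the `probes` blobs whose centroids best match the query."""
    ranked = sorted(blobs, key=lambda b: cosine(centroid(b), query), reverse=True)
    candidates = [v for blob in ranked[:probes] for v in blob]
    return sorted(candidates, key=lambda v: cosine(v, query), reverse=True)[:k]
```

Scanning only the closest blobs trades some recall for speed; raising `probes` in this sketch widens the scan toward an exhaustive search.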
## Installation
Set your Cohere API key in the `.env` environment file:
```bash
export CO_API_KEY="your-cohere-api-key"
```
Install the dependencies with `uv sync`, then start the API with `just api` (the project uses a Justfile).
To run or modify a marimo notebook, use `just marimo src/notebook/<notebook-name>.py`.
## Quick Usage Example
Using the `requests` library, start a script with:
```python
import requests
API_URL = "http://localhost:8000"
library_name = "my-documents"
```
Then follow whichever of the steps below you need.
#### Create a Library
```python
# Create a new library
library_config = {
    "slug": library_name,
    "embedding_size": 512,
    "chunk_size": 256,
    "chunk_overlap": 64,
    "index": "blob",
}
response = requests.post(f"{API_URL}/library/create", json=library_config)
```
#### Inspect the Library
```python
# Library details are served by GET /library/{slug}
response = requests.get(f"{API_URL}/library/{library_name}")
```
#### Add a Document
```python
document = {
    "slug": "sample-doc",
    "content": "Your document content goes here...",
}
requests.post(f"{API_URL}/library/{library_name}/documents", json=document)
```
#### Do a Search
```python
# Search the library
query = {
    "query": "search terms",
    "results": 5,
}
response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
results = response.json()
```
## Configuration
### Library Settings
- **embedding_size**: Vector dimensions (one of 256, 512, 1024, or 1536)
- **chunk_size**: Maximum characters per text chunk
- **chunk_overlap**: Overlap between consecutive chunks
- **index**: Indexing algorithm ("blob" or "hnsw")
- **index_blob_limit**: Maximum items per blob before splitting
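The interaction between `chunk_size` and `chunk_overlap` can be sketched as a sliding character window. This is only an illustration: the function name `chunk_text` is hypothetical, and the project's chunker may differ (for example, by respecting word boundaries).

```python
def chunk_text(text, chunk_size=256, chunk_overlap=64):
    """Split text into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters (assumed behavior).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=2`, the string `"abcdefghij"` yields `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of the previous one.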
### Environment Variables
- **CO_API_KEY**: Your Cohere API key (required)
- **COHERE_CACHE**: Enable/disable embedding caching (default: true)
- **STORAGE_PATH**: Directory for data storage (default: "./data/")
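As a rough sketch of how these variables and their defaults might be read, the snippet below uses only the standard library. The `Settings` class, the `load_settings` function, and the set of strings treated as "false" are assumptions for illustration; the project's `settings.py` may work differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    co_api_key: str
    cohere_cache: bool = True
    storage_path: str = "./data/"


def load_settings(env=None):
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("CO_API_KEY")
    if not api_key:
        raise RuntimeError("CO_API_KEY is required")
    # Assumed parsing rule: anything except a common "false" spelling enables caching.
    raw_cache = env.get("COHERE_CACHE", "true")
    cohere_cache = raw_cache.strip().lower() not in {"0", "false", "no", "off"}
    return Settings(api_key, cohere_cache, env.get("STORAGE_PATH", "./data/"))
```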
## API Endpoints
### Libraries
- `GET /library/list` - List all libraries
- `POST /library/create` - Create a new library
- `GET /library/{slug}` - Get library details
### Documents
- `POST /library/{slug}/documents` - Add a document
- `POST /library/{slug}/search` - Search documents
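The endpoints above can be wrapped in a small client. The class below is a hypothetical convenience wrapper, not part of the project; it assumes every endpoint returns JSON and makes no assumption about the shape of that JSON. The payload fields mirror the usage examples earlier in this README.

```python
class VectorDBClient:
    """Hypothetical thin wrapper around the documented REST endpoints."""

    def __init__(self, base_url="http://localhost:8000", session=None):
        if session is None:
            # Imported lazily so a stub session can be injected for testing.
            import requests
            session = requests.Session()
        self.base_url = base_url.rstrip("/")
        self.session = session

    def _call(self, method, path, payload=None):
        response = self.session.request(method, f"{self.base_url}{path}", json=payload)
        response.raise_for_status()  # surface HTTP errors instead of ignoring them
        return response.json()

    def list_libraries(self):
        return self._call("GET", "/library/list")

    def create_library(self, config):
        return self._call("POST", "/library/create", config)

    def get_library(self, slug):
        return self._call("GET", f"/library/{slug}")

    def add_document(self, slug, document_slug, content):
        payload = {"slug": document_slug, "content": content}
        return self._call("POST", f"/library/{slug}/documents", payload)

    def search(self, slug, query, results=5):
        payload = {"query": query, "results": results}
        return self._call("POST", f"/library/{slug}/search", payload)
```

For example, `VectorDBClient().search("my-documents", "search terms")` would issue the same request as the search example in the usage section.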
### Web Interface
- `/` - Main dashboard with links to all UI components
- `/docs` - Interactive API documentation
- `/app/create` - Library creation interface
- `/app/insert` - Document insertion interface
- `/app/search` - Search interface
- `/app/seed` - Database seeding with sample data
### Embedding Caching
The system automatically caches Cohere embeddings to reduce API calls and improve performance. Cached embeddings are stored in the `.cohere_embedding` directory within your storage path.
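One way such a cache could work is sketched below. Only the `.cohere_embedding` directory name comes from this README; the hashing scheme, the one-JSON-file-per-text layout, and the `cached_embed` and `embed_fn` names are assumptions. The real Cohere call is replaced by a stand-in function so the example stays self-contained.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_embed(text, embed_fn, storage_path="./data/", model="embed-v4.0", size=512):
    """Return the embedding for `text`, calling `embed_fn` only on a cache miss."""
    cache_dir = Path(storage_path) / ".cohere_embedding"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumed key: model, dimension, and text all affect the resulting vector.
    key = hashlib.sha256(f"{model}:{size}:{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    vector = list(embed_fn(text))
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector


# Usage with a stand-in embedder that records how often it is called.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 0.0]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_embed("hello", fake_embed, storage_path=tmp)
    second = cached_embed("hello", fake_embed, storage_path=tmp)  # served from disk
```

In this sketch the second call reads the stored JSON file, so `fake_embed` runs only once.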
## Development
### Project Structure
```
├── api.py          # FastAPI application and routes
├── embedding.py    # Cohere integration and caching
├── index.py        # Indexing strategies (Blob and HNSW)
├── model.py        # Core data models and business logic
├── settings.py     # Configuration management
├── utils.py        # Utility functions and vector operations
└── notebook/       # Marimo web applications
    ├── app-create.py
    ├── app-insert.py
    ├── app-search.py
    └── app-seed.py
```
### Key Classes
- **Library**: Main container for documents and configuration
- **Document**: Represents a text document with metadata
- **Chunk**: Individual text segment with embedding vector
- **IndexStrategy**: Abstract interface for search algorithms
- **BlobIndexStrategy**: Clustering-based search implementation
- **HNSWIndexStrategy**: Graph-based search implementation
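The relationship between `IndexStrategy` and its implementations might look roughly like the sketch below. The method names `add` and `search` are assumptions, and `FlatIndexStrategy` is a hypothetical exhaustive-scan strategy that does not exist in the project; it is included only to show what a concrete subclass has to provide.

```python
from abc import ABC, abstractmethod


class IndexStrategy(ABC):
    """Assumed contract; the real class may expose different method names."""

    @abstractmethod
    def add(self, chunk_id, vector):
        """Register a chunk's embedding vector under its identifier."""

    @abstractmethod
    def search(self, query, k):
        """Return the identifiers of the k chunks closest to the query vector."""


class FlatIndexStrategy(IndexStrategy):
    """Hypothetical exhaustive scan, shown only to illustrate the contract."""

    def __init__(self):
        self._items = []

    def add(self, chunk_id, vector):
        self._items.append((chunk_id, list(vector)))

    def search(self, query, k):
        def distance(vector):
            # Squared Euclidean distance, computed without NumPy.
            return sum((x - y) ** 2 for x, y in zip(vector, query))

        ranked = sorted(self._items, key=lambda item: distance(item[1]))
        return [chunk_id for chunk_id, _ in ranked[:k]]
```

Keeping the interface this small would let a library switch between strategies without changes to the calling code.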