# Vector Database

A vector database built in pure Python (no NumPy), with FastAPI and Cohere embeddings.

## Features

- **Two Indexing Strategies**: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World).
- **Cohere Integration**: Vector embeddings come from Cohere's embed-v4.0 model, with caching.
- **Flexible Chunking**: Configurable text chunking with overlap.
- **RESTful API**: Complete FastAPI-based REST API with automatic documentation.
- **Interactive UI**: Marimo-based web applications for easy database management.
- **Persistent Storage**: JSON-based file storage with configurable paths.

## Architecture

### Core Concepts

- **Libraries**: Collections of documents with configurable embedding and chunking settings
- **Documents**: Text content that gets chunked and embedded for search
- **Chunks**: Individual text segments with their vector embeddings
- **Index Strategies**: Two available algorithms for efficient similarity search

### Indexing Algorithms

1. **Blob Index**: Hierarchical clustering with automatic splitting when clusters exceed size limits
2. **HNSW Index**: Graph-based approximate nearest neighbor search for high-performance queries

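To make the blob idea concrete, here is a minimal pure-Python sketch: each new vector is routed to the blob whose centroid is most similar, and a blob is split in two once it grows past a size limit. This is not the project's implementation. `BlobIndex`, `limit`, and the split rule (seed the halves with the first item and the item least similar to it) are assumptions made for illustration, and the sketch is flat rather than hierarchical to keep it short.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, in pure Python."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class BlobIndex:
    """Hypothetical blob index: a flat list of blobs, each a list of vectors."""

    def __init__(self, limit=4):
        self.limit = limit  # assumed meaning: max items per blob before a split
        self.blobs = []

    def add(self, vector):
        if not self.blobs:
            self.blobs.append([vector])
            return
        # Route the vector to the blob whose centroid is most similar.
        # (Centroids are recomputed on every call; a real index would cache them.)
        best = max(self.blobs, key=lambda b: cosine(centroid(b), vector))
        best.append(vector)
        if len(best) > self.limit:
            self._split(best)

    def _split(self, blob):
        # Seed two halves with the first item and the item least similar to it,
        # then assign every remaining item to the more similar seed.
        idx_b = min(range(1, len(blob)), key=lambda i: cosine(blob[0], blob[i]))
        seed_a, seed_b = blob[0], blob[idx_b]
        left, right = [seed_a], [seed_b]
        for i, v in enumerate(blob):
            if i in (0, idx_b):
                continue
            target = left if cosine(seed_a, v) >= cosine(seed_b, v) else right
            target.append(v)
        self.blobs = [b for b in self.blobs if b is not blob] + [left, right]

    def search(self, query, k=3):
        if not self.blobs:
            return []
        # Approximate search: scan only the blob with the most similar centroid.
        best = max(self.blobs, key=lambda b: cosine(centroid(b), query))
        return sorted(best, key=lambda v: cosine(v, query), reverse=True)[:k]
```

Scanning a single blob is what makes the search approximate: a true nearest neighbor that landed in a different blob is missed, which is the usual trade-off for not comparing the query against every stored vector.
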
## Installation

Set your Cohere API key as an environment variable, for example in a `.env` file:

```bash
export CO_API_KEY="your-cohere-api-key"
```

To install the dependencies, run `uv sync`. To start the API, run `just api` (the project uses a Justfile).

To run or modify the Marimo notebooks, run `just marimo src/notebook/<notebook-name>.py`, which opens them as notebooks.

## Quick Usage Example

Using the `requests` library, start a script with:

```python
import requests

API_URL = "http://localhost:8000"
library_name = "my-documents"
```

Then follow whichever of the steps below you need.

#### Create a Library

```python
# Create a new library
library_config = {
    "slug": library_name,
    "embedding_size": 512,
    "chunk_size": 256,
    "chunk_overlap": 64,
    "index": "blob",
}

response = requests.post(f"{API_URL}/library/create", json=library_config)
response.raise_for_status()  # raise on a 4xx/5xx response
```

#### Inspect the Library

```python
# GET, matching the `GET /library/{slug}` route listed under API Endpoints
response = requests.get(f"{API_URL}/library/{library_name}")
```

#### Add a Document

```python
document = {
    "slug": "sample-doc",
    "content": "Your document content goes here...",
}

requests.post(f"{API_URL}/library/{library_name}/documents", json=document)
```

#### Do a Search

```python
# Search the library
query = {
    "query": "search terms",
    "results": 5,
}

response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
results = response.json()
```

## Configuration

### Library Settings

- **embedding_size**: Vector dimensions (256, 512, 1024, 1536)
- **chunk_size**: Maximum characters per text chunk
- **chunk_overlap**: Overlap between consecutive chunks
- **index**: Indexing algorithm ("blob" or "hnsw")
- **index_blob_limit**: Maximum items per blob before splitting

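The interaction between `chunk_size` and `chunk_overlap` is easiest to see as a sliding window. The sketch below assumes the overlap is measured in characters, like `chunk_size`; `chunk_text` is a hypothetical helper, and the project's own chunker may split on different boundaries.

```python
def chunk_text(text, chunk_size=256, chunk_overlap=64):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values from the usage example (256 and 64), each window advances by 192 characters, so the last 64 characters of one chunk reappear at the start of the next.
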
### Environment Variables

- **CO_API_KEY**: Your Cohere API key (required)
- **COHERE_CACHE**: Enable/disable embedding caching (default: true)
- **STORAGE_PATH**: Directory for data storage (default: "./data/")

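One way these three variables could be read, with the defaults listed above, is sketched here. `load_settings` is a hypothetical helper rather than the contents of `settings.py`, and the set of strings treated as "disabled" for `COHERE_CACHE` is an assumption.

```python
import os
from pathlib import Path


def load_settings(env=os.environ):
    """Read the three documented variables from a mapping (os.environ by default)."""
    api_key = env.get("CO_API_KEY")
    if not api_key:
        raise RuntimeError("CO_API_KEY is required")
    # Assumption: these spellings disable caching; anything else enables it.
    cache_raw = env.get("COHERE_CACHE", "true").strip().lower()
    cache_enabled = cache_raw not in {"false", "0", "no", "off"}
    storage_path = Path(env.get("STORAGE_PATH", "./data/"))
    return {
        "api_key": api_key,
        "cache_enabled": cache_enabled,
        "storage_path": storage_path,
    }
```

Passing the mapping as an argument keeps the function easy to exercise with a plain dictionary instead of the real process environment.
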
## API Endpoints

### Libraries

- `GET /library/list` - List all libraries
- `POST /library/create` - Create a new library
- `GET /library/{slug}` - Get library details

### Documents

- `POST /library/{slug}/documents` - Add a document
- `POST /library/{slug}/search` - Search documents

### Web Interface

- `/` - Main dashboard with links to all UI components
- `/docs` - Interactive API documentation
- `/app/create` - Library creation interface
- `/app/insert` - Document insertion interface
- `/app/search` - Search interface
- `/app/seed` - Database seeding with sample data

### Embedding Caching

The system automatically caches Cohere embeddings to reduce API calls and improve performance. Cached embeddings are stored in the `.cohere_embedding` directory within your storage path.

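A file-based cache of this kind can be sketched in a few lines. The directory name comes from the description above; everything else is an assumption for illustration: `cached_embedding` is a hypothetical helper, `embed_fn` stands in for the real Cohere call, and the key scheme (SHA-256 of model name plus text, one JSON file per entry) may differ from what `embedding.py` does.

```python
import hashlib
import json
from pathlib import Path


def cached_embedding(text, embed_fn, storage_path="./data/", model="embed-v4.0"):
    """Return the embedding for text, calling embed_fn only on a cache miss.

    embed_fn is a stand-in for the real Cohere call: any callable that maps
    a string to a list of floats.
    """
    cache_dir = Path(storage_path) / ".cohere_embedding"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumed key scheme: hash of model name plus text, one JSON file per entry.
    key = hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    vector = embed_fn(text)
    cache_file.write_text(json.dumps(vector), encoding="utf-8")
    return vector
```

Because libraries can use different `embedding_size` values, a fuller version would probably fold the requested dimension into the cache key as well, so that a 512-dimensional vector is never returned for a 1024-dimensional request.
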
## Development

### Project Structure

```
├── api.py          # FastAPI application and routes
├── embedding.py    # Cohere integration and caching
├── index.py        # Indexing strategies (Blob and HNSW)
├── model.py        # Core data models and business logic
├── settings.py     # Configuration management
├── utils.py        # Utility functions and vector operations
└── notebook/       # Marimo web applications
    ├── app-create.py
    ├── app-insert.py
    ├── app-search.py
    └── app-seed.py
```

### Key Classes

- **Library**: Main container for documents and configuration
- **Document**: Represents a text document with metadata
- **Chunk**: Individual text segment with embedding vector
- **IndexStrategy**: Abstract interface for search algorithms
- **BlobIndexStrategy**: Clustering-based search implementation
- **HNSWIndexStrategy**: Graph-based search implementation

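For readers new to the pattern, an abstract strategy interface with interchangeable implementations might look like the sketch below. The method names `add` and `search` and their signatures are guesses, not the project's actual API, and `BruteForceIndexStrategy` is an exact baseline invented for illustration; it is not one of the classes listed above.

```python
from abc import ABC, abstractmethod


class IndexStrategy(ABC):
    """Hypothetical shape of the abstract interface; method names are assumed."""

    @abstractmethod
    def add(self, chunk_id, vector):
        """Store a vector under the given chunk identifier."""

    @abstractmethod
    def search(self, query, k):
        """Return the identifiers of the k best-matching chunks."""


class BruteForceIndexStrategy(IndexStrategy):
    """Exact baseline (not part of the project): score every stored vector."""

    def __init__(self):
        self._items = []  # list of (chunk_id, vector) pairs

    def add(self, chunk_id, vector):
        self._items.append((chunk_id, vector))

    def search(self, query, k):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        # Rank by dot product with the query, highest first.
        ranked = sorted(self._items, key=lambda item: dot(item[1], query), reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]
```

Keeping the interface this small is what lets a library switch between "blob" and "hnsw" through a single configuration value: the calling code depends only on the abstract methods.
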