Miguel Alejandro Salgado Zapien efb3ea003c Corrected readme.		2025-05-30 15:00:59 -07:00
src	Corrected readme.	2025-05-30 15:00:59 -07:00
.dockerignore	Big Bang	2025-05-30 14:42:08 -07:00
.gitignore	Big Bang	2025-05-30 14:42:08 -07:00
.python-version	Big Bang	2025-05-30 14:42:08 -07:00
docker-compose.yaml	Big Bang	2025-05-30 14:42:08 -07:00
Dockerfile	Big Bang	2025-05-30 14:42:08 -07:00
Justfile	Corrected readme.	2025-05-30 15:00:59 -07:00
pyproject.toml	Big Bang	2025-05-30 14:42:08 -07:00
README.md	Corrected readme.	2025-05-30 15:00:59 -07:00
uv.lock	Big Bang	2025-05-30 14:42:08 -07:00

Vector Database

A vector database written in pure Python (no NumPy), with a FastAPI REST API and Cohere embeddings.

Features

Two Indexing Strategies: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World).
Cohere Integration: Uses Cohere’s embed-v4.0 model, with caching, to generate vector embeddings.
Flexible Chunking: Configurable text chunking with overlap.
RESTful API: Complete FastAPI-based REST API with automatic documentation.
Interactive UI: Marimo-based web applications for easy database management.
Persistent Storage: JSON-based file storage with configurable paths.

Architecture

Core Concepts

Libraries: Collections of documents with configurable embedding and chunking settings
Documents: Text content that gets chunked and embedded for search
Chunks: Individual text segments with their vector embeddings
Index Strategies: Two available algorithms for efficient similarity search

Indexing Algorithms

Blob Index: Hierarchical clustering with automatic splitting when clusters exceed size limits
HNSW Index: Graph-based approximate nearest neighbor search for high-performance queries
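To make the blob idea concrete, here is a minimal sketch of cluster-and-split indexing in pure Python. It is an illustration under stated assumptions, not the project's BlobIndexStrategy: the class name `BlobSketch`, its methods, the `probes` parameter, the use of cosine similarity, and the flat (one-level, non-hierarchical) layout are all hypothetical simplifications.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(blob):
    """Component-wise mean of the vectors in a non-empty blob."""
    vectors = [vector for _, vector in blob]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class BlobSketch:
    """Hypothetical flat blob index: a list of blobs, each a list of (key, vector)."""

    def __init__(self, limit=4):
        self.limit = limit  # plays the role of index_blob_limit
        self.blobs = []

    def add(self, key, vector):
        if not self.blobs:
            self.blobs.append([(key, vector)])
            return
        # Put the item in the blob whose centroid is most similar to it.
        blob = max(self.blobs, key=lambda b: cosine(centroid(b), vector))
        blob.append((key, vector))
        if len(blob) > self.limit:
            self._split(blob)

    def _split(self, blob):
        # Seed two new blobs with the least similar pair, then reassign the rest.
        pairs = [
            (cosine(blob[i][1], blob[j][1]), i, j)
            for i in range(len(blob))
            for j in range(i + 1, len(blob))
        ]
        _, i, j = min(pairs)
        left, right = [blob[i]], [blob[j]]
        for k, item in enumerate(blob):
            if k in (i, j):
                continue
            closer_to_left = cosine(item[1], blob[i][1]) >= cosine(item[1], blob[j][1])
            (left if closer_to_left else right).append(item)
        self.blobs = [b for b in self.blobs if b is not blob] + [left, right]

    def search(self, query, results=5, probes=2):
        # Scan only the `probes` blobs whose centroids best match the query.
        ranked = sorted(self.blobs, key=lambda b: cosine(centroid(b), query), reverse=True)
        candidates = [item for blob in ranked[:probes] for item in blob]
        candidates.sort(key=lambda item: cosine(item[1], query), reverse=True)
        return [key for key, _ in candidates[:results]]
```

Scanning only a few blobs is what makes the search approximate: a true nearest neighbor sitting in an unprobed blob is missed, so raising `probes` trades speed for recall.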

Installation

Set your Cohere API key in the .env environment file:

export CO_API_KEY="your-cohere-api-key"

To install the dependencies, run uv sync. To start the API, run just api (the project uses a Justfile).

To run or modify the marimo notebooks, run just marimo src/notebook/<notebook-name>.py.

Quick Usage Example

The examples below use the requests library. Start a script with:

import requests
API_URL = "http://localhost:8000"
library_name = "my-documents"

Then follow whichever of the steps below you need.

Create a Library

# Create a new library
library_config = {
    "slug": library_name,
    "embedding_size": 512,
    "chunk_size": 256,
    "chunk_overlap": 64,
    "index": "blob",
}

response = requests.post(f"{API_URL}/library/create", json=library_config)

Inspect the Library

# GET /library/{slug} returns the library details
response = requests.get(f"{API_URL}/library/{library_name}")

Add a Document

document = {
    "slug": "sample-doc",
    "content": "Your document content goes here..."
}

requests.post(f"{API_URL}/library/{library_name}/documents", json=document)

Do a Search

# Search the library
query = {
    "query": "search terms",
    "results": 5
}

response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
results = response.json()

Configuration

Library Settings

embedding_size: Vector dimensions (256, 512, 1024, 1536)
chunk_size: Maximum characters per text chunk
chunk_overlap: Overlap between consecutive chunks
index: Indexing algorithm ("blob" or "hnsw")
index_blob_limit: Maximum items per blob before splitting
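The interaction between chunk_size and chunk_overlap can be sketched as a sliding character window. This is a minimal illustration that assumes plain character offsets; the project's chunker may well split on word or sentence boundaries instead, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=256, chunk_overlap=64):
    """Split text into chunks of at most chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With chunk_size=256 and chunk_overlap=64 the window advances 192 characters at a time, so a larger overlap produces more chunks (and more embedding calls) for the same document.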

Environment Variables

CO_API_KEY: Your Cohere API key (required)
COHERE_CACHE: Enable/disable embedding caching (default: true)
STORAGE_PATH: Directory for data storage (default: "./data/")
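As a rough sketch of how these three variables could be read with their documented defaults: the `Settings` dataclass and `load_settings` function below are hypothetical names, not the contents of settings.py, and the set of strings treated as "disabled" for COHERE_CACHE is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    co_api_key: str
    cohere_cache: bool
    storage_path: str


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env

    api_key = env.get("CO_API_KEY")
    if not api_key:
        raise RuntimeError("CO_API_KEY is required")

    # Assumption: these spellings switch the cache off; anything else leaves it on.
    raw_cache = env.get("COHERE_CACHE", "true").strip().lower()
    cohere_cache = raw_cache not in {"0", "false", "no", "off"}

    return Settings(
        co_api_key=api_key,
        cohere_cache=cohere_cache,
        storage_path=env.get("STORAGE_PATH", "./data/"),
    )
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.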

API Endpoints

Libraries

GET /library/list - List all libraries
POST /library/create - Create a new library
GET /library/{slug} - Get library details

Documents

POST /library/{slug}/documents - Add a document
POST /library/{slug}/search - Search documents
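The endpoints above can be wrapped in a small client. The sketch below is hypothetical: the class `LibraryClient` and its method names are not part of the project, and it assumes every endpoint answers with a JSON body. The `http` argument is any object with requests-style `get` and `post` methods, such as the `requests` module itself or a `requests.Session`.

```python
class LibraryClient:
    """Thin, hypothetical wrapper over the documented endpoints."""

    def __init__(self, http, base_url="http://localhost:8000"):
        self.http = http
        self.base_url = base_url.rstrip("/")

    def _call(self, method, path, payload=None):
        url = f"{self.base_url}{path}"
        if method == "GET":
            response = self.http.get(url)
        else:
            response = self.http.post(url, json=payload)
        response.raise_for_status()  # surface HTTP errors instead of parsing them
        return response.json()       # assumption: every endpoint returns JSON

    def list_libraries(self):
        return self._call("GET", "/library/list")

    def create_library(self, config):
        return self._call("POST", "/library/create", config)

    def get_library(self, slug):
        return self._call("GET", f"/library/{slug}")

    def add_document(self, slug, document):
        return self._call("POST", f"/library/{slug}/documents", document)

    def search(self, slug, query, results=5):
        payload = {"query": query, "results": results}
        return self._call("POST", f"/library/{slug}/search", payload)
```

With the requests library installed, `LibraryClient(requests).search("my-documents", "search terms")` would issue the same POST as the quick usage example.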

Web Interface

/ - Main dashboard with links to all UI components
/docs - Interactive API documentation
/app/create - Library creation interface
/app/insert - Document insertion interface
/app/search - Search interface
/app/seed - Database seeding with sample data

Embedding Caching

The system automatically caches Cohere embeddings to reduce API calls and improve performance. Cached embeddings are stored in the .cohere_embedding directory within your storage path.
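One way such a cache could work is to store each embedding in a JSON file named after a hash of the input. The sketch below is an assumption about the general technique, not the project's embedding.py: the function `cached_embedding`, the `namespace` parameter, and the one-file-per-text layout are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cached_embedding(text, embed, cache_dir, namespace="embed-v4.0:512"):
    """Return embed(text), reusing a JSON file on disk when one exists.

    `embed` is any callable mapping a string to a sequence of floats.
    `namespace` keeps vectors from different models or sizes apart.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    key = hashlib.sha256(f"{namespace}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"

    if path.exists():
        # Cache hit: no API call needed.
        return json.loads(path.read_text(encoding="utf-8"))

    vector = list(embed(text))
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector
```

Including the model name and embedding size in the key matters here because the same text embedded at 512 and 1024 dimensions must not share a cache entry.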

Development

Project Structure

├── api.py             # FastAPI application and routes
├── embedding.py       # Cohere integration and caching
├── index.py           # Indexing strategies (Blob and HNSW)
├── model.py           # Core data models and business logic
├── settings.py        # Configuration management
├── utils.py           # Utility functions and vector operations
└── notebook/          # Marimo web applications
    ├── app-create.py
    ├── app-insert.py
    ├── app-search.py
    └── app-seed.py

Key Classes

Library: Main container for documents and configuration
Document: Represents a text document with metadata
Chunk: Individual text segment with embedding vector
IndexStrategy: Abstract interface for search algorithms
BlobIndexStrategy: Clustering-based search implementation
HNSWIndexStrategy: Graph-based search implementation
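The relationship between the abstract IndexStrategy and its two implementations can be illustrated with a small abstract base class. Everything below is a hypothetical sketch: the method names `add` and `search`, their signatures, and the brute-force `ExactIndexSketch` (which ranks by Euclidean distance as an assumed metric) are not taken from the project's code.

```python
import math
from abc import ABC, abstractmethod


class IndexStrategySketch(ABC):
    """Hypothetical interface that every index implementation would satisfy."""

    @abstractmethod
    def add(self, key, vector):
        """Store a vector under a caller-chosen key."""

    @abstractmethod
    def search(self, vector, results):
        """Return up to `results` keys, nearest first."""


class ExactIndexSketch(IndexStrategySketch):
    """Brute-force baseline: compares the query against every stored vector."""

    def __init__(self):
        self._items = {}

    def add(self, key, vector):
        self._items[key] = list(vector)

    def search(self, vector, results):
        # Assumed metric: Euclidean distance (math.dist, Python 3.8+).
        ranked = sorted(self._items, key=lambda k: math.dist(self._items[k], vector))
        return ranked[:results]
```

An exact index like this is useful as a reference when checking the recall of an approximate strategy, since both can be driven through the same two methods.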
