
Vector Database

A vector database built in pure Python (no NumPy), with FastAPI and Cohere embeddings.

Features

  • Two Indexing Strategies: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World)
  • Cohere Integration: Uses Cohere's embed-v4.0 model, with caching, for vector embeddings.
  • Flexible Chunking: Configurable text chunking with overlap.
  • RESTful API: Complete FastAPI-based REST API with automatic documentation
  • Interactive UI: Marimo-based web applications for easy database management
  • Persistent Storage: JSON-based file storage with configurable paths

Architecture

Core Concepts

  • Libraries: Collections of documents with configurable embedding and chunking settings
  • Documents: Text content that gets chunked and embedded for search
  • Chunks: Individual text segments with their vector embeddings
  • Index Strategies: Two available algorithms for efficient similarity search

Indexing Algorithms

  1. Blob Index: Hierarchical clustering with automatic splitting when clusters exceed size limits
  2. HNSW Index: Graph-based approximate nearest neighbor search for high-performance queries
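
The README does not spell out how the blob index splits or searches, so the following is only a rough sketch of the general idea, under stated assumptions: vectors are grouped into flat (not hierarchical) blobs, a new vector joins the blob whose centroid is most similar by cosine similarity, and a blob that grows past a limit is split in two. The names BlobSketch, limit, and probes are hypothetical and are not taken from the project's code.

```python
import math


def cosine(a, b):
    """Cosine similarity in pure Python; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class BlobSketch:
    """Hypothetical flat blob index: illustrative only, not the project's BlobIndexStrategy."""

    def __init__(self, limit=4):
        self.limit = limit  # assumed to play the role of index_blob_limit
        self.blobs = []     # each blob is a list of vectors

    def add(self, vector):
        if not self.blobs:
            self.blobs.append([vector])
            return
        # Join the blob whose centroid is most similar to the new vector.
        blob = max(self.blobs, key=lambda b: cosine(centroid(b), vector))
        blob.append(vector)
        if len(blob) > self.limit:
            self._split(blob)

    def _split(self, blob):
        # Seeds: the first vector and the vector least similar to it.
        seed_a = blob[0]
        seed_b = min(blob[1:], key=lambda v: cosine(seed_a, v))
        left, right = [], []
        for v in blob:
            (left if cosine(seed_a, v) >= cosine(seed_b, v) else right).append(v)
        if not right:  # degenerate case (e.g. all vectors parallel): split in half
            mid = len(left) // 2
            left, right = left[:mid], left[mid:]
        self.blobs = [b for b in self.blobs if b is not blob] + [left, right]

    def search(self, query, k=3, probes=1):
        # Scan only the `probes` blobs whose centroids best match the query.
        ranked = sorted(self.blobs, key=lambda b: cosine(centroid(b), query), reverse=True)
        candidates = [v for b in ranked[:probes] for v in b]
        return sorted(candidates, key=lambda v: cosine(v, query), reverse=True)[:k]
```

Scanning only the best-matching blobs is what makes this kind of search approximate: a true nearest neighbor sitting in a blob that was not probed is missed, so raising probes trades speed for recall.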

Installation

Set your Cohere API key in the .env environment file:

export CO_API_KEY="your-cohere-api-key"

Install dependencies with uv sync, then start the API with just api (the project uses a Justfile).

To run or edit a marimo notebook, run just marimo src/notebook/<notebook-name>.py.

Quick Usage Example:

Using the requests library, start a script with:

import requests
API_URL = "http://localhost:8000"
library_name = "my-documents"

Then, depending on what you want to do, follow the steps below.

Create a Library

# Create a new library
library_config = {
    "slug": library_name,
    "embedding_size": 512,
    "chunk_size": 256,
    "chunk_overlap": 64,
    "index": "blob"
}

response = requests.post(f"{API_URL}/library/create", json=library_config)

Inspect the Library

response = requests.get(f"{API_URL}/library/{library_name}")

Add a Document

document = {
    "slug": "sample-doc",
    "content": "Your document content goes here..."
}

requests.post(f"{API_URL}/library/{library_name}/documents", json=document)

Search the Library

query = {
    "query": "search terms",
    "results": 5
}

response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
results = response.json()
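
The calls above do not check for HTTP errors, so a failed request would only surface later when the body is parsed. Below is a minimal sketch of a wrapper around the search call that fails early instead. The helper name search_library, the post parameter, and the 30-second timeout are assumptions made for illustration; only the endpoint path and the query/results payload come from this README.

```python
from typing import Any, Callable


def search_library(
    post: Callable[..., Any],
    api_url: str,
    library_name: str,
    query: str,
    results: int = 5,
) -> Any:
    """POST a search query and return the decoded JSON body.

    `post` is any callable shaped like `requests.post` (for example
    `requests.post` itself or the `post` method of a `requests.Session`).
    """
    response = post(
        f"{api_url}/library/{library_name}/search",
        json={"query": query, "results": results},
        timeout=30,  # assumed value; avoids hanging on an unresponsive server
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()
```

With the variables defined earlier, this would be called as search_library(requests.post, API_URL, library_name, "search terms"). Passing the post callable in keeps the helper easy to exercise without a running server.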

Configuration

Library Settings

  • embedding_size: Vector dimensions (256, 512, 1024, 1536)
  • chunk_size: Maximum characters per text chunk
  • chunk_overlap: Overlap between consecutive chunks
  • index: Indexing algorithm (“blob” or “hnsw”)
  • index_blob_limit: Maximum items per blob before splitting
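
To make chunk_size and chunk_overlap concrete, here is a small sketch of character-based chunking with overlap. It assumes a plain sliding window in which each chunk starts chunk_size - chunk_overlap characters after the previous one; the project's actual chunker may differ (for example by respecting word or sentence boundaries), and the function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 256, chunk_overlap: int = 64) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With chunk_size=256 and chunk_overlap=64, each chunk begins 192 characters after the previous one, so text near a chunk boundary appears in two chunks and is less likely to be cut off from its context.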

Environment Variables

  • CO_API_KEY: Your Cohere API key (required)
  • COHERE_CACHE: Enable/disable embedding caching (default: true)
  • STORAGE_PATH: Directory for data storage (default: “./data/”)
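
As an illustration of how these variables and their defaults could be read, here is a minimal sketch. The Settings dataclass, the load_settings function, and the set of strings treated as "true" are assumptions; the project's settings.py may handle this differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    co_api_key: str
    cohere_cache: bool
    storage_path: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env

    api_key = env.get("CO_API_KEY")
    if not api_key:
        raise RuntimeError("CO_API_KEY is required")

    # Assumed parsing rule: a few common spellings count as "enabled".
    cache_raw = env.get("COHERE_CACHE", "true")
    cohere_cache = cache_raw.strip().lower() in {"1", "true", "yes", "on"}

    return Settings(
        co_api_key=api_key,
        cohere_cache=cohere_cache,
        storage_path=env.get("STORAGE_PATH", "./data/"),
    )
```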

API Endpoints

Libraries

  • GET /library/list - List all libraries
  • POST /library/create - Create a new library
  • GET /library/{slug} - Get library details

Documents

  • POST /library/{slug}/documents - Add a document
  • POST /library/{slug}/search - Search documents

Web Interface

  • / - Main dashboard with links to all UI components
  • /docs - Interactive API documentation
  • /app/create - Library creation interface
  • /app/insert - Document insertion interface
  • /app/search - Search interface
  • /app/seed - Database seeding with sample data

Embedding Caching

The system automatically caches Cohere embeddings to reduce API calls and improve performance. Cached embeddings are stored in the .cohere_embedding directory within your storage path.
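
The caching idea can be sketched as follows. This is not the project's implementation: the cache key (a SHA-256 hash of the model name, embedding size, and text), the one-JSON-file-per-entry layout, and the names cached_embed and embed_fn are all assumptions. Only the .cohere_embedding directory name and its location under the storage path come from the description above.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_embed(
    text: str,
    embedding_size: int,
    embed_fn: Callable[[str, int], list[float]],
    storage_path: str = "./data/",
) -> list[float]:
    """Return the embedding for text, calling embed_fn only on a cache miss."""
    cache_dir = Path(storage_path) / ".cohere_embedding"
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Assumed key: the same text at a different size must not share an entry.
    raw_key = f"embed-v4.0:{embedding_size}:{text}".encode("utf-8")
    cache_file = cache_dir / f"{hashlib.sha256(raw_key).hexdigest()}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())

    vector = embed_fn(text, embedding_size)  # e.g. a wrapper around the Cohere client
    cache_file.write_text(json.dumps(vector))
    return vector
```

Including the embedding size in the key matters because libraries can be created with different embedding_size values, and a vector cached at one size would be wrong for another.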

Development

Project Structure

src/
├── api.py             # FastAPI application and routes
├── embedding.py       # Cohere integration and caching
├── index.py           # Indexing strategies (Blob and HNSW)
├── model.py           # Core data models and business logic
├── settings.py        # Configuration management
├── utils.py           # Utility functions and vector operations
└── notebook/          # Marimo web applications
    ├── app-create.py
    ├── app-insert.py
    ├── app-search.py
    └── app-seed.py

Key Classes

  • Library: Main container for documents and configuration
  • Document: Represents a text document with metadata
  • Chunk: Individual text segment with embedding vector
  • IndexStrategy: Abstract interface for search algorithms
  • BlobIndexStrategy: Clustering-based search implementation
  • HNSWIndexStrategy: Graph-based search implementation
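
To show how an abstract search interface with interchangeable implementations might look, here is a hedged sketch. The method names add and search, their signatures, and the class names IndexStrategySketch and BruteForceSketch are hypothetical and not taken from index.py. The brute-force class is an exact-search baseline for comparison, not one of the two strategies this project ships.

```python
import math
from abc import ABC, abstractmethod


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity in pure Python; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IndexStrategySketch(ABC):
    """Hypothetical interface: every strategy can add vectors and answer top-k queries."""

    @abstractmethod
    def add(self, chunk_id: str, vector: list[float]) -> None: ...

    @abstractmethod
    def search(self, query: list[float], k: int) -> list[tuple[str, float]]: ...


class BruteForceSketch(IndexStrategySketch):
    """Exact search: scores every stored vector against the query."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        scored = [(chunk_id, cosine(vector, query)) for chunk_id, vector in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]
```

A baseline like this is useful when evaluating an approximate index: running the same queries through both and comparing the returned IDs gives a rough measure of recall.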