
Corrected readme.

Miguel Salgado 2025-05-30 15:00:59 -07:00
parent 19f335e300
commit efb3ea003c
3 changed files with 133 additions and 93 deletions

Justfile

@ -1,6 +1,10 @@
api:
    uv run --env-file .env fastapi dev src/api.py

marimo NOTEBOOK:
    uv run --env-file .env marimo edit {{NOTEBOOK}}

docker:
    docker compose build
    docker compose up
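The `api` recipe above launches the vector-search service. As a minimal, dependency-free sketch of the cosine-similarity k-nearest-neighbour search such a service performs (function names are illustrative, not taken from the project's source):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cosine(a, b) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query: list[float], chunks: dict[str, list[float]], k: int) -> list[str]:
    # Brute-force kNN: score every chunk against the query, keep the top k slugs.
    ranked = sorted(chunks, key=lambda slug: cosine_similarity(query, chunks[slug]), reverse=True)
    return ranked[:k]
```

Brute force is O(n·d) per query; the blob and HNSW indexes exist to avoid scoring every chunk.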

README.md

@ -1,130 +1,168 @@
-# Take-at-Home Task - Backend (Vector DB)
-
-Congrats on making it thus far in the interview process!
-Here is a task for you to show us where you shine the most 🙂
-The purpose is not to see how fast you go or what magic tricks you know in Python; it's mostly to understand how clearly you think and code.
-If you think clearly and your code is clean, you are better than 90% of applicants already!
-
-> ⚠ Feel free to use Cursor, but use it where it makes sense. Don't overuse it: it introduces bugs, is super verbose, and is not really pythonic.
-
-## Objective
-
-The goal of this project is to develop a REST API that allows users to **index** and **query** their documents within a Vector Database.
-A Vector Database specializes in storing and indexing vector embeddings, enabling fast retrieval and similarity searches. This capability is crucial for applications involving natural language processing, recommendation systems, and many more…
-The REST API should be containerized in a Docker container.
-
-### Definitions
-
-To ensure a clear understanding, let's define some key concepts:
-
-1. Chunk: a piece of text with an associated embedding and metadata.
-2. Document: made up of multiple chunks; it also contains metadata.
-3. Library: made up of a list of documents; it can also contain other metadata.
-
-The API should:
-
-1. Allow the users to create, read, update, and delete libraries.
-2. Allow the users to create, read, update, and delete chunks within a library.
-3. Index the contents of a library.
-4. Do **k-Nearest Neighbor vector search** over the selected library with a given embedding query.
-
-### Guidelines:
-
-The code should be **Python**, since that is what we use to develop our backend.
-Here is a suggested path for implementing a basic solution to the problem.
-
-1. Define the Chunk, Document, and Library classes. To simplify schema definition, we suggest you use a fixed schema for each of the classes. This means not letting the user define which fields should be present within the metadata for each class.
-2. Implement two or three indexing algorithms; do not use external libraries, we want to see you code them up.
-    1. What is the space and time complexity for each of the indexes?
-    2. Why did you choose this index?
-3. Implement the necessary data structures/algorithms to ensure that there are no data races between reads and writes to the database.
-    1. Explain your design choices.
-4. Create the logic to do the CRUD operations on libraries and documents/chunks.
-    1. Ideally use Services to decouple API endpoints from the actual work.
-5. Implement an API layer on top of that logic to let users interact with the vector database.
-6. Create a Docker image for the project.
-
-### Extra Points:
-
-Here are some additional suggestions on how to enhance the project even further. You are not required to implement any of these, but if you do, we will value it. If you have other improvements in mind, please feel free to implement them and document them in the project's README file.
-
-1. **Metadata filtering:**
-    - Add the possibility of using metadata filters to enhance query results, e.g. do kNN search over all chunks created after a given date, whose name contains a given string, etc.
-2. **Persistence to Disk**:
-    - Implement a mechanism to persist the database state to disk, ensuring that the Docker container can be restarted and resume its operation from the last checkpoint. Explain your design choices and tradeoffs, considering factors like performance, consistency, and durability.
-3. **Leader-Follower Architecture**:
-    - Design and implement a leader-follower (master-slave) architecture to support multiple database nodes within the Kubernetes cluster. This architecture should handle read scalability and provide high availability. Explain how leader election, data replication, and failover are managed, along with the benefits and tradeoffs of this approach.
-4. **Python SDK Client**:
-    - Develop a Python SDK client that interfaces with your API, making it easier for users to interact with the vector database programmatically. Include documentation and examples.
-
-## Constraints
-
-Do **not** use libraries like chroma-db, pinecone, FAISS, etc. to develop the project; we want to see you write the algorithms yourself. You can use numpy to calculate trigonometry functions (`cos`, `sin`, etc.).
-You **do not need to build a document processing pipeline** (OCR + text extraction + chunking) to test your system. Using a bunch of manually created chunks will suffice.
-
-## **Tech Stack**
-
-- **API Backend:** Python + FastAPI + Pydantic
-
-## Resources:
-
-[Cohere](https://cohere.com/embeddings) API key to create the embeddings for your test.
-
-## Evaluation Criteria
-
-We will evaluate the code functionality and its quality.
-
-**Code quality:**
-
-- [SOLID design principles](https://realpython.com/solid-principles-python/).
-- Use of static typing.
-- FastAPI good practices.
-- Pydantic schema validation.
-- Code modularity and reusability.
-- Use of RESTful API endpoints.
-- Project containerization with Docker.
-- Testing.
-- Error handling.
-- If you know what Domain-Driven Design is, do it that way!
-- Separate API endpoints from business logic using services, and from databases using repositories.
-- Keep code as pythonic as possible.
-- Do early returns.
-- Use inheritance where needed.
-- Use composition over inheritance.
-
-**Functionality:**
-
-- Does everything work as expected?
-
-## Deliverable
-
-1. **Source Code**: A link to a GitHub repository containing all your source code.
-2. **Documentation**: A README file that documents the task, explains your technical choices, how to run the project locally, and any other relevant information.
-3. **Demo video:**
-    1. A screen recording where you show how to install the project and interact with it in real time.
-    2. A screen recording of your design with an explanation of your design choices and thoughts/problem-solving.
-
-## Timeline
-
-As a reference, this task should take at most **4 days** (96h) from the receipt of this test to submitting your deliverables 🚀
-But honestly, if you think you can do a much better job with some extra days (perhaps because you couldn't spend too many hours), be our guest!
-At the end of the day, if it is not going to impress the team, it's not going to fly, so give it your best shot ✈️
-
-## Questions
-
-Feel free to reach out at any time with questions about the task, particularly if you encounter problems outside your control that may block your progress.
+# Vector Database
+
+A vector database built in pure Python (no NumPy), with FastAPI and Cohere embeddings.
+
+## Features
+
+- **Two Indexing Strategies**: Choose between blob-based clustering and HNSW (Hierarchical Navigable Small World).
+- **Cohere Integration**: Cohere's embed-v4.0 model, with caching, for vector embeddings.
+- **Flexible Chunking**: Configurable text chunking with overlap.
+- **RESTful API**: Complete FastAPI-based REST API with automatic documentation.
+- **Interactive UI**: Marimo-based web applications for easy database management.
+- **Persistent Storage**: JSON-based file storage with configurable paths.
+
+## Architecture
+
+### Core Concepts
+
+- **Libraries**: Collections of documents with configurable embedding and chunking settings.
+- **Documents**: Text content that gets chunked and embedded for search.
+- **Chunks**: Individual text segments with their vector embeddings.
+- **Index Strategies**: Two available algorithms for efficient similarity search.
+
+### Indexing Algorithms
+
+1. **Blob Index**: Hierarchical clustering with automatic splitting when clusters exceed size limits.
+2. **HNSW Index**: Graph-based approximate nearest neighbor search for high-performance queries.
+
+## Installation
+
+Set the Cohere API key in the environment variable file `.env`:
+
+```bash
+export CO_API_KEY="your-cohere-api-key"
+```
+
+To install, run `uv sync`; to run the API, run `just api` (we are using a Justfile).
+To run or modify the marimo notebooks, run `just marimo src/notebook/<notebook-name>.py`.
+
+## Quick Usage Example:
+
+Using the `requests` library, start a script with:
+
+```python
+import requests
+
+API_URL = "http://localhost:8000"
+library_name = "my-documents"
+```
+
+Then, depending on what we want to do, we can follow the steps below.
+
+#### Create A Library
+
+```python
+# Create a new library
+library_config = {
+    "slug": library_name,
+    "embedding_size": 512,
+    "chunk_size": 256,
+    "chunk_overlap": 64,
+    "index": "blob"
+}
+response = requests.post(f"{API_URL}/library/create", json=library_config)
+```
+
+#### Inspect the library
+
+```python
+response = requests.get(f"{API_URL}/library/{library_name}")
+```
+
+#### Add a Document
+
+```python
+document = {
+    "slug": "sample-doc",
+    "content": "Your document content goes here..."
+}
+requests.post(f"{API_URL}/library/{library_name}/documents", json=document)
+```
+
+#### Do a Search
+
+```python
+# Search the library
+query = {
+    "query": "search terms",
+    "results": 5
+}
+response = requests.post(f"{API_URL}/library/{library_name}/search", json=query)
+results = response.json()
+```
+
+## Configuration
+
+### Library Settings
+
+- **embedding_size**: Vector dimensions (256, 512, 1024, or 1536)
+- **chunk_size**: Maximum characters per text chunk
+- **chunk_overlap**: Overlap between consecutive chunks
+- **index**: Indexing algorithm ("blob" or "hnsw")
+- **index_blob_limit**: Maximum items per blob before splitting
+
+### Environment Variables
+
+- **CO_API_KEY**: Your Cohere API key (required)
+- **COHERE_CACHE**: Enable/disable embedding caching (default: true)
+- **STORAGE_PATH**: Directory for data storage (default: "./data/")
+
+## API Endpoints
+
+### Libraries
+
+- `GET /library/list` - List all libraries
+- `POST /library/create` - Create a new library
+- `GET /library/{slug}` - Get library details
+
+### Documents
+
+- `POST /library/{slug}/documents` - Add a document
+- `POST /library/{slug}/search` - Search documents
+
+### Web Interface
+
+- `/` - Main dashboard with links to all UI components
+- `/docs` - Interactive API documentation
+- `/app/create` - Library creation interface
+- `/app/insert` - Document insertion interface
+- `/app/search` - Search interface
+- `/app/seed` - Database seeding with sample data
+
+### Embedding Caching
+
+The system automatically caches Cohere embeddings to reduce API calls and improve performance. Cached embeddings are stored in the `.cohere_embedding` directory within your storage path.
+
+## Development
+
+### Project Structure
+
+```
+├── api.py         # FastAPI application and routes
+├── embedding.py   # Cohere integration and caching
+├── index.py       # Indexing strategies (Blob and HNSW)
+├── model.py       # Core data models and business logic
+├── settings.py    # Configuration management
+├── utils.py       # Utility functions and vector operations
+└── notebook/      # Marimo web applications
+    ├── app-create.py
+    ├── app-insert.py
+    ├── app-search.py
+    └── app-seed.py
+```
+
+### Key Classes
+
+- **Library**: Main container for documents and configuration
+- **Document**: Represents a text document with metadata
+- **Chunk**: Individual text segment with embedding vector
+- **IndexStrategy**: Abstract interface for search algorithms
+- **BlobIndexStrategy**: Clustering-based search implementation
+- **HNSWIndexStrategy**: Graph-based search implementation
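The README's `chunk_size`/`chunk_overlap` settings describe a sliding-window text splitter. A minimal sketch of that idea, assuming character-based windows (the helper name is hypothetical, not the project's actual `model.py` code):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Slide a chunk_size-character window over the text, stepping back
    # chunk_overlap characters each time so consecutive chunks share context.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=256` and `chunk_overlap=64` (the defaults in the usage example), each chunk repeats the last 64 characters of its predecessor, so a sentence split across a boundary still appears whole in at least one chunk.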

settings.py

@ -7,8 +7,6 @@ class Settings(BaseSettings):
     co_api_key: SecretStr
     cohere_cache: bool = True
     storage_path: Path = Path("./data/")
-    fastapi_port: int = 8000
-    fastapi_host: str = "localhost"
 
 settings = Settings()
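`Settings` above relies on pydantic's `BaseSettings`, which populates each field from a matching environment variable (matched case-insensitively) and raises a validation error when a required variable such as `CO_API_KEY` is missing. A rough stdlib-only illustration of that behaviour, not the project's code:

```python
import os
from collections.abc import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    # co_api_key has no default, so a missing CO_API_KEY raises KeyError,
    # mirroring pydantic's "field required" validation error.
    return {
        "co_api_key": env["CO_API_KEY"],
        "cohere_cache": env.get("COHERE_CACHE", "true").lower() in ("1", "true", "yes"),
        "storage_path": env.get("STORAGE_PATH", "./data/"),
    }
```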