Self-Hosted LLM APIs
Project: Custom LLM Inference Engine • Type: R&D / Internal Product • Status: In development
The Problem
Companies using OpenAI/Anthropic APIs face three concerns:
- Cost volatility — providers can raise prices anytime
- Data privacy — your prompts and data flow through third-party systems
- Control — operational consistency depends on someone else's infrastructure
The standard self-hosted alternatives (vLLM, llama.cpp server) are either Python-based or bring their own operational complexity.
The Approach
ML should be an engineering concern like databases or caching. No Python. Binary deployment. Custom GPU kernels.
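A minimal sketch of the shape this takes: a plain Go HTTP service compiled to a single binary, with inference behind an ordinary function call. The /v1/generate route, the request fields, and the generate stub are illustrative assumptions here, not the actual API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type generateRequest struct {
	Prompt    string `json:"prompt"`
	MaxTokens int    `json:"max_tokens"`
}

type generateResponse struct {
	Text string `json:"text"`
}

// generate stands in for the native inference call; in the real system this
// crosses into llama.cpp via CGo rather than returning a placeholder.
func generate(prompt string, maxTokens int) string {
	return "<completion for: " + prompt + ">"
}

func main() {
	http.HandleFunc("/v1/generate", func(w http.ResponseWriter, r *http.Request) {
		var req generateRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(generateResponse{Text: generate(req.Prompt, req.MaxTokens)})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

No Python interpreter, no model server sidecar: `go build` produces the deployable artifact.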
What We're Building
- GoLang API server with CGo bindings to llama.cpp (binding layer sketched after this list)
- Custom binary tokenizer (written from scratch)
- Custom GPU kernels (Metal first, portable to CUDA):
  - Concurrent Q/K/V dispatch
  - Fused QKV projection
  - Quantized KV cache (Q8, sketched below)
  - Function constant specialization
- Dynamic KV cache management (vs static preallocation in vLLM/llama.cpp; allocator sketch below)
- Batched attention with continuous batching (scheduler sketch below)
- Single binary deployment
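The CGo binding layer is essentially a thin boundary between Go and the native runtime. A sketch, assuming a hypothetical hand-written C wrapper (`bridge.h` / `bridge_generate`) compiled against llama.cpp; these names and the linker flags are illustrative, since llama.cpp's own C API varies by version and the C++ surface is not called from Go directly.

```go
package inference

/*
#cgo LDFLAGS: -L${SRCDIR}/../native -lbridge
// bridge.h is a hypothetical thin C wrapper built alongside llama.cpp,
// exposing a plain C ABI; real linker flags depend on the build and backend.
#include <stdlib.h>
#include "bridge.h"
*/
import "C"

import (
	"errors"
	"unsafe"
)

// Generate runs a blocking completion through the native runtime.
func Generate(prompt string, maxTokens int) (string, error) {
	cPrompt := C.CString(prompt)
	defer C.free(unsafe.Pointer(cPrompt))

	// bridge_generate (hypothetical) returns a malloc'd C string or NULL.
	cOut := C.bridge_generate(cPrompt, C.int(maxTokens))
	if cOut == nil {
		return "", errors.New("inference failed")
	}
	defer C.free(unsafe.Pointer(cOut))

	return C.GoString(cOut), nil
}
```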
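The quantized KV cache (Q8) comes down to per-block symmetric int8 quantization with one scale per block. A Go sketch of that arithmetic, as a reference for what the Metal kernel computes; block layout and rounding are illustrative assumptions.

```go
package quant

import "math"

// quantizeQ8 symmetrically quantizes one block of K/V values to int8,
// returning the codes plus the single per-block scale needed to dequantize.
func quantizeQ8(block []float32) (codes []int8, scale float32) {
	var amax float32
	for _, v := range block {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	codes = make([]int8, len(block))
	if amax == 0 {
		return codes, 0
	}
	scale = amax / 127
	for i, v := range block {
		q := float64(v / scale)
		if q > 127 {
			q = 127
		} else if q < -127 {
			q = -127
		}
		codes[i] = int8(math.Round(q))
	}
	return codes, scale
}

// dequantizeQ8 reconstructs approximate float values from codes and the scale.
func dequantizeQ8(codes []int8, scale float32) []float32 {
	out := make([]float32, len(codes))
	for i, c := range codes {
		out[i] = float32(c) * scale
	}
	return out
}
```

Halving KV cache memory this way is what lets more sequences stay resident at once.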
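Dynamic KV cache management, sketched as a block pool that grows each sequence's cache on demand and returns blocks when the sequence finishes. The block size, indices, and `Pool` type are illustrative; the real allocator manages GPU buffers.

```go
package kvcache

import "errors"

const blockTokens = 64 // tokens of K/V state per block (illustrative)

// Pool hands out fixed-size KV blocks from a free list instead of
// preallocating max-context worth of cache for every sequence.
type Pool struct {
	free []int         // indices of unused blocks
	seqs map[int][]int // sequence ID -> blocks currently owned
}

func NewPool(totalBlocks int) *Pool {
	p := &Pool{free: make([]int, totalBlocks), seqs: map[int][]int{}}
	for i := range p.free {
		p.free[i] = i
	}
	return p
}

// Extend reserves enough blocks for a sequence to hold newTotalTokens.
func (p *Pool) Extend(seqID, newTotalTokens int) error {
	need := (newTotalTokens + blockTokens - 1) / blockTokens
	for len(p.seqs[seqID]) < need {
		if len(p.free) == 0 {
			return errors.New("kv cache exhausted; caller should evict or queue")
		}
		b := p.free[len(p.free)-1]
		p.free = p.free[:len(p.free)-1]
		p.seqs[seqID] = append(p.seqs[seqID], b)
	}
	return nil
}

// Release returns a finished sequence's blocks to the pool.
func (p *Pool) Release(seqID int) {
	p.free = append(p.free, p.seqs[seqID]...)
	delete(p.seqs, seqID)
}
```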
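Continuous batching, sketched as a decode loop that admits newly arrived requests into the in-flight batch at every step instead of waiting for the current batch to drain. The types, channel, and step callback are illustrative.

```go
package scheduler

// request tracks one in-flight generation; remaining is tokens left to emit.
type request struct {
	id        int
	remaining int
}

// admit pulls waiting requests into the batch without blocking the decode loop.
func admit(batch []*request, incoming <-chan *request, maxBatch int) []*request {
	for len(batch) < maxBatch {
		select {
		case r := <-incoming:
			batch = append(batch, r)
		default:
			return batch
		}
	}
	return batch
}

// run is the continuous-batching decode loop: every iteration admits new
// requests, runs one batched decode step, and retires finished sequences,
// so short and long generations share the GPU without head-of-line blocking.
func run(incoming <-chan *request, maxBatch int, step func(batch []*request)) {
	var batch []*request
	for {
		batch = admit(batch, incoming, maxBatch)
		if len(batch) == 0 {
			batch = append(batch, <-incoming) // idle: block until work arrives
		}

		step(batch) // one forward pass producing one token per active sequence

		// Drop sequences that have emitted their final token.
		active := batch[:0]
		for _, r := range batch {
			r.remaining--
			if r.remaining > 0 {
				active = append(active, r)
			}
		}
		batch = active
	}
}
```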
Why Custom Kernels
The existing frameworks make tradeoffs we don't want: static cache preallocation, Python runtimes, and the operational overhead that comes with them. We're optimizing for dynamic memory management and operational simplicity.
R&D. Not yet deployed to production. Available for early adopter engagements.
Technologies: GoLang, CGo, Metal, CUDA
