Self-Hosted LLM APIs
Project: Custom LLM Inference Engine • Type: R&D / Internal Product • Status: In development
The Problem
Companies using OpenAI/Anthropic APIs face three concerns:
- Cost volatility — providers can raise prices anytime
- Data privacy — your prompts and data flow through third-party systems
- Control — operational consistency depends on someone else's infrastructure
The standard self-hosted alternatives (vLLM, llama.cpp server) are either Python-based or bring their own operational complexity.
The Approach
ML should be an engineering concern like databases or caching. No Python. Binary deployment. Custom GPU kernels.
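A minimal sketch of the shape this takes: a plain Go HTTP service compiled to a single binary, with inference behind an ordinary function call. The /v1/generate route, the request fields, and the generate stub are illustrative assumptions here, not the actual API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type generateRequest struct {
	Prompt    string `json:"prompt"`
	MaxTokens int    `json:"max_tokens"`
}

type generateResponse struct {
	Text string `json:"text"`
}

// generate stands in for the native inference call; in the real system this
// crosses into llama.cpp via CGo rather than returning a placeholder.
func generate(prompt string, maxTokens int) string {
	return "<completion for: " + prompt + ">"
}

func main() {
	http.HandleFunc("/v1/generate", func(w http.ResponseWriter, r *http.Request) {
		var req generateRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(generateResponse{Text: generate(req.Prompt, req.MaxTokens)})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

No Python interpreter, no model server sidecar: `go build` produces the deployable artifact.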
What We're Building
- GoLang API server with CGo bindings to llama.cpp (binding layer sketched after this list)
- Custom binary tokenizer (written from scratch)
- Custom GPU kernels (Metal first, portable to CUDA):
  - Concurrent Q/K/V dispatch
  - Fused QKV projection
  - Quantized KV cache (Q8, sketched below)
  - Function constant specialization
- Dynamic KV cache management (vs static preallocation in vLLM/llama.cpp; allocator sketch below)
- Batched attention with continuous batching (scheduler sketch below)
- Single binary deployment
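The CGo binding layer is essentially a thin boundary between Go and the native runtime. A sketch, assuming a hypothetical hand-written C wrapper (`bridge.h` / `bridge_generate`) compiled against llama.cpp; these names and the linker flags are illustrative, since llama.cpp's own C API varies by version and the C++ surface is not called from Go directly.

```go
package inference

/*
#cgo LDFLAGS: -L${SRCDIR}/../native -lbridge
// bridge.h is a hypothetical thin C wrapper built alongside llama.cpp,
// exposing a plain C ABI; real linker flags depend on the build and backend.
#include <stdlib.h>
#include "bridge.h"
*/
import "C"

import (
	"errors"
	"unsafe"
)

// Generate runs a blocking completion through the native runtime.
func Generate(prompt string, maxTokens int) (string, error) {
	cPrompt := C.CString(prompt)
	defer C.free(unsafe.Pointer(cPrompt))

	// bridge_generate (hypothetical) returns a malloc'd C string or NULL.
	cOut := C.bridge_generate(cPrompt, C.int(maxTokens))
	if cOut == nil {
		return "", errors.New("inference failed")
	}
	defer C.free(unsafe.Pointer(cOut))

	return C.GoString(cOut), nil
}
```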
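The quantized KV cache (Q8) comes down to per-block symmetric int8 quantization with one scale per block. A Go sketch of that arithmetic, as a reference for what the Metal kernel computes; block layout and rounding are illustrative assumptions.

```go
package quant

import "math"

// quantizeQ8 symmetrically quantizes one block of K/V values to int8,
// returning the codes plus the single per-block scale needed to dequantize.
func quantizeQ8(block []float32) (codes []int8, scale float32) {
	var amax float32
	for _, v := range block {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	codes = make([]int8, len(block))
	if amax == 0 {
		return codes, 0
	}
	scale = amax / 127
	for i, v := range block {
		q := float64(v / scale)
		if q > 127 {
			q = 127
		} else if q < -127 {
			q = -127
		}
		codes[i] = int8(math.Round(q))
	}
	return codes, scale
}

// dequantizeQ8 reconstructs approximate float values from codes and the scale.
func dequantizeQ8(codes []int8, scale float32) []float32 {
	out := make([]float32, len(codes))
	for i, c := range codes {
		out[i] = float32(c) * scale
	}
	return out
}
```

Halving KV cache memory this way is what lets more sequences stay resident at once.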
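Dynamic KV cache management, sketched as a block pool that grows each sequence's cache on demand and returns blocks when the sequence finishes. The block size, indices, and `Pool` type are illustrative; the real allocator manages GPU buffers.

```go
package kvcache

import "errors"

const blockTokens = 64 // tokens of K/V state per block (illustrative)

// Pool hands out fixed-size KV blocks from a free list instead of
// preallocating max-context worth of cache for every sequence.
type Pool struct {
	free []int         // indices of unused blocks
	seqs map[int][]int // sequence ID -> blocks currently owned
}

func NewPool(totalBlocks int) *Pool {
	p := &Pool{free: make([]int, totalBlocks), seqs: map[int][]int{}}
	for i := range p.free {
		p.free[i] = i
	}
	return p
}

// Extend reserves enough blocks for a sequence to hold newTotalTokens.
func (p *Pool) Extend(seqID, newTotalTokens int) error {
	need := (newTotalTokens + blockTokens - 1) / blockTokens
	for len(p.seqs[seqID]) < need {
		if len(p.free) == 0 {
			return errors.New("kv cache exhausted; caller should evict or queue")
		}
		b := p.free[len(p.free)-1]
		p.free = p.free[:len(p.free)-1]
		p.seqs[seqID] = append(p.seqs[seqID], b)
	}
	return nil
}

// Release returns a finished sequence's blocks to the pool.
func (p *Pool) Release(seqID int) {
	p.free = append(p.free, p.seqs[seqID]...)
	delete(p.seqs, seqID)
}
```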
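Continuous batching, sketched as a decode loop that admits newly arrived requests into the in-flight batch at every step instead of waiting for the current batch to drain. The types, channel, and step callback are illustrative.

```go
package scheduler

// request tracks one in-flight generation; remaining is tokens left to emit.
type request struct {
	id        int
	remaining int
}

// admit pulls waiting requests into the batch without blocking the decode loop.
func admit(batch []*request, incoming <-chan *request, maxBatch int) []*request {
	for len(batch) < maxBatch {
		select {
		case r := <-incoming:
			batch = append(batch, r)
		default:
			return batch
		}
	}
	return batch
}

// run is the continuous-batching decode loop: every iteration admits new
// requests, runs one batched decode step, and retires finished sequences,
// so short and long generations share the GPU without head-of-line blocking.
func run(incoming <-chan *request, maxBatch int, step func(batch []*request)) {
	var batch []*request
	for {
		batch = admit(batch, incoming, maxBatch)
		if len(batch) == 0 {
			batch = append(batch, <-incoming) // idle: block until work arrives
		}

		step(batch) // one forward pass producing one token per active sequence

		// Drop sequences that have emitted their final token.
		active := batch[:0]
		for _, r := range batch {
			r.remaining--
			if r.remaining > 0 {
				active = append(active, r)
			}
		}
		batch = active
	}
}
```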
Why Custom Kernels
The existing frameworks make tradeoffs we don't want: static cache preallocation, Python runtimes, and the operational overhead that comes with them. We're optimizing for dynamic memory management and operational simplicity.
R&D. Not yet deployed to production. Available for early adopter engagements.
Technologies: GoLang, CGo, Metal, CUDA
