
1.4 EXAFLOP CLUSTER
POWERING 1.2T TOKENS PER DAY
CHAI has seen demand for its AI models grow exponentially, outpacing the capacity of off-the-shelf inference providers. Out of necessity, we verticalized and brought inference in-house. Starting in 2023 with a small cluster of A5000s rented on-demand from CoreWeave, we have grown to thousands of GPUs spread across four regions. Multi-region inference brings its own challenges and has pushed CHAI to the cutting edge of serving technology.
[Chart: tokens processed per day]
1.4 EXAFLOPS GPU CLUSTER
FOR AI INFERENCE
At CHAI, we serve hundreds of in-house-trained LLMs across several GPU types from both AMD and Nvidia. While open-source solutions such as vLLM work well for simple workloads, we've found we can improve on vLLM by almost an order of magnitude through optimizations such as custom kernels and compute-efficient attention approximations.