
1.4 EXAFLOP CLUSTER
POWERING 1.2T TOKENS PER DAY
CHAI has seen demand for its AI models grow exponentially, outpacing the capacity of off-the-shelf inference providers. Out of necessity, we verticalized and brought inference in-house. Starting in 2023 with a small cluster of A5000s rented on-demand from CoreWeave, we have grown to thousands of GPUs spread across four regions. Multi-region inference brings its own challenges and has pushed CHAI to the cutting edge of serving technology.
[Chart: tokens processed per day]
1.4 EXAFLOPS GPU CLUSTER
FOR AI INFERENCE
At CHAI, we serve hundreds of in-house-trained LLMs across several GPU types from both AMD and Nvidia. While open-source solutions such as vLLM work well for simple workloads, we've found we can improve on vLLM by almost an order of magnitude through optimizations such as custom kernels and compute-efficient attention approximations.