# Feature: Precise Prefix Cache Aware Routing

## Overview

This is a simple quickstart demonstrating how to configure the inference scheduler to use the new precise prefix cache aware routing based on vLLM KV-Events data. Precise prefix cache aware routing pulls up-to-date prefix cache status from serving instances, eliminating the need for additional indexing services and increasing the cache hit rate at high throughput.
## Pre-requisites

- It is assumed that you have the proper tools installed on your local system to use this quickstart. To see what those tools are and their minimum versions, check our docs; to install them, see our install-deps.sh script.
- You must have a secret containing a HuggingFace token, with the key `HF_TOKEN`, in the namespace you want to deploy to (see instructions).
- Additionally, it is assumed you have configured and deployed your Gateway control plane and its pre-requisite CRDs. For information on this, see the gateway-control-plane-providers directory.
## Installation

Use the helmfile to compose and install the stack. The namespace in which the stack will be deployed is derived from the `${NAMESPACE}` environment variable. If you have not set this, it defaults to `llm-d-precise` in this example.

```bash
export NAMESPACE=llm-d-precise # Or any namespace your heart desires
cd quickstart/examples/precise-prefix-cache-aware
helmfile apply -n ${NAMESPACE}
```
NOTE: You can set the `$RELEASE_NAME_POSTFIX` env variable to change the release names. This is how we support concurrent installs. Ex: `RELEASE_NAME_POSTFIX=kv-events-2 helmfile apply -n ${NAMESPACE}`
NOTE: This uses Istio as the default provider; see Gateway options for installing with a specific provider.
## Gateway options

To specify your gateway choice, you can use the `-e <gateway option>` flag, e.g.:

```bash
helmfile apply -e kgateway -n ${NAMESPACE}
```
To see what gateway options are supported refer to our gateway control plane docs. Gateway configurations per provider are tracked in the gateway-configurations directory.
You can also customize your gateway; for more information on how to do that, see our gateway customization docs.
## Verify the Installation

- Firstly, you should be able to list all Helm releases and see that the three charts were installed into your chosen namespace:

```bash
helm list -n ${NAMESPACE}
```

```
NAME              NAMESPACE       REVISION   UPDATED                                STATUS     CHART                       APP VERSION
gaie-kv-events    llm-d-precise   1          2025-08-24 12:05:31.484748 -0700 PDT   deployed   inferencepool-v0.5.1        v0.5.1
infra-kv-events   llm-d-precise   1          2025-08-24 12:05:27.485812 -0700 PDT   deployed   llm-d-infra-v1.3.0          v0.3.0
ms-kv-events      llm-d-precise   1          2025-08-24 12:05:37.660439 -0700 PDT   deployed   llm-d-modelservice-v0.2.7   v0.2.0
```
- Out of the box with this example you should have the following resources:

```bash
kubectl get all -n ${NAMESPACE}
```

```
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/gaie-kv-events-epp-687b78968b-wvswh                       1/1     Running   0          80s
pod/infra-kv-events-inference-gateway-istio-949d87f84-zvsp2   1/1     Running   0          85s
pod/ms-kv-events-llm-d-modelservice-decode-b874d48d9-bgm5r    2/2     Running   0          75s
pod/ms-kv-events-llm-d-modelservice-decode-b874d48d9-ph64c    2/2     Running   0          75s

NAME                                              TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                        AGE
service/gaie-kv-events-epp                        ClusterIP      10.16.2.44   <none>        9002/TCP,9090/TCP,5557/TCP     81s
service/gaie-kv-events-ip-805c964d                ClusterIP      None         <none>        54321/TCP                      75s
service/infra-kv-events-inference-gateway-istio   LoadBalancer   10.16.1.30   10.16.4.2     15021:32033/TCP,80:39332/TCP   86s

NAME                                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gaie-kv-events-epp                        1/1     1            1           81s
deployment.apps/infra-kv-events-inference-gateway-istio   1/1     1            1           86s
deployment.apps/ms-kv-events-llm-d-modelservice-decode    2/2     2            2           76s

NAME                                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/gaie-kv-events-epp-687b78968b                       1         1         1       81s
replicaset.apps/infra-kv-events-inference-gateway-istio-949d87f84   1         1         1       86s
replicaset.apps/ms-kv-events-llm-d-modelservice-decode-b874d48d9    2         2         2       76s
```
NOTE: This assumes no other quickstart deployments in your given `${NAMESPACE}`, and that you have not changed the default release names via the `${RELEASE_NAME_POSTFIX}` environment variable.
## Testing this "well lit path"

We have docs on getting started with sending inference requests available here that are general to all examples. However, this example has unique instructions for interacting with it, which are provided here:
- First, you will need to send a basic inference request to your gateway. For in-depth documentation on how to do this, please see the link above, but the following command will work out of the box with default settings:

```bash
kubectl port-forward -n ${NAMESPACE} service/infra-kv-events-inference-gateway-istio 8000:80
```

```bash
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "Hello, how are you?",
    "max_tokens": 1
  }' | jq
```
- Check the inference scheduler's prefix-cache-scorer scores with the following command:

```bash
kubectl logs -l inferencepool=gaie-kv-events-epp -n ${NAMESPACE} --tail 100 | grep "Got pod scores"
```

You should see output similar to:

```
2025-08-24T19:19:16Z  LEVEL(-4)  prefix-cache-scorer/prefix-cache-scorer  scorer/prefix_cache_tracking.go:125  Got pod scores  {"x-request-id": "28b10175-d1f3-45c4-b970-a13dfc6811e3", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": null}
```
- Repeat the previous two steps to see the prefix-cache-scorer in action. You should see output similar to:

```
2025-08-24T19:41:23Z  LEVEL(-4)  prefix-cache-scorer/prefix-cache-scorer  scorer/prefix_cache_tracking.go:125  Got pod scores  {"x-request-id": "4d3b41fe-e95e-4628-b6f9-c7b5b20ea69f", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": null}
2025-08-24T19:41:46Z  LEVEL(-4)  prefix-cache-scorer/prefix-cache-scorer  scorer/prefix_cache_tracking.go:125  Got pod scores  {"x-request-id": "6db977c6-96aa-482d-ab89-ad0e114d71d5", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": null}
```
NOTE: These logs will only appear for unique requests, so if you don't see new instances of these logs, make sure each request you send is unique in some way.
Notice that the second time we called the `/v1/completions` endpoint, the prefix-cache-scorer was able to return a score for the pod, indicating that it had cached the KV-blocks from the first call.
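The two calls above can be scripted; a minimal sketch, assuming the port-forward from the first step is still running. With `DRY_RUN=1` (the default here) the script only prints the requests instead of sending them:

```bash
#!/bin/sh
# Sketch: send the identical prompt twice so the second request can score a
# pod that cached the KV-blocks from the first. DRY_RUN=1 (the default here)
# only prints the requests; set DRY_RUN=0 once the port-forward is running.
DRY_RUN=${DRY_RUN:-1}
BODY='{"model": "Qwen/Qwen3-0.6B", "prompt": "Hello, how are you?", "max_tokens": 1}'
for i in 1 2; do
  if [ "$DRY_RUN" = "1" ]; then
    echo "request $i: POST /v1/completions with $BODY"
  else
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d "$BODY" | jq
  fi
done
```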
- See the `kvblock.Index` metrics in the `gaie-kv-events-epp` pod:

```bash
kubectl logs -l inferencepool=gaie-kv-events-epp -n ${NAMESPACE} --tail 100 | grep "metrics beat"
```

You should see output similar to:

```
I0718 23:57:10.781371       1 collector.go:107] "metrics beat" logger="metrics" admissions=3 evictions=0 lookups=1 hits=2 latency_count=1 latency_sum=0.000006859 latency_avg=0.0000022863333333333334
```
The `admissions` count indicates how many KV-blocks were added to the index through vLLM's KV-Events, while the `hits` count indicates how many times the index was able to find a KV-block for a pod. If the beat is missing lookups, wait for the next one (beats fire once per minute).
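The beat is a plain `key=value` line, so individual counters can be pulled out directly in the shell; a minimal sketch, using the sample line from the output above:

```bash
# Sample "metrics beat" line, copied from the log output above
line='I0718 23:57:10.781371       1 collector.go:107] "metrics beat" logger="metrics" admissions=3 evictions=0 lookups=1 hits=2 latency_count=1 latency_sum=0.000006859 latency_avg=0.0000022863333333333334'

# Extract the counters we care about
admissions=$(printf '%s\n' "$line" | grep -o 'admissions=[0-9]*' | cut -d= -f2)
evictions=$(printf '%s\n' "$line" | grep -o 'evictions=[0-9]*' | cut -d= -f2)
hits=$(printf '%s\n' "$line" | grep -o 'hits=[0-9]*' | cut -d= -f2)

echo "admitted=$admissions evicted=$evictions hits=$hits"
```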
## Cleanup

To remove the deployment:

```bash
# Remove the model services
# From examples/precise-prefix-cache-aware
helmfile destroy -n ${NAMESPACE}

# Or uninstall manually
helm uninstall infra-kv-events -n ${NAMESPACE}
helm uninstall gaie-kv-events -n ${NAMESPACE}
helm uninstall ms-kv-events -n ${NAMESPACE}
```
NOTE: If you set the `$RELEASE_NAME_POSTFIX` environment variable, your release names will differ from those in the commands above: `infra-$RELEASE_NAME_POSTFIX`, `gaie-$RELEASE_NAME_POSTFIX`, and `ms-$RELEASE_NAME_POSTFIX`.
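For example, with a hypothetical postfix of `kv-events-2`, the matching uninstall commands can be generated like this (the `echo` keeps it a dry run; drop it to actually uninstall):

```bash
# kv-events-2 is an illustrative value; use whatever you passed at install time
RELEASE_NAME_POSTFIX=${RELEASE_NAME_POSTFIX:-kv-events-2}
NAMESPACE=${NAMESPACE:-llm-d-precise}
for prefix in infra gaie ms; do
  echo helm uninstall "${prefix}-${RELEASE_NAME_POSTFIX}" -n "${NAMESPACE}"
done
```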
NOTE: You do not need to specify your environment with the `-e <environment>` flag to `helmfile` when removing an installation of the quickstart, even if you used a non-default option. You do, however, have to set `-n ${NAMESPACE}`; otherwise it may not clean up the releases in the proper namespace.
## Customization

For information on customizing an installation of a quickstart path, and for tips on building your own, see our docs.
This content is automatically synced from quickstart/examples/precise-prefix-cache-aware/README.md in the llm-d-incubation/llm-d-infra repository.