Monitoring

The cache service exposes health endpoints, Prometheus metrics, and structured logs. Monitor it as an acceleration layer: cache failures matter and should be visible, but correctness depends on origin availability, not on the cache.

Health Endpoints

Endpoint	Use
`/v1/health/live`	Liveness. Returns `ok` when the process is running.
`/v1/health`	Readiness. Returns `ok` when the service can reach origin.
`/health/live`	Compatibility alias for `/v1/health/live`.
`/health`	Compatibility alias for `/v1/health`.

Use the liveness endpoint to decide when to restart the process. Use the readiness endpoint to decide whether the load balancer should keep the instance in rotation.
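As one way to wire the two endpoints to those roles, the fragment below sketches Kubernetes probes. It assumes the service runs as a Kubernetes container and listens on port 8443 (the port from the scrape example). The timing values are placeholders, not recommendations from this service.

```yaml
# Hypothetical container spec fragment. Port and timings are assumptions.
# Add `scheme: HTTPS` under each httpGet if the service terminates TLS itself.
livenessProbe:
  httpGet:
    path: /v1/health/live   # process is running; repeated failure triggers a restart
    port: 8443
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /v1/health        # origin is reachable; failure removes the pod from rotation
    port: 8443
  periodSeconds: 5
  failureThreshold: 2
```

Keeping the origin check out of the liveness probe means an origin outage takes instances out of rotation without restarting them.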

Prometheus

Scrape:

scrape_configs:
  - job_name: crab-cache
    static_configs:
      - targets: ["crab-cache.example.com:8443"]
    metrics_path: /v1/metrics

Important metrics:

Metric	Meaning
`cache_hit_total`	Cache hits by object type.
`cache_miss_total`	Cache misses by object type.
`cache_bytes_served`	Bytes served to clients, split by hit and miss.
`cache_bytes_stored`	Current cache size.
`origin_fetch_total`	Misses that required origin reads.
`origin_fetch_bytes`	Bytes fetched from origin.
`push_warming_total`	Successful push-warming writes.
`dedup_query_total`	Dedup query count.
`dedup_chunks_known`	Chunks reported as already known.
`dedup_chunks_unknown`	Chunks reported as unknown.
`cache_eviction_total`	Evicted objects by type.

Useful Dashboard Panels

Track:

Cache hit rate.
Bytes served from cache versus origin.
Cache utilization versus configured budget.
Origin fetch latency.
Push warming rate.
Dedup known/unknown ratio.
4xx and 5xx response rate.

Example Queries

Cache hit rate:

sum(rate(cache_hit_total[5m])) /
  (sum(rate(cache_hit_total[5m])) + sum(rate(cache_miss_total[5m])))

Bytes per second served from cache, averaged over the last hour:

sum(rate(cache_bytes_served{hit="true"}[1h]))

Dedup ratio (fraction of queried chunks that were already known):

sum(rate(dedup_chunks_known[5m])) /
  (sum(rate(dedup_chunks_known[5m])) + sum(rate(dedup_chunks_unknown[5m])))

Alerts

Recommended alerts:

Alert	Condition
Cache down	Prometheus cannot scrape the service.
Origin unreachable	Readiness fails for several minutes.
Hit rate low	Hit rate remains low after the cache should be warm.
Cache near full	Cache usage exceeds the planned high-water point.
Origin latency high	Origin miss path becomes slow.
Auth failures spike	401 or 403 rate increases unexpectedly.

Do not page on a low hit rate immediately after a new deployment or after replacing the cache volume. The cache needs time to warm.
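A minimal sketch of Prometheus alerting rules for three of the conditions above, using only metrics listed on this page plus the built-in `up` series. The alert names, thresholds, `for` durations, severity labels, and the 400 GiB high-water value are all assumptions to be replaced with your own values. Origin reachability, origin latency, and auth failures are omitted because this page does not name a metric for them.

```yaml
groups:
  - name: crab-cache
    rules:
      - alert: CrabCacheDown
        # `up` is recorded by Prometheus for every scrape target.
        expr: up{job="crab-cache"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Prometheus cannot scrape crab-cache

      - alert: CrabCacheHitRateLow
        # 0.5 and 2h are placeholder values. The long `for` keeps a
        # cold cache from alerting right after a deployment.
        expr: |
          sum(rate(cache_hit_total[5m]))
            / (sum(rate(cache_hit_total[5m])) + sum(rate(cache_miss_total[5m])))
            < 0.5
        for: 2h
        labels:
          severity: ticket

      - alert: CrabCacheNearFull
        # 400 GiB stands in for the planned high-water point.
        expr: cache_bytes_stored > 400 * 1024 * 1024 * 1024
        for: 15m
        labels:
          severity: ticket
```

When there is no traffic, the hit-rate expression evaluates to NaN, which does not satisfy the comparison, so an idle cache does not fire the alert.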

Logs

Use JSON logs in production:

[logging]
format = "json"
level = "info"

Use logs to answer:

Are clients reaching the service?
Are requests hitting cache or falling through to origin?
Are push-warming requests arriving?
Are auth failures caused by missing credentials or policy denial?
