Observability (Prometheus, Grafana, Loki, Tempo)
This cluster includes a standard observability stack for metrics, logs, and traces.
- Prometheus Stack (Prometheus + Alerting components + Grafana)
- Loki (logs)
- Tempo (traces)
- (Optional) Alloy (agent/collector depending on how it is configured in your vendor setup)
The stack is deployed via Argo CD using the vendor + overrides model.
Architecture: vendor vs overrides
Vendor
Vendor charts live under:
- vendor/applications/prometheus-stack
- vendor/applications/loki
- vendor/applications/tempo
- vendor/applications/alloy
The vendor prometheus-stack chart also ships example dashboards as templates:
- templates/DCGM-dashboard.yaml (GPU dashboard)
- templates/fastapi-dashboard.yaml
- templates/webui-dashboard.yaml
Overrides
Cluster-specific changes live under:
- overrides/prometheus-stack/values.yaml
- overrides/prometheus-stack/templates/* (e.g. sealed secrets)
This typically includes:
- Grafana ingress configuration
- OAuth/OIDC authentication settings
- Datasource wiring (Loki + Tempo)
- TLS and domain settings
- Any required secrets (OIDC client secret, etc.)
GitOps deployment (Argo CD)
Observability applications are usually defined in:
overrides/argo-cd-resources/values.yaml
Typical apps:
- prometheus-stack (namespace monitoring)
- loki (namespace monitoring)
- tempo (namespace monitoring)
- alloy (namespace monitoring)
Once Argo CD is bootstrapped, it reconciles the stack automatically.
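To confirm that reconciliation has happened, the sync and health status of the Argo CD Application resources can be listed with kubectl. This is a minimal sketch, assuming Argo CD runs in the argocd namespace (not stated in this document; override ARGOCD_NAMESPACE if yours differs) and that the Application names match the typical apps listed above. The function name is illustrative.

```shell
# Print sync and health status for the given Argo CD Applications.
# Assumption: Argo CD is installed in the "argocd" namespace; set
# ARGOCD_NAMESPACE to override.
observability_app_status() {
  kubectl -n "${ARGOCD_NAMESPACE:-argocd}" get applications.argoproj.io "$@" \
    -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
}

# Usage (app names taken from the list of typical apps):
#   observability_app_status prometheus-stack loki tempo alloy
```

Applications that report Synced and Healthy have been reconciled; anything else is worth inspecting in the Argo CD UI.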
Prometheus Stack configuration
Cluster settings are defined in:
overrides/prometheus-stack/values.yaml
Common configuration areas:
Grafana Ingress / external access
Ingress settings are cluster-specific and configured in the overrides values file.
Because different municipalities will use different DNS and domains, the documentation stays generic:
- Update the Grafana ingress host(s) to match your environment
- Ensure the TLS secret name matches your cluster (e.g. wildcard cert)
Example (conceptual):
prometheus-stack:
  grafana:
    ingress:
      enabled: true
      ingressClassName: traefik
      hosts:
        - <grafana-domain>
      tls:
        - secretName: <tls-secret>
          hosts:
            - <grafana-domain>
If you do not use ingress, you can port-forward Grafana locally instead (see below).
Grafana authentication (OIDC / OAuth)
Grafana can be configured to authenticate using an OIDC provider (for example Authentik).
In this setup:
- Grafana reads the OIDC client secret from an environment variable
- The secret is provided via a Kubernetes Secret (often via Sealed Secrets)
Sealed secret
The override includes a template like:
overrides/prometheus-stack/templates/sealed-grafana-oidc-secret.yaml
This secret typically provides:
GRAFANA_OAUTH_CLIENT_SECRET
Grafana ini configuration
OIDC settings are configured in:
overrides/prometheus-stack/values.yaml under grafana.grafana.ini
Example concepts:
- auth.generic_oauth.enabled = true
- auth.generic_oauth.client_id = ...
- auth.generic_oauth.client_secret = $__env{GRAFANA_OAUTH_CLIENT_SECRET}
- issuer/authorize/token/userinfo endpoints for your provider
Important: Keep all municipality-specific URLs (issuer/auth/token endpoints and domains) in overrides only.
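The provider endpoints can usually be read from the provider's OpenID Connect discovery document rather than typed by hand. This is a minimal sketch, assuming your provider publishes discovery metadata at the standard /.well-known/openid-configuration path under the issuer URL. The issuer placeholder and the function names are illustrative.

```shell
# Build the OpenID Connect discovery URL for an issuer.
# A single trailing slash on the issuer is tolerated.
oidc_discovery_url() {
  printf '%s/.well-known/openid-configuration\n' "${1%/}"
}

# Fetch the discovery document; it lists authorization_endpoint,
# token_endpoint and userinfo_endpoint for the provider.
fetch_oidc_discovery() {
  curl -fsS "$(oidc_discovery_url "$1")"
}

# Usage (replace the placeholder with the issuer URL from your overrides):
#   fetch_oidc_discovery "https://<issuer-url>"
```

If the provider uses a certificate signed by an internal CA, curl needs that CA as well, for example via its --cacert option.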
Custom Grafana image (custom CA certificates)
When Grafana needs to talk to an internal OIDC provider using a certificate signed by an internal CA, you must ensure Grafana trusts that CA.
This repo uses a custom Grafana image that includes the internal CA certificate:
docker/grafana/Dockerfile
Example pattern:
- Copy the internal CA certificate into the container
- Install/update CA certificates in the image
This is necessary when:
- your OIDC provider uses internal TLS certificates
- Grafana fails with TLS verification errors when contacting auth endpoints
Datasources (Loki + Tempo)
Grafana is configured to include additional datasources:
- Loki for logs
- Tempo for traces
These are commonly configured in:
overrides/prometheus-stack/values.yaml under grafana.additionalDataSources
Conceptual example:
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-headless:3100
    - name: Tempo
      type: tempo
      url: http://tempo:3200
      jsonData:
        tracesToLogs:
          datasourceUid: 'Loki'
        serviceMap:
          datasourceUid: 'prometheus'
This enables cross-navigation:
- traces → related logs
- traces → service map
- traces → metrics panels (if configured)
Access Grafana
Option A: Port-forward (recommended for initial validation)
kubectl -n monitoring port-forward svc/prometheus-stack-grafana 3000:80
Then open:
- http://localhost:3000
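With the port-forward running, Grafana's health endpoint gives a quick check from the terminal before opening the browser. This is a minimal sketch, assuming the port-forward from Option A is active on localhost:3000. The function name is illustrative.

```shell
# Query Grafana's unauthenticated health endpoint.
# The default base URL assumes the Option A port-forward is running.
grafana_health() {
  curl -fsS "${1:-http://localhost:3000}/api/health"
}

# Usage:
#   grafana_health
#   grafana_health "https://<grafana-domain>"
```

A healthy instance typically answers with a small JSON document that includes "database": "ok".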
Option B: Ingress
If ingress is enabled, use your configured DNS name (from overrides).
Validation checklist
1) Pods running
kubectl -n monitoring get pods
2) Grafana datasources present
In Grafana UI:
- Check Connections → Data sources
- Ensure Prometheus is present
- Ensure Loki and Tempo are present (if enabled)
3) Dashboards present
In Grafana UI:
- Check Dashboards
- Confirm vendor dashboards (GPU / FastAPI / WebUI) exist if those templates are enabled
4) Loki logs query works
In Grafana Explore:
- Select Loki datasource
- Query {namespace="monitoring"} or another relevant namespace
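The same query can be sent straight to Loki's HTTP API, which helps tell a Loki problem apart from a Grafana datasource problem. This is a minimal sketch, assuming Loki has been port-forwarded to localhost:3100 and runs without multi-tenant authentication. The function name and the LOKI_LIMIT variable are illustrative.

```shell
# Run a LogQL query against Loki's query_range endpoint.
# $1: LogQL selector
# $2: optional base URL (defaults to an assumed local port-forward)
loki_query() {
  curl -fsS -G "${2:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${LOKI_LIMIT:-5}"
}

# Usage, after port-forwarding the Loki service to local port 3100
# (the service name depends on your install):
#   loki_query '{namespace="monitoring"}'
```

A JSON response with a non-empty result list suggests Loki is ingesting logs for that namespace.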
5) Tempo traces visible (if instrumented)
In Grafana Explore:
- Select Tempo datasource
- Query traces for instrumented services
Troubleshooting
Grafana cannot log in via OIDC
Common causes:
- wrong issuer/auth/token/userinfo URLs
- TLS errors (missing CA in the Grafana image)
- missing GRAFANA_OAUTH_CLIENT_SECRET secret
- wrong redirect URL configured in the OIDC provider
Check:
- Grafana pod logs:
kubectl -n monitoring logs deploy/prometheus-stack-grafana
- Ensure the secret exists:
kubectl -n monitoring get secret grafana-oidc-secret -o yaml
TLS verification errors to OIDC provider
If logs mention certificate verification issues:
- confirm Grafana uses the custom image that includes internal CA certs
- confirm the CA certificate is correct and updated
- confirm update-ca-certificates runs successfully in the image build
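To check the CA independently of the image build, the provider's certificate chain can be verified against the CA file directly. This is a minimal sketch using openssl s_client. The host and the CA file path are placeholders you supply, and the function name is illustrative.

```shell
# Verify a server's TLS chain against a specific CA file.
# $1: host name of the OIDC provider
# $2: path to the internal CA certificate (PEM)
# $3: optional port (defaults to 443)
verify_tls_with_ca() {
  openssl s_client -connect "$1:${3:-443}" -servername "$1" \
    -CAfile "$2" -verify_return_error </dev/null
}

# Usage:
#   verify_tls_with_ca "<oidc-host>" ./internal-ca.crt
```

A successful run reports "Verify return code: 0 (ok)". If verification fails here too, the CA file itself is the likely problem rather than the Grafana image.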
Loki / Tempo datasource not working
Check:
- service names and URLs in additionalDataSources
- that Loki/Tempo services exist:
kubectl -n monitoring get svc | grep -E "loki|tempo|grafana|prom"
- network policies / ingress policies (if any)