Monitoring Guide — Membership One

1. Overview

Membership One uses a comprehensive monitoring stack to ensure system reliability, performance, and business observability.

Monitoring Stack

Component Purpose Access
Prometheus Metrics collection and storage http://prometheus.membership-one.com
Grafana Dashboards and visualization http://grafana.membership-one.com
Loki Log aggregation Via Grafana (Explore)
Alertmanager Alert routing and notification Via Prometheus
Icinga External uptime monitoring http://icinga.membership-one.com
Spring Actuator Application metrics endpoint /api/actuator/prometheus

2. Spring Actuator Metrics

The backend exposes Micrometer Prometheus metrics at /api/actuator/prometheus.

Key Metrics

Metric Description Type
jvm_memory_used_bytes JVM heap/non-heap memory usage Gauge
jvm_gc_pause_seconds GC pause duration Timer
jvm_threads_live_threads Active thread count Gauge
http_server_requests_seconds HTTP request duration by method/status/uri Timer
hikaricp_connections_active Active DB connections Gauge
hikaricp_connections_idle Idle DB connections Gauge
hikaricp_connections_pending Pending DB connections Gauge
spring_data_repository_invocations_seconds Repository query time Timer
resilience4j_circuitbreaker_state Cash360 circuit breaker state Gauge
resilience4j_retry_calls_total Cash360 retry count Counter
membership_billing_cycle_duration_seconds Billing cycle execution time Timer
membership_checkin_total Check-in count by method/zone Counter
membership_active_members Total active members Gauge
membership_active_contracts Total active contracts Gauge
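
These metrics can be queried directly in Prometheus. A few example queries over the names above (the `_bucket` series assumes percentile histograms are enabled for http.server.requests; label names follow Micrometer defaults):

# p95 HTTP latency over the last 5 minutes, per endpoint
histogram_quantile(0.95,
  sum by (uri, le) (rate(http_server_requests_seconds_bucket[5m])))

# HikariCP pool saturation (active / max)
hikaricp_connections_active / hikaricp_connections_max

# Heap usage as a fraction of the configured maximum
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"})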

Custom Business Metrics

Register custom metrics in MembershipMetrics.java using Micrometer:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class MembershipMetrics {
    private final Counter checkinCounter;
    private final Timer billingTimer;

    public MembershipMetrics(MeterRegistry registry) {
        // Dotted names are exposed with underscores at /api/actuator/prometheus,
        // e.g. membership_checkin_total{method="qr"}.
        this.checkinCounter = Counter.builder("membership.checkin.total")
            .tag("method", "qr")
            .register(registry);
        this.billingTimer = Timer.builder("membership.billing.cycle.duration")
            .register(registry);
    }
}
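
At the scrape endpoint, Micrometer renders these dotted names in Prometheus exposition format with underscores. The output looks roughly like this (values are illustrative, not real samples):

# TYPE membership_checkin_total counter
membership_checkin_total{method="qr"} 1523.0
# TYPE membership_billing_cycle_duration_seconds summary
membership_billing_cycle_duration_seconds_count 12.0
membership_billing_cycle_duration_seconds_sum 38.4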

3. Prometheus Configuration

Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: 'membership-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['membership-prod']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
    metrics_path: /api/actuator/prometheus
    scrape_interval: 15s

Pod Annotations

The Helm chart configures these annotations on pods:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/api/actuator/prometheus"
  prometheus.io/port: "8081"

4. Grafana Dashboards

Dashboard 1: JVM and Application Health

  • JVM Heap Usage — used vs. committed vs. max (line chart)
  • GC Pause Duration — p50, p95, p99 (histogram)
  • Thread Count — live, daemon, peak (line chart)
  • CPU Usage — process_cpu_usage vs. system_cpu_usage (gauge)
  • Uptime — process_uptime_seconds (stat panel)
  • Active Profiles — spring_profiles_active (text panel)
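
The JVM panels above can typically be driven by queries like the following (a sketch of the panel queries, not the exported dashboard JSON):

# JVM Heap Usage — one series per panel line
sum(jvm_memory_used_bytes{area="heap"})
sum(jvm_memory_committed_bytes{area="heap"})
sum(jvm_memory_max_bytes{area="heap"})

# Average GC pause over the last 5 minutes
rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])

# Live thread count
jvm_threads_live_threads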

Dashboard 2: API Performance

  • Request Rate — requests/second by status code (line chart)
  • Response Time — p50, p95, p99 by endpoint (heatmap)
  • Error Rate — 4xx and 5xx as percentage (gauge, threshold: <1%)
  • Slowest Endpoints — top 10 by avg duration (table)
  • Request Size — incoming payload sizes (histogram)
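
Representative queries for these panels (assuming default Micrometer labels on http_server_requests_seconds):

# Request rate by status code
sum by (status) (rate(http_server_requests_seconds_count[5m]))

# 5xx error rate as a percentage (for the <1% threshold gauge)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      / sum(rate(http_server_requests_seconds_count[5m]))

# Top 10 slowest endpoints by average duration
topk(10,
  sum by (uri) (rate(http_server_requests_seconds_sum[5m]))
  / sum by (uri) (rate(http_server_requests_seconds_count[5m])))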

Dashboard 3: Database and Connections

  • Active DB Connections — vs. pool max (line chart, threshold: <80% max)
  • Pending Connections — queue depth (line chart, alert if > 5)
  • Connection Wait Time — hikaricp_connections_acquire_seconds (histogram)
  • Query Duration — repository invocation times by method (heatmap)
  • Flyway Migration Status — version, applied count (stat panel)
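
Example queries for the pool panels (the repository `_bucket` series requires percentile histograms to be enabled):

# Pool utilization, feeding the 80% threshold
hikaricp_connections_active / hikaricp_connections_max

# Queue depth behind the pending-connections alert
hikaricp_connections_pending

# p95 repository query time by method
histogram_quantile(0.95,
  sum by (method, le) (rate(spring_data_repository_invocations_seconds_bucket[5m])))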

Dashboard 4: Business KPIs

  • Active Members — total count (stat panel)
  • New Registrations — daily/weekly trend (bar chart)
  • Check-ins Today — by method (QR/NFC/BLE) (pie chart)
  • Monthly Recurring Revenue — from active contracts (stat panel)
  • Open Debt — total outstanding amount (gauge)
  • Contract Conversion — trial to paid ratio (gauge)
  • Cash360 Circuit Breaker — open/half-open/closed state (status map)
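
Sketches of the KPI panel queries (the `name="cash360"` breaker label is an assumption about how the circuit breaker is registered):

# Check-ins today, split by method (QR/NFC/BLE)
sum by (method) (increase(membership_checkin_total[24h]))

# Active members and contracts
membership_active_members
membership_active_contracts

# Circuit breaker state (one series per state; the active state reports 1)
resilience4j_circuitbreaker_state{name="cash360"}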

Dashboard 5: Infrastructure

  • Pod Status — running/pending/failed (status grid)
  • CPU/Memory per Pod — resource usage vs. limits (line chart)
  • Node Resource Usage — Hetzner node utilization (bar chart)
  • Persistent Volume — disk usage (gauge)
  • Network I/O — bytes in/out per pod (line chart)

5. Loki Log Aggregation

Log Format

The application outputs structured JSON logs to stdout:

{
  "timestamp": "2026-02-23T10:15:30.123Z",
  "level": "INFO",
  "logger": "c.m.payment.service.BillingService",
  "message": "Billing cycle completed",
  "traceId": "abc123def456",
  "entityId": 42,
  "processedCount": 150,
  "duration": 3200
}
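
A common way to produce this JSON format from a Spring Boot service is logstash-logback-encoder. A minimal logback-spring.xml sketch, assuming that dependency is on the classpath (extra fields such as entityId and processedCount would come from structured logging arguments):

<!-- logback-spring.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>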

Useful LogQL Queries

# All errors in the last hour
{namespace="membership-prod", app="membership-api"} |= "ERROR"

# Payment failures
{namespace="membership-prod"} | json | logger =~ ".*payment.*" | level = "ERROR"

# Slow requests (> 2s)
{namespace="membership-prod"} | json | duration > 2000

# Authentication failures
{namespace="membership-prod"} | json | message =~ ".*authentication failed.*"

# Billing cycle results
{namespace="membership-prod"} | json | logger = "c.m.payment.service.BillingService"
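
Beyond log filtering, LogQL can also derive metrics from these logs, for dashboards or Loki ruler alerts:

# Error log rate across the API pods
sum(rate({namespace="membership-prod", app="membership-api"} |= "ERROR" [5m]))

# p95 of the logged "duration" field for billing runs over the last hour
quantile_over_time(0.95,
  {namespace="membership-prod"} | json
  | logger = "c.m.payment.service.BillingService"
  | unwrap duration [1h])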

6. Alert Rules

Critical Alerts (PagerDuty / SMS)

# Alert Condition For Severity
1 Pod Down kube_deployment_status_replicas_available{deployment="membership-api"} < 1 1m Critical
2 High Error Rate rate(http_server_requests_seconds_count{status=~"5.."}[5m]) / rate(http_server_requests_seconds_count[5m]) > 0.05 5m Critical
3 OOMKilled kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0 0m Critical
4 Backup Failed time() - membership_backup_last_success_timestamp > 86400 1h Critical
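
Expressed as a Prometheus rule file, alert #2 above would look roughly like this (annotation wording is illustrative):

# membership-alerts.yml
groups:
  - name: membership-critical
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_server_requests_seconds_count{status=~"5.."}[5m])
            / rate(http_server_requests_seconds_count[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% on membership-api"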

Warning Alerts (Email / Slack)

# Alert Condition For Severity
5 High API Latency histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 2 5m Warning
6 DB Connection Saturation hikaricp_connections_active / hikaricp_connections_max > 0.8 5m Warning
7 Disk Usage High kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85 15m Warning
8 Certificate Expiry certmanager_certificate_expiration_timestamp_seconds - time() < 604800 1h Warning
9 RabbitMQ Queue Depth rabbitmq_queue_messages{queue="membership-notifications"} > 1000 10m Warning

Alertmanager Routing

# alertmanager.yml
route:
  receiver: 'slack-warnings'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#membership-alerts'
        title: '{{ .CommonAnnotations.summary }}'

7. Icinga External Monitoring

Icinga performs external checks from outside the Kubernetes cluster to verify public-facing availability.

Checks

Check Target Interval Expected
SSL Certificate api.membership-one.com 6h Valid, > 14 days remaining
SSL Certificate app.membership-one.com 6h Valid, > 14 days remaining
DNS Resolution api.membership-one.com 5m Resolves to Hetzner LB IP
HTTP Health https://api.membership-one.com/api/actuator/health 1m HTTP 200, {"status":"UP"}
SMTP Connectivity Mail server 15m Port 587 reachable
Cash360 API https://www.my-factura.com/api/v1/health 5m HTTP 200
RabbitMQ Management Internal endpoint 5m Port 15672 reachable
Redis Ping Internal endpoint 1m PONG response
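
As a sketch, the HTTP health check could be defined in Icinga 2's configuration DSL like this (the host name and assign rule are assumptions about the local setup; the vars map to the standard check_http plugin):

// conf.d/membership-one.conf
apply Service "http-health" {
  check_command = "http"
  check_interval = 1m
  vars.http_vhost = "api.membership-one.com"
  vars.http_uri = "/api/actuator/health"
  vars.http_ssl = true
  vars.http_string = "\"status\":\"UP\""
  assign where host.name == "membership-one-public"
}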

Icinga Notification

Icinga sends notifications via:

  • Email — All alerts to ops@membership-one.com
  • Slack — Critical alerts to #membership-ops
  • PagerDuty — Escalation for unacknowledged critical alerts after 15 minutes

8. On-Call Rotation

Role Responsibility Contact
Primary On-Call First responder for critical alerts Rotates weekly
Secondary On-Call Escalation after 15 minutes Rotates weekly
Engineering Lead Decision authority for rollbacks Fixed

Escalation timeline:

  1. T+0 — Alert fires, primary on-call notified
  2. T+15m — No acknowledgment: secondary on-call notified
  3. T+30m — No acknowledgment: engineering lead notified
  4. T+1h — If unresolved: management notified