Monitoring Guide — Membership One

1. Overview

Membership One uses a comprehensive monitoring stack to ensure system reliability, performance, and business observability.

Monitoring Stack

Component Purpose Access
Prometheus Metrics collection and storage http://prometheus.membership-one.com
Grafana Dashboards and visualization http://grafana.membership-one.com
Loki Log aggregation Via Grafana (Explore)
Alertmanager Alert routing and notification Via Prometheus
Icinga External uptime monitoring http://icinga.membership-one.com
Spring Actuator Application metrics endpoint /api/actuator/prometheus

2. Spring Actuator Metrics

The backend exposes Micrometer Prometheus metrics at /api/actuator/prometheus.

Key Metrics

Metric Description Type
jvm_memory_used_bytes JVM heap/non-heap memory usage Gauge
jvm_gc_pause_seconds GC pause duration Timer
jvm_threads_live_threads Active thread count Gauge
http_server_requests_seconds HTTP request duration by method/status/uri Timer
hikaricp_connections_active Active DB connections Gauge
hikaricp_connections_idle Idle DB connections Gauge
hikaricp_connections_pending Pending DB connections Gauge
spring_data_repository_invocations_seconds Repository query time Timer
resilience4j_circuitbreaker_state Cash360 circuit breaker state Gauge
resilience4j_retry_calls_total Cash360 retry count Counter
membership_billing_cycle_duration_seconds Billing cycle execution time Timer
membership_checkin_total Check-in count by method/zone Counter
membership_active_members Total active members Gauge
membership_active_contracts Total active contracts Gauge
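
These metrics can be queried directly in Prometheus. A few example queries over the names above (the `_bucket` series assumes percentile histograms are enabled for http.server.requests; label names follow Micrometer defaults):

# p95 HTTP latency over the last 5 minutes, per endpoint
histogram_quantile(0.95,
  sum by (uri, le) (rate(http_server_requests_seconds_bucket[5m])))

# HikariCP pool saturation (active / max)
hikaricp_connections_active / hikaricp_connections_max

# Heap usage as a fraction of the configured maximum
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"})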

Custom Business Metrics

Register custom metrics in MembershipMetrics.java using Micrometer:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class MembershipMetrics {
    private final Counter checkinCounter;
    private final Timer billingTimer;

    public MembershipMetrics(MeterRegistry registry) {
        // Dotted names are exposed with underscores at /api/actuator/prometheus,
        // e.g. membership_checkin_total{method="qr"}.
        this.checkinCounter = Counter.builder("membership.checkin.total")
            .tag("method", "qr")
            .register(registry);
        this.billingTimer = Timer.builder("membership.billing.cycle.duration")
            .register(registry);
    }
}
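
At the scrape endpoint, Micrometer renders these dotted names in Prometheus exposition format with underscores. The output looks roughly like this (values are illustrative, not real samples):

# TYPE membership_checkin_total counter
membership_checkin_total{method="qr"} 1523.0
# TYPE membership_billing_cycle_duration_seconds summary
membership_billing_cycle_duration_seconds_count 12.0
membership_billing_cycle_duration_seconds_sum 38.4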

3. Prometheus Configuration

Scrape Config

# prometheus.yml
scrape_configs:
  - job_name: 'membership-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['membership-prod']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
    metrics_path: /api/actuator/prometheus
    scrape_interval: 15s

Pod Annotations

The Helm chart configures these annotations on pods:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/api/actuator/prometheus"
  prometheus.io/port: "8081"

4. Grafana Dashboards

Dashboard 1: JVM and Application Health

  • JVM Heap Usage — used vs. committed vs. max (line chart)
  • GC Pause Duration — p50, p95, p99 (histogram)
  • Thread Count — live, daemon, peak (line chart)
  • CPU Usage — process_cpu_usage vs. system_cpu_usage (gauge)
  • Uptime — process_uptime_seconds (stat panel)
  • Active Profiles — spring_profiles_active (text panel)
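
The JVM panels above can typically be driven by queries like the following (a sketch of the panel queries, not the exported dashboard JSON):

# JVM Heap Usage — one series per panel line
sum(jvm_memory_used_bytes{area="heap"})
sum(jvm_memory_committed_bytes{area="heap"})
sum(jvm_memory_max_bytes{area="heap"})

# Average GC pause over the last 5 minutes
rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])

# Live thread count
jvm_threads_live_threads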

Dashboard 2: API Performance

  • Request Rate — requests/second by status code (line chart)
  • Response Time — p50, p95, p99 by endpoint (heatmap)
  • Error Rate — 4xx and 5xx as percentage (gauge, threshold: <1%)
  • Slowest Endpoints — top 10 by avg duration (table)
  • Request Size — incoming payload sizes (histogram)
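
Representative queries for these panels (assuming default Micrometer labels on http_server_requests_seconds):

# Request rate by status code
sum by (status) (rate(http_server_requests_seconds_count[5m]))

# 5xx error rate as a percentage (for the <1% threshold gauge)
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
      / sum(rate(http_server_requests_seconds_count[5m]))

# Top 10 slowest endpoints by average duration
topk(10,
  sum by (uri) (rate(http_server_requests_seconds_sum[5m]))
  / sum by (uri) (rate(http_server_requests_seconds_count[5m])))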

Dashboard 3: Database and Connections

  • Active DB Connections — vs. pool max (line chart, threshold: <80% max)
  • Pending Connections — queue depth (line chart, alert if > 5)
  • Connection Wait Time — hikaricp_connections_acquire_seconds (histogram)
  • Query Duration — repository invocation times by method (heatmap)
  • Flyway Migration Status — version, applied count (stat panel)
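
Example queries for the pool panels (the repository `_bucket` series requires percentile histograms to be enabled):

# Pool utilization, feeding the 80% threshold
hikaricp_connections_active / hikaricp_connections_max

# Queue depth behind the pending-connections alert
hikaricp_connections_pending

# p95 repository query time by method
histogram_quantile(0.95,
  sum by (method, le) (rate(spring_data_repository_invocations_seconds_bucket[5m])))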

Dashboard 4: Business KPIs

  • Active Members — total count (stat panel)
  • New Registrations — daily/weekly trend (bar chart)
  • Check-ins Today — by method (QR/NFC/BLE) (pie chart)
  • Monthly Recurring Revenue — from active contracts (stat panel)
  • Open Debt — total outstanding amount (gauge)
  • Contract Conversion — trial to paid ratio (gauge)
  • Cash360 Circuit Breaker — open/half-open/closed state (status map)
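
Sketches of the KPI panel queries (the `name="cash360"` breaker label is an assumption about how the circuit breaker is registered):

# Check-ins today, split by method (QR/NFC/BLE)
sum by (method) (increase(membership_checkin_total[24h]))

# Active members and contracts
membership_active_members
membership_active_contracts

# Circuit breaker state (one series per state; the active state reports 1)
resilience4j_circuitbreaker_state{name="cash360"}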

Dashboard 5: Infrastructure

  • Pod Status — running/pending/failed (status grid)
  • CPU/Memory per Pod — resource usage vs. limits (line chart)
  • Node Resource Usage — Hetzner node utilization (bar chart)
  • Persistent Volume — disk usage (gauge)
  • Network I/O — bytes in/out per pod (line chart)

5. Loki Log Aggregation

Log Format

The application outputs structured JSON logs to stdout:

{
  "timestamp": "2026-02-23T10:15:30.123Z",
  "level": "INFO",
  "logger": "c.m.payment.service.BillingService",
  "message": "Billing cycle completed",
  "traceId": "abc123def456",
  "entityId": 42,
  "processedCount": 150,
  "duration": 3200
}
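
A common way to produce this JSON format from a Spring Boot service is logstash-logback-encoder. A minimal logback-spring.xml sketch, assuming that dependency is on the classpath (extra fields such as entityId and processedCount would come from structured logging arguments):

<!-- logback-spring.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>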

Useful LogQL Queries

# All errors in the last hour
{namespace="membership-prod", app="membership-api"} |= "ERROR"

# Payment failures
{namespace="membership-prod"} | json | logger =~ ".*payment.*" | level = "ERROR"

# Slow requests (> 2s)
{namespace="membership-prod"} | json | duration > 2000

# Authentication failures
{namespace="membership-prod"} | json | message =~ ".*authentication failed.*"

# Billing cycle results
{namespace="membership-prod"} | json | logger = "c.m.payment.service.BillingService"
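
Beyond log filtering, LogQL can also derive metrics from these logs, for dashboards or Loki ruler alerts:

# Error log rate across the API pods
sum(rate({namespace="membership-prod", app="membership-api"} |= "ERROR" [5m]))

# p95 of the logged "duration" field for billing runs over the last hour
quantile_over_time(0.95,
  {namespace="membership-prod"} | json
  | logger = "c.m.payment.service.BillingService"
  | unwrap duration [1h])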

6. Alert Rules

Critical Alerts (PagerDuty / SMS)

# Alert Condition For Severity
1 Pod Down kube_deployment_status_replicas_available{deployment="membership-api"} < 1 1m Critical
2 High Error Rate rate(http_server_requests_seconds_count{status=~"5.."}[5m]) / rate(http_server_requests_seconds_count[5m]) > 0.05 5m Critical
3 OOMKilled kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0 0m Critical
4 Backup Failed time() - membership_backup_last_success_timestamp > 86400 1h Critical
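
Expressed as a Prometheus rule file, alert #2 above would look roughly like this (annotation wording is illustrative):

# membership-alerts.yml
groups:
  - name: membership-critical
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_server_requests_seconds_count{status=~"5.."}[5m])
            / rate(http_server_requests_seconds_count[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% on membership-api"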

Warning Alerts (Email / Slack)

# Alert Condition For Severity
5 High API Latency histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 2 5m Warning
6 DB Connection Saturation hikaricp_connections_active / hikaricp_connections_max > 0.8 5m Warning
7 Disk Usage High kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85 15m Warning
8 Certificate Expiry certmanager_certificate_expiration_timestamp_seconds - time() < 604800 1h Warning
9 RabbitMQ Queue Depth rabbitmq_queue_messages{queue="membership-notifications"} > 1000 10m Warning

Alertmanager Routing

# alertmanager.yml
route:
  receiver: 'slack-warnings'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#membership-alerts'
        title: '{{ .CommonAnnotations.summary }}'

7. Icinga External Monitoring

Icinga performs external checks from outside the Kubernetes cluster to verify public-facing availability.

Checks

Check Target Interval Expected
SSL Certificate api.membership-one.com 6h Valid, > 14 days remaining
SSL Certificate app.membership-one.com 6h Valid, > 14 days remaining
DNS Resolution api.membership-one.com 5m Resolves to Hetzner LB IP
HTTP Health https://api.membership-one.com/api/actuator/health 1m HTTP 200, {"status":"UP"}
SMTP Connectivity Mail server 15m Port 587 reachable
Cash360 API https://www.my-factura.com/api/v1/health 5m HTTP 200
RabbitMQ Management Internal endpoint 5m Port 15672 reachable
Redis Ping Internal endpoint 1m PONG response
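
As a sketch, the HTTP health check could be defined in Icinga 2's configuration DSL like this (the host name and assign rule are assumptions about the local setup; the vars map to the standard check_http plugin):

// conf.d/membership-one.conf
apply Service "http-health" {
  check_command = "http"
  check_interval = 1m
  vars.http_vhost = "api.membership-one.com"
  vars.http_uri = "/api/actuator/health"
  vars.http_ssl = true
  vars.http_string = "\"status\":\"UP\""
  assign where host.name == "membership-one-public"
}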

Icinga Notification

Icinga sends notifications via:

  • Email — All alerts to ops@membership-one.com
  • Slack — Critical alerts to #membership-ops
  • PagerDuty — Escalation for unacknowledged critical alerts after 15 minutes

8. On-Call Rotation

Role Responsibility Contact
Primary On-Call First responder for critical alerts Rotates weekly
Secondary On-Call Escalation after 15 minutes Rotates weekly
Engineering Lead Decision authority for rollbacks Fixed

Escalation timeline:

  1. T+0 — Alert fires, primary on-call notified
  2. T+15m — No acknowledgment: secondary on-call notified
  3. T+30m — No acknowledgment: engineering lead notified
  4. T+1h — If unresolved: management notified