Monitoring Guide — Membership One
1. Overview
Membership One uses a comprehensive monitoring stack to ensure system reliability, performance, and business observability.
Monitoring Stack
| Component | Purpose | Access |
|---|---|---|
| Prometheus | Metrics collection and storage | http://prometheus.membership-one.com |
| Grafana | Dashboards and visualization | http://grafana.membership-one.com |
| Loki | Log aggregation | Via Grafana (Explore) |
| Alertmanager | Alert routing and notification | Via Prometheus |
| Icinga | External uptime monitoring | http://icinga.membership-one.com |
| Spring Actuator | Application metrics endpoint | /api/actuator/prometheus |
2. Spring Actuator Metrics
The backend exposes Micrometer Prometheus metrics at /api/actuator/prometheus.
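The endpoint has to be explicitly exposed in the Spring Boot configuration. A minimal sketch (property names are standard Spring Boot actuator properties; whether the `/api` prefix comes from `base-path` as shown here or from the servlet context path depends on how the app is deployed, so treat this as an assumption):

```yaml
# application.yml — expose the Prometheus endpoint (sketch)
management:
  endpoints:
    web:
      base-path: /api/actuator     # assumes the prefix is set here, not via context-path
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: membership-api  # common tag on all metrics (illustrative name)
```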
Key Metrics
| Metric | Description | Type |
|---|---|---|
| `jvm_memory_used_bytes` | JVM heap/non-heap memory usage | Gauge |
| `jvm_gc_pause_seconds` | GC pause duration | Summary |
| `jvm_threads_live_threads` | Active thread count | Gauge |
| `http_server_requests_seconds` | HTTP request duration by method/status/uri | Timer |
| `hikaricp_connections_active` | Active DB connections | Gauge |
| `hikaricp_connections_idle` | Idle DB connections | Gauge |
| `hikaricp_connections_pending` | Pending DB connections | Gauge |
| `spring_data_repository_invocations_seconds` | Repository query time | Timer |
| `resilience4j_circuitbreaker_state` | Cash360 circuit breaker state | Gauge |
| `resilience4j_retry_calls_total` | Cash360 retry count | Counter |
| `membership_billing_cycle_duration_seconds` | Billing cycle execution time | Timer |
| `membership_checkin_total` | Check-in count by method/zone | Counter |
| `membership_active_members` | Total active members | Gauge |
| `membership_active_contracts` | Total active contracts | Gauge |
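As a quick sanity check after deployment, these metrics can be queried directly in Prometheus. The expressions below are illustrative and assume the default Micrometer metric and label names:

```promql
# JVM heap utilization as a fraction of max
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"})

# Connection pool saturation (pairs with the HikariCP gauges above)
hikaricp_connections_active / hikaricp_connections_max

# Check-ins per method over the last hour
sum by (method) (increase(membership_checkin_total[1h]))
```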
Custom Business Metrics
Register custom metrics in MembershipMetrics.java using Micrometer:
```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class MembershipMetrics {
    private final Counter checkinCounter;
    private final Timer billingTimer;

    public MembershipMetrics(MeterRegistry registry) {
        // Exported to Prometheus as membership_checkin_total{method="qr"}
        this.checkinCounter = Counter.builder("membership.checkin.total")
                .tag("method", "qr")
                .register(registry);
        // Exported to Prometheus as membership_billing_cycle_duration_seconds
        this.billingTimer = Timer.builder("membership.billing.cycle.duration")
                .register(registry);
    }
}
```
3. Prometheus Configuration
Scrape Config
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'membership-api'
    metrics_path: /api/actuator/prometheus
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['membership-prod']
    relabel_configs:
      # Only scrape pods that opt in via annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
      # Rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
Pod Annotations
The Helm chart configures these annotations on pods:
```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/api/actuator/prometheus"
  prometheus.io/port: "8081"
```
4. Grafana Dashboards
Dashboard 1: JVM and Application Health
- JVM Heap Usage — used vs. committed vs. max (line chart)
- GC Pause Duration — p50, p95, p99 (histogram)
- Thread Count — live, daemon, peak (line chart)
- CPU Usage — `process_cpu_usage` vs. `system_cpu_usage` (gauge)
- Uptime — `process_uptime_seconds` (stat panel)
- Active Profiles — `spring_profiles_active` (text panel)
Dashboard 2: API Performance
- Request Rate — requests/second by status code (line chart)
- Response Time — p50, p95, p99 by endpoint (heatmap)
- Error Rate — 4xx and 5xx as percentage (gauge, threshold: <1%)
- Slowest Endpoints — top 10 by avg duration (table)
- Request Size — incoming payload sizes (histogram)
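The panels above can be backed by queries along these lines. The PromQL is illustrative and assumes the Micrometer defaults from Section 2, including histogram buckets being enabled for `http.server.requests` (otherwise `_bucket` series will not exist):

```promql
# Request rate by status code
sum by (status) (rate(http_server_requests_seconds_count[5m]))

# p95 response time per endpoint
histogram_quantile(0.95,
  sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m])))

# 5xx error rate as a percentage of all requests
100 * sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
    / sum(rate(http_server_requests_seconds_count[5m]))
```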
Dashboard 3: Database and Connections
- Active DB Connections — vs. pool max (line chart, threshold: <80% max)
- Pending Connections — queue depth (line chart, alert if > 5)
- Connection Wait Time — `hikaricp_connections_acquire_seconds` (histogram)
- Query Duration — repository invocation times by method (heatmap)
- Flyway Migration Status — version, applied count (stat panel)
Dashboard 4: Business KPIs
- Active Members — total count (stat panel)
- New Registrations — daily/weekly trend (bar chart)
- Check-ins Today — by method QR/NFC/BLE (pie chart)
- Monthly Recurring Revenue — from active contracts (stat panel)
- Open Debt — total outstanding amount (gauge)
- Contract Conversion — trial to paid ratio (gauge)
- Cash360 Circuit Breaker — open/half-open/closed state (status map)
Dashboard 5: Infrastructure
- Pod Status — running/pending/failed (status grid)
- CPU/Memory per Pod — resource usage vs. limits (line chart)
- Node Resource Usage — Hetzner node utilization (bar chart)
- Persistent Volume — disk usage (gauge)
- Network I/O — bytes in/out per pod (line chart)
5. Loki Log Aggregation
Log Format
The application outputs structured JSON logs to stdout:
```json
{
  "timestamp": "2026-02-23T10:15:30.123Z",
  "level": "INFO",
  "logger": "c.m.payment.service.BillingService",
  "message": "Billing cycle completed",
  "traceId": "abc123def456",
  "entityId": 42,
  "processedCount": 150,
  "duration": 3200
}
```
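One common way to produce logs in this shape is the logstash-logback-encoder library. A minimal `logback-spring.xml` could look like the following; the encoder class is the library's real entry point, but treat the setup as a sketch rather than the project's actual logging config:

```xml
<!-- logback-spring.xml — JSON logs to stdout (sketch) -->
<configuration>
  <appender name="JSON_STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder ships in the separate logstash-logback-encoder dependency -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON_STDOUT"/>
  </root>
</configuration>
```

Fields like `traceId` or `processedCount` would come from MDC or structured arguments, which the encoder serializes into the JSON document.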
Useful LogQL Queries
```logql
# All errors in the last hour
{namespace="membership-prod", app="membership-api"} |= "ERROR"

# Payment failures
{namespace="membership-prod"} | json | logger =~ ".*payment.*" | level = "ERROR"

# Slow requests (> 2s)
{namespace="membership-prod"} | json | duration > 2000

# Authentication failures
{namespace="membership-prod"} | json | message =~ ".*authentication failed.*"

# Billing cycle results
{namespace="membership-prod"} | json | logger = "c.m.payment.service.BillingService"
```
6. Alert Rules
Critical Alerts (PagerDuty / SMS)
| # | Alert | Condition | For | Severity |
|---|---|---|---|---|
| 1 | Pod Down | `kube_deployment_status_replicas_available{deployment="membership-api"} < 1` | 1m | Critical |
| 2 | High Error Rate | `rate(http_server_requests_seconds_count{status=~"5.."}[5m]) / rate(http_server_requests_seconds_count[5m]) > 0.05` | 5m | Critical |
| 3 | OOMKilled | `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0` | 0m | Critical |
| 4 | Backup Failed | `time() - membership_backup_last_success_timestamp > 86400` | 1h | Critical |
Warning Alerts (Email / Slack)
| # | Alert | Condition | For | Severity |
|---|---|---|---|---|
| 5 | High API Latency | `histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 2` | 5m | Warning |
| 6 | DB Connection Saturation | `hikaricp_connections_active / hikaricp_connections_max > 0.8` | 5m | Warning |
| 7 | Disk Usage High | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85` | 15m | Warning |
| 8 | Certificate Expiry | `certmanager_certificate_expiration_timestamp_seconds - time() < 604800` | 1h | Warning |
| 9 | RabbitMQ Queue Depth | `rabbitmq_queue_messages{queue="membership-notifications"} > 1000` | 10m | Warning |
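Expressed as Prometheus alerting rules, alerts 2 and 5 would look roughly like this (rule file name, group name, and annotation text are illustrative, not the deployed rules):

```yaml
# membership-alerts.yml (illustrative)
groups:
  - name: membership-api
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_server_requests_seconds_count{status=~"5.."}[5m])
            / rate(http_server_requests_seconds_count[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% on membership-api"
      - alert: HighApiLatency
        expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 API latency above 2s"
```

The `severity` label is what the Alertmanager routing below matches on.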
Alertmanager Routing
```yaml
# alertmanager.yml
route:
  receiver: 'slack-warnings'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#membership-alerts'
        title: '{{ .CommonAnnotations.summary }}'
```
7. Icinga External Monitoring
Icinga performs external checks from outside the Kubernetes cluster to verify public-facing availability.
Checks
| Check | Target | Interval | Expected |
|---|---|---|---|
| SSL Certificate | api.membership-one.com | 6h | Valid, > 14 days remaining |
| SSL Certificate | app.membership-one.com | 6h | Valid, > 14 days remaining |
| DNS Resolution | api.membership-one.com | 5m | Resolves to Hetzner LB IP |
| HTTP Health | https://api.membership-one.com/api/actuator/health | 1m | HTTP 200, `{"status":"UP"}` |
| SMTP Connectivity | Mail server | 15m | Port 587 reachable |
| Cash360 API | https://www.my-factura.com/api/v1/health | 5m | HTTP 200 |
| RabbitMQ Management | Internal endpoint | 5m | Port 15672 reachable |
| Redis Ping | Internal endpoint | 1m | PONG response |
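In Icinga 2's configuration DSL, the HTTP health check above could be defined roughly as follows. The `http` check command and its `vars.http_*` parameters come from the standard Icinga Template Library; host and service names are illustrative:

```
// Icinga 2 DSL sketch — HTTP health check (names are illustrative)
object Host "membership-api-public" {
  address = "api.membership-one.com"
  check_command = "hostalive"
}

apply Service "http-health" {
  check_command = "http"
  vars.http_address = "api.membership-one.com"
  vars.http_uri = "/api/actuator/health"
  vars.http_ssl = true
  vars.http_string = "\"status\":\"UP\""  // response body must contain this
  check_interval = 1m
  assign where host.name == "membership-api-public"
}
```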
Icinga Notification
Icinga sends notifications via:
- Email — All alerts to ops@membership-one.com
- Slack — Critical alerts to #membership-ops
- PagerDuty — Escalation for unacknowledged critical alerts after 15 minutes
8. On-Call Rotation
| Role | Responsibility | Contact |
|---|---|---|
| Primary On-Call | First responder for critical alerts | Rotates weekly |
| Secondary On-Call | Escalation after 15 minutes | Rotates weekly |
| Engineering Lead | Decision authority for rollbacks | Fixed |
Escalation timeline:
1. T+0 — Alert fires, primary on-call notified
2. T+15m — No acknowledgment: secondary on-call notified
3. T+30m — No acknowledgment: engineering lead notified
4. T+1h — If unresolved: management notified