Support Runbook — Membership One

1. Overview

This runbook covers common operational scenarios, troubleshooting steps, and escalation procedures for the Membership One platform. It is intended for operations engineers, support staff, and on-call personnel.

Log Locations

  • Application logs: stdout (captured by Loki); access via Grafana Explore
  • Kubernetes events: kubectl get events -n membership-prod; access via kubectl
  • PostgreSQL logs: Hetzner Managed DB console; access via Hetzner Cloud
  • Nginx Ingress logs: kubectl logs -n ingress-nginx deployment/nginx-ingress-controller; access via kubectl
  • Flyway history: flyway_schema_history table; access via psql

Quick Diagnostics

# Pod status
kubectl get pods -n membership-prod -o wide

# Recent events
kubectl get events -n membership-prod --sort-by='.lastTimestamp' | tail -20

# Application health
kubectl exec -n membership-prod deployment/membership-api -- \
  curl -s localhost:8081/api/actuator/health | jq .

# Environment check
kubectl exec -n membership-prod deployment/membership-api -- env | sort

2. Scenario 1: Application Not Starting

Symptoms

  • Pod status: CrashLoopBackOff or Error
  • Health endpoint returns no response

Diagnosis

# Check pod events
kubectl describe pod -n membership-prod -l app=membership-api

# Check startup logs
kubectl logs -n membership-prod -l app=membership-api --previous

# Check resource limits
kubectl top pod -n membership-prod

Common Causes and Fixes

A. Database connection failure

HikariPool-1 - Exception during pool initialization
  • Verify DB_URL, DB_NAME, DB_USERNAME, DB_PASSWORD in sealed secret
  • Check PostgreSQL is accepting connections: pg_isready -h <host> -p 5432
  • Check connection limits: SELECT count(*) FROM pg_stat_activity;
  • Verify network policy allows pod-to-database traffic

B. Flyway migration failure

FlywayException: Validate failed: Migrations have failed validation
  • Check current migration state: SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 5;
  • If a migration is marked as failed: fix the migration SQL, delete the failed row from flyway_schema_history, redeploy
  • Never modify already-applied migrations; create a new corrective migration
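To make the manual repair above less error-prone, print the cleanup statement and review it before running it through psql. The helper name is hypothetical; when the Flyway CLI is available, flyway repair performs the equivalent history cleanup.

```shell
# Hypothetical dry-run helper: print the history cleanup SQL for a failed
# migration so it can be reviewed before piping it into psql.
flyway_cleanup_sql() {  # flyway_cleanup_sql <version>, e.g. "12"
  printf "DELETE FROM flyway_schema_history WHERE version = '%s' AND success = false;\n" "$1"
}

# Example: review, then pipe into psql against the membership database:
# flyway_cleanup_sql 12 | psql -h <host> -U <user> -d <db>
```

The success = false guard ensures an already-applied migration row is never deleted by accident, in line with the rule above.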

C. Port conflict

Web server failed to start. Port 8081 was already in use.
  • Ensure SERVER_PORT matches container port in deployment manifest
  • Check if another service is bound to the same port in the pod

D. JWT key not found

FileNotFoundException: /keys/private.pem
  • Verify sealed secret contains JWT_PRIVATE_KEY and JWT_PUBLIC_KEY
  • Check volume mount path in deployment.yaml matches JWT_PRIVATE_KEY_PATH

3. Scenario 2: High API Latency

Symptoms

  • Alert: HighAPILatency (p95 > 2s for 5 minutes)
  • Users report slow responses

Diagnosis

# Check which endpoints are slow
# Grafana: API Performance dashboard > Slowest Endpoints panel

# Check DB connection pool
kubectl exec -n membership-prod deployment/membership-api -- \
  curl -s localhost:8081/api/actuator/metrics/hikaricp.connections.active | jq .

# Check GC pressure
kubectl exec -n membership-prod deployment/membership-api -- \
  curl -s localhost:8081/api/actuator/metrics/jvm.gc.pause | jq .

Common Causes and Fixes

A. Slow database queries

  • Enable slow query logging: log_min_duration_statement = 500 in PostgreSQL
  • Check for missing indexes: EXPLAIN ANALYZE <slow_query>;
  • Check table bloat: SELECT pg_size_pretty(pg_total_relation_size('<table>'));
  • Run VACUUM ANALYZE <table>; if needed

B. Connection pool exhaustion

  • Default max pool size: 10 (HikariCP)
  • If hikaricp_connections_pending > 5, increase the pool: spring.datasource.hikari.maximum-pool-size=20
  • Check for long-running transactions: SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '30 seconds';

C. GC pressure

  • If jvm.gc.pause p99 > 500ms, increase heap: -Xmx2g -Xms2g
  • Check for memory leaks: enable heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError)
  • Review the Grafana JVM dashboard for the memory trend
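The pool and GC checks above pipe actuator metrics through jq; when jq is not installed in the container image, a small grep-based fallback can extract the value. This is an illustrative sketch: metric_value is a hypothetical helper, and the example JSON mirrors the Spring Boot actuator metric response shape.

```shell
# Hypothetical fallback for pods without jq: extract the first numeric
# "value" field from a Spring Boot actuator metric response on stdin.
metric_value() {
  grep -o '"value":[0-9.E-]*' | head -n1 | cut -d: -f2
}

# Example usage against the pool metric from the diagnosis steps:
# kubectl exec -n membership-prod deployment/membership-api -- \
#   curl -s localhost:8081/api/actuator/metrics/hikaricp.connections.active | metric_value
```

The extracted number can then be compared against the thresholds above (pending connections > 5, GC pause > 500ms) in an alerting or triage script.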


4. Scenario 3: Payment Failures

Symptoms

  • Transactions stuck in PENDING or SUBMITTED status
  • Alert: Cash360 circuit breaker open
  • Billing cycle produces errors

Diagnosis

# Check circuit breaker state
kubectl exec -n membership-prod deployment/membership-api -- \
  curl -s localhost:8081/api/actuator/health | jq '.components.cash360'

# Check recent payment errors in logs
# Loki: {namespace="membership-prod"} | json | logger =~ ".*payment.*" | level = "ERROR"

# Check transaction status distribution
# SQL: SELECT status, count(*) FROM transaction GROUP BY status;

Common Causes and Fixes

A. Cash360 API unreachable

  • Check CASH360_API_URL and CASH360_API_KEY in the environment
  • Verify Cash360 health: curl -s https://www.my-factura.com/api/v1/health
  • If Cash360 is down: the circuit breaker opens automatically and transactions queue locally
  • When Cash360 recovers: the circuit breaker transitions half-open -> closed, and the polling service picks up stale transactions (every 15 minutes)

B. Webhook delivery failures

  • Verify the webhook endpoint is reachable from Cash360: https://api.membership-one.com/api/webhooks/cash360
  • Check HMAC validation: ensure the shared secret matches between Cash360 and Membership One
  • Check Ingress logs for 4xx/5xx responses on the webhook path
  • If webhooks are missed: the polling service compensates within 15 minutes
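The HMAC check above can be exercised locally. A minimal sketch, assuming the signature is a hex-encoded HMAC-SHA256 over the raw request body; the actual header name and encoding Cash360 uses must be confirmed against its documentation:

```shell
# Sketch only: assumes a hex-encoded HMAC-SHA256 over the raw body.
sign_payload() {  # sign_payload <secret> <payload>
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" -r | cut -d' ' -f1
}

verify_webhook() {  # verify_webhook <secret> <payload> <received-signature>
  [ "$(sign_payload "$1" "$2")" = "$3" ]
}
```

Recompute the signature from the captured request body and compare it with the one Cash360 sent: a mismatch points to a shared-secret mismatch between the two systems rather than a network problem.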

C. Billing cycle errors

  • Check billing cycle logs: search for BillingService in Loki
  • Common issue: contracts without valid bank accounts; these are logged and skipped
  • Verify idempotency: re-running billing for the same period must not create duplicates


5. Scenario 4: Authentication Issues

Symptoms

  • Users cannot log in
  • JWT validation errors in logs
  • Brute-force lockout

Diagnosis

# Check auth errors in logs
# Loki: {namespace="membership-prod"} | json | message =~ ".*auth.*|.*login.*|.*JWT.*"

# Check Redis (brute-force counters)
kubectl exec -n membership-prod deployment/membership-redis -- redis-cli KEYS "brute-force:*"

Common Causes and Fixes

A. JWT key mismatch

  • The public/private key pair must match across all instances
  • Regenerate if needed: openssl genrsa -out private.pem 2048 && openssl rsa -in private.pem -pubout -out public.pem
  • Update the sealed secret and restart the pods

B. Token expiry

  • Access tokens expire after 15 minutes; refresh tokens after 7 days
  • If users report frequent logouts: check clock skew between nodes (kubectl exec ... -- date)
  • Verify NTP sync on all nodes
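The clock skew mentioned above can be quantified by comparing epoch-second readings from the local machine and from a pod. A hypothetical helper:

```shell
# Hypothetical helper: absolute difference (seconds) between two
# epoch-second readings taken from different machines.
clock_skew() {  # clock_skew <epoch_a> <epoch_b>
  local a="$1" b="$2"
  if [ "$a" -ge "$b" ]; then echo $(( a - b )); else echo $(( b - a )); fi
}

# Example usage (workstation clock vs. pod clock):
# clock_skew "$(date +%s)" \
#   "$(kubectl exec -n membership-prod deployment/membership-api -- date +%s)"
```

Skew of more than a few seconds can make freshly issued 15-minute tokens appear expired or not yet valid; fix NTP sync rather than widening token lifetimes.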

C. Brute-force lockout

  • After 5 failed attempts, the account is locked for 15 minutes
  • To unlock manually: delete the Redis key brute-force:login:<email>
  • To clear all lockouts: redis-cli KEYS "brute-force:*" | xargs redis-cli DEL (KEYS blocks Redis; prefer SCAN on large keyspaces)


6. Scenario 5: Email Delivery Failures

Symptoms

  • Verification emails not received
  • Notification queue growing in RabbitMQ

Diagnosis

# Check RabbitMQ queue depth
kubectl exec -n membership-prod deployment/membership-rabbitmq -- \
  rabbitmqctl list_queues name messages consumers

# Check SMTP errors in logs
# Loki: {namespace="membership-prod"} | json | logger =~ ".*EmailSender.*" | level = "ERROR"

Common Causes and Fixes

A. SMTP server unreachable

  • Verify SMTP_HOST, SMTP_PORT, SMTP_USERNAME, SMTP_PASSWORD
  • Test connectivity: telnet <SMTP_HOST> 587
  • Check that firewall/network policies allow outbound SMTP

B. Template rendering errors

  • Check Thymeleaf template syntax in CommunicationTemplate entities
  • Missing template variables cause rendering exceptions
  • Verify the template locale fallback: entity locale -> en -> default template

C. Rate limiting

  • Bulk messages are rate-limited to 100/minute per entity (Bucket4j)
  • If the queue grows: this is expected behavior; messages drain at the rate limit
  • Check pending messages: SELECT count(*) FROM communication WHERE status = 'PENDING';
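Given the 100/minute limit above, the expected drain time for a backlog can be estimated before deciding whether to escalate. A rough sketch (whole minutes, rounded up; the helper name is illustrative):

```shell
# Estimate minutes to drain a PENDING backlog at the Bucket4j limit of
# 100 messages/minute per entity. Rounds up to whole minutes.
drain_minutes() {  # drain_minutes <pending_count>
  local pending="$1" rate=100
  echo $(( (pending + rate - 1) / rate ))
}
```

Feed it the count from the SQL check above; if the estimated drain time exceeds the notification SLA, that is the point to involve L2.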


7. Scenario 6: Check-in Failures

Symptoms

  • Members cannot check in via QR/NFC/BLE
  • Check-in endpoint returns 403 or 422

Diagnosis

# Check recent check-in errors
# Loki: {namespace="membership-prod"} | json | logger =~ ".*checkin.*" | level = "ERROR"

# Check access zone configuration
# SQL: SELECT * FROM access_zone WHERE id_entity = <entity_id>;

Common Causes and Fixes

A. Invalid QR code

  • QR codes are immutable once generated; if a member's credential changes, a new QR code is issued
  • Verify the QR payload matches the credential in the database
  • Check QR code expiry if time-limited tokens are used

B. Access zone misconfiguration

  • Verify access rules exist for the member's contract type and the target zone
  • Check day-of-week and time-of-day restrictions in AccessRule
  • Verify the zone is active: SELECT active FROM access_zone WHERE id = <zone_id>;

C. Anti-passback violation

  • A member must check out before checking in again at the same zone
  • Check the last check-ins: SELECT * FROM check_in WHERE id_member = <id> ORDER BY check_in_time DESC LIMIT 5;
  • Override: manually create a checkout record if needed
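The anti-passback rule above reduces to a single condition on the member's most recent event at the zone. A sketch of the decision, not the platform's actual implementation; the IN/OUT/NONE encoding is illustrative:

```shell
# Sketch of the anti-passback rule: a new check-in at a zone is allowed
# only if the member's last event there was a checkout (or no event yet).
antipassback_allows() {  # antipassback_allows <last_event: IN|OUT|NONE>
  [ "$1" != "IN" ]
}
```

When triaging, map the newest row from the check_in query above onto this condition: a trailing check-in without a matching checkout explains the rejection, and the manual checkout override clears it.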


8. Scenario 7: Billing Discrepancies

Symptoms

  • Members report incorrect invoices
  • Duplicate charges or missing invoices

Diagnosis

# Check billing logs for the specific member
# Loki: {namespace="membership-prod"} | json | memberId = "<member_id>" | logger =~ ".*billing.*"

# Check transaction history
# SQL: SELECT * FROM transaction WHERE id_member = <id> ORDER BY created_at DESC;

Common Causes and Fixes

A. Duplicate billing

  • The billing service uses idempotency keys: billing:{entityId}:{contractId}:{period}
  • If a duplicate exists: investigate the Redis idempotency cache expiry
  • Fix: create a storno (credit note) for the duplicate via /api/transactions/{id}/storno
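When investigating a suspected duplicate, reconstruct the exact idempotency key and inspect it in Redis. The key format is the one documented above; the helper name, the example IDs, and the TTL check are illustrative:

```shell
# Builds the documented idempotency key billing:{entityId}:{contractId}:{period}.
billing_key() {  # billing_key <entityId> <contractId> <period>
  printf 'billing:%s:%s:%s' "$1" "$2" "$3"
}

# Example: check the key's remaining TTL in Redis (illustrative IDs):
# kubectl exec -n membership-prod deployment/membership-redis -- \
#   redis-cli TTL "$(billing_key 42 1001 2024-06)"
```

TTL -2 means the key has already expired, which would explain how a re-run slipped past the idempotency check.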

B. Missing invoice

  • Check whether the contract was active during the billing period
  • Check whether the billing cycle ran successfully: search the logs for BillingService at the expected time
  • Manual trigger: POST /api/billing/trigger with the specific entity ID

C. Storno (cancellation)

  • Full storno: creates a credit note reversing the entire invoice
  • Partial storno: creates a credit note for a specific amount
  • Both automatically create reversal accounting entries


9. Scenario 8: Import Failures

Symptoms

  • CSV import shows validation errors
  • Import job stuck in PROCESSING status

Diagnosis

# Check import job status
# SQL: SELECT * FROM import_job WHERE id = <job_id>;

# Check validation errors
# SQL: SELECT error_details FROM import_job WHERE id = <job_id>;

Common Causes and Fixes

A. CSV format issues

  • Expected: UTF-8 encoding, comma or semicolon delimiter
  • Check for a BOM marker: file -bi <file.csv> (should show charset=utf-8)
  • Validate that the header row matches the import template mapping

B. Validation errors

  • IBAN checksum failure: verify the IBAN format (ISO 13616, Mod 97)
  • Duplicate email: a member with the same email already exists
  • Missing required fields: check the template mapping for required columns
  • Fix: correct the CSV, re-upload, and use dry-run mode first
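Failed IBAN rows can be pre-checked locally before re-uploading the CSV. A minimal Mod 97 sketch (ISO 13616), assuming bash 4+; this mirrors the standard algorithm, not the platform's actual validator:

```shell
# Mod 97 IBAN check (ISO 13616): move the first four characters to the
# end, map letters A..Z to 10..35, and verify the number mod 97 == 1.
iban_check() {  # iban_check "<iban, spaces allowed>"
  local iban="${1//[[:space:]]/}" digits="" c i
  iban="${iban^^}"
  local rearranged="${iban:4}${iban:0:4}"
  for (( i = 0; i < ${#rearranged}; i++ )); do
    c="${rearranged:i:1}"
    if [[ "$c" == [A-Z] ]]; then
      digits+=$(( $(printf '%d' "'$c") - 55 ))  # A=10 ... Z=35
    else
      digits+="$c"
    fi
  done
  # Compute mod 97 in 7-digit chunks to stay within shell integer range.
  local rem=0 chunk
  while [ -n "$digits" ]; do
    chunk="${digits:0:7}"
    digits="${digits:7}"
    rem=$(( 10#$rem$chunk % 97 ))
  done
  [ "$rem" -eq 1 ]
}
```

Running this over the IBAN column of a rejected CSV separates genuine checksum failures from rows rejected for other reasons (duplicates, missing fields).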


10. Scenario 9: Out of Memory

Symptoms

  • Alert: OOMKilled
  • Pod restarts with exit code 137

Diagnosis

# Check pod restart reason
kubectl describe pod -n membership-prod -l app=membership-api | grep -A 5 "Last State"

# Check memory usage trend in Grafana JVM dashboard
# Check heap dump if configured

Common Causes and Fixes

A. Insufficient heap

  • Default: -Xmx768m (dev), -Xmx2g (prod)
  • The container memory limit must be ~30% higher than the JVM max heap (for metaspace, threads, and native memory)
  • Increase: update resources.limits.memory in the Helm values and JAVA_OPTS in the ConfigMap
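The ~30% headroom rule above can be turned into a quick sizing calculation. A hypothetical helper; the exact overhead factor is a rule of thumb, not a guarantee:

```shell
# Rule-of-thumb container limit: JVM max heap plus ~30% headroom for
# metaspace, thread stacks, and native allocations. Values in MiB.
container_limit_mi() {  # container_limit_mi <heap_mib>
  local heap_mi="$1"
  echo $(( heap_mi * 13 / 10 ))
}
```

For example, container_limit_mi 2048 suggests setting resources.limits.memory to roughly 2662Mi when running with -Xmx2g; verify against the Grafana JVM dashboard before committing the value.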

B. Memory leak

  • Enable: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof
  • Analyze with Eclipse MAT or VisualVM
  • Common causes: unclosed streams, caches growing unbounded, large result sets without pagination


11. Scenario 10: Database Issues

Symptoms

  • Slow queries, connection timeouts, table locks

Diagnosis

# Active queries
psql -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"

# Table locks
psql -c "SELECT * FROM pg_locks WHERE NOT granted;"

# Table sizes
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

Common Causes and Fixes

A. Slow queries

  • Add missing indexes based on EXPLAIN ANALYZE output
  • Check for sequential scans on large tables: pg_stat_user_tables.seq_scan
  • Consider partial indexes for status-filtered queries

B. Lock contention

  • Identify waiting sessions: SELECT * FROM pg_stat_activity WHERE pid IN (SELECT pid FROM pg_locks WHERE NOT granted);
  • Find the sessions blocking them: SELECT pid, pg_blocking_pids(pid) FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;
  • Kill the blocking session if safe: SELECT pg_terminate_backend(<pid>);
  • Review application code for long-running transactions

C. Autovacuum lag

  • Check dead tuples: SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;
  • Manual vacuum: VACUUM ANALYZE <table>;
  • Tune autovacuum for high-write tables (member, transaction, check_in)


12. Escalation Matrix

  • L1: On-call engineer; first response; response time < 15 minutes (critical), < 1 hour (warning)
  • L2: Backend developer; when L1 cannot resolve or the issue is code-level; response time < 1 hour
  • L3: Engineering lead; architecture decisions, rollback authorization; response time < 2 hours
  • L4: CTO / Management; data loss, prolonged outage, security breach; respond immediately

Emergency Contacts

  • Primary On-Call: (see rotation); ops@membership-one.com
  • Engineering Lead: (TBD); dev@membership-one.com
  • Hetzner Support: support@hetzner.com
  • Cash360 Support: support@my-factura.com

Incident Classification

  • SEV-1: service down, all users affected (e.g., production pods in a crash loop); RTO < 1h
  • SEV-2: major feature broken (e.g., payments not processing); RTO < 4h
  • SEV-3: minor feature degraded (e.g., slow search, missing emails); RTO < 24h
  • SEV-4: cosmetic / non-urgent (e.g., a dashboard display issue); schedule for the next sprint