Support Runbook — Membership One
1. Overview
This runbook covers common operational scenarios, troubleshooting steps, and escalation procedures for the Membership One platform. It is intended for operations engineers, support staff, and on-call personnel.
Log Locations
| Source | Location | Access |
|---|---|---|
| Application logs | stdout (captured by Loki) | Grafana Explore |
| Kubernetes events | kubectl get events -n membership-prod | kubectl |
| PostgreSQL logs | Hetzner Managed DB console | Hetzner Cloud |
| Nginx Ingress logs | kubectl logs -n ingress-nginx deployment/nginx-ingress-controller | kubectl |
| Flyway history | flyway_schema_history table | psql |
Quick Diagnostics
# Pod status
kubectl get pods -n membership-prod -o wide
# Recent events
kubectl get events -n membership-prod --sort-by='.lastTimestamp' | tail -20
# Application health
kubectl exec -n membership-prod deployment/membership-api -- \
curl -s localhost:8081/api/actuator/health | jq .
# Environment check
kubectl exec -n membership-prod deployment/membership-api -- env | sort
2. Scenario 1: Application Not Starting
Symptoms
- Pod status: CrashLoopBackOff or Error
- Health endpoint returns no response
Diagnosis
# Check pod events
kubectl describe pod -n membership-prod -l app=membership-api
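# Check startup logs of the previous (crashed) container
# With a label selector, kubectl logs shows only the last 10 lines unless --tail is set
kubectl logs -n membership-prod -l app=membership-api --previous --tail=200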
# Check resource usage (compare with the limits shown by kubectl describe)
kubectl top pod -n membership-prod
Common Causes and Fixes
A. Database connection failure
HikariPool-1 - Exception during pool initialization
- Verify DB_URL, DB_NAME, DB_USERNAME, DB_PASSWORD in sealed secret
- Check PostgreSQL is accepting connections: pg_isready -h <host> -p 5432
- Check connection limits: SELECT count(*) FROM pg_stat_activity;
- Verify network policy allows pod-to-database traffic
B. Flyway migration failure
FlywayException: Validate failed: Migrations have failed validation
- Check current migration state: SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 5;
- If a migration is marked as failed: fix the migration SQL, delete the failed row from flyway_schema_history, redeploy
- Never modify already-applied migrations; create a new corrective migration
C. Port conflict
Web server failed to start. Port 8081 was already in use.
- Ensure SERVER_PORT matches container port in deployment manifest
- Check if another service is bound to the same port in the pod
D. JWT key not found
FileNotFoundException: /keys/private.pem
- Verify sealed secret contains JWT_PRIVATE_KEY and JWT_PUBLIC_KEY
- Check volume mount path in deployment.yaml matches JWT_PRIVATE_KEY_PATH
3. Scenario 2: High API Latency
Symptoms
- Alert: HighAPILatency (p95 > 2s for 5 minutes)
- Users report slow responses
Diagnosis
# Check which endpoints are slow
# Grafana: API Performance dashboard > Slowest Endpoints panel
# Check DB connection pool
kubectl exec -n membership-prod deployment/membership-api -- \
curl -s localhost:8081/api/actuator/metrics/hikaricp.connections.active | jq .
# Check GC pressure
kubectl exec -n membership-prod deployment/membership-api -- \
curl -s localhost:8081/api/actuator/metrics/jvm.gc.pause | jq .
Common Causes and Fixes
A. Slow database queries
- Enable slow query logging: log_min_duration_statement = 500 in PostgreSQL
- Check for missing indexes: EXPLAIN ANALYZE <slow_query>;
- Check table size (a rough indicator of bloat): SELECT pg_size_pretty(pg_total_relation_size('<table>'));
- Run VACUUM ANALYZE <table>; if needed
B. Connection pool exhaustion
- Default max pool size: 10 (HikariCP)
- If hikaricp_connections_pending > 5, increase pool: spring.datasource.hikari.maximum-pool-size=20
- Check for long-running transactions: SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '30 seconds';
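The pool check can be scripted. The following is a minimal sketch, assuming the Actuator metrics endpoint returns the usual Micrometer JSON shape (a measurements list with statistic and value entries). The function names and the sample payload are hypothetical; the thresholds are the ones listed above.

```python
import json

PENDING_THRESHOLD = 5   # from this runbook: pending > 5 means the pool is too small
DEFAULT_MAX_POOL = 10   # HikariCP default maximum pool size


def metric_value(payload: str, statistic: str = "VALUE") -> float:
    """Extract one measurement from an Actuator /metrics/<name> JSON response."""
    doc = json.loads(payload)
    for measurement in doc.get("measurements", []):
        if measurement.get("statistic") == statistic:
            return float(measurement["value"])
    raise KeyError(f"statistic {statistic!r} not found in {doc.get('name')}")


def pool_verdict(active: float, pending: float, max_pool: int = DEFAULT_MAX_POOL) -> str:
    """Classify the pool state using the runbook thresholds."""
    if pending > PENDING_THRESHOLD:
        return "exhausted: increase maximum-pool-size or find long-running transactions"
    if active >= max_pool:
        return "saturated: all connections in use, watch pending"
    return "ok"


# Hypothetical sample response for hikaricp.connections.pending
sample = ('{"name": "hikaricp.connections.pending", '
          '"measurements": [{"statistic": "VALUE", "value": 7.0}]}')
print(pool_verdict(active=10, pending=metric_value(sample)))
```

Feed it the output of the two curl calls from the Diagnosis block (connections.active and connections.pending) instead of the sample string.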
C. GC pressure
- If jvm.gc.pause p99 > 500ms, increase heap: -Xmx2g -Xms2g
- Check for memory leaks: enable heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError)
- Review Grafana JVM dashboard for memory trend
4. Scenario 3: Payment Failures
Symptoms
- Transactions stuck in PENDING or SUBMITTED status
- Alert: Cash360 circuit breaker open
- Billing cycle produces errors
Diagnosis
# Check circuit breaker state
kubectl exec -n membership-prod deployment/membership-api -- \
curl -s localhost:8081/api/actuator/health | jq '.components.cash360'
# Check recent payment errors in logs
# Loki: {namespace="membership-prod"} | json | logger =~ ".*payment.*" | level = "ERROR"
# Check transaction status distribution
# SQL: SELECT status, count(*) FROM transaction GROUP BY status;
Common Causes and Fixes
A. Cash360 API unreachable
- Check CASH360_API_URL and CASH360_API_KEY in environment
- Verify Cash360 health: curl -s https://www.my-factura.com/api/v1/health
- If Cash360 is down: circuit breaker will open automatically; transactions queue locally
- When Cash360 recovers: circuit breaker transitions half-open -> closed; polling service picks up stale transactions (every 15 minutes)
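For orientation, the closed -> open -> half-open -> closed cycle described above can be sketched as a small state machine. This is an illustration only, not the platform's implementation: the class, the failure threshold, and the wait time are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3      # hypothetical: consecutive failures before opening
    open_seconds: float = 60.0      # hypothetical: wait before a trial call is allowed
    state: str = "CLOSED"
    failures: int = 0
    opened_at: float = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a call to the remote API may be attempted at time `now`."""
        if self.state == "OPEN" and now - self.opened_at >= self.open_seconds:
            self.state = "HALF_OPEN"   # let one trial call through
        return self.state != "OPEN"

    def record(self, success: bool, now: float) -> None:
        """Update the state after a call finishes."""
        if success:
            self.state, self.failures = "CLOSED", 0
            return
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", now


cb = CircuitBreaker()
for t in (0, 1, 2):
    cb.record(success=False, now=t)
print(cb.state)            # OPEN
print(cb.allow(now=30))    # False: still inside the wait window
print(cb.allow(now=62))    # True: half-open, one trial call allowed
cb.record(success=True, now=62)
print(cb.state)            # CLOSED
```

A failed trial call in the half-open state reopens the breaker immediately, which is why a flapping Cash360 endpoint can keep the alert firing after the API appears to be back.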
B. Webhook delivery failures
- Verify webhook endpoint is reachable from Cash360: https://api.membership-one.com/api/webhooks/cash360
- Check HMAC validation: ensure shared secret matches between Cash360 and Membership One
- Check Ingress logs for 4xx/5xx on webhook path
- If webhooks are missed: polling service compensates within 15 minutes
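To rule out a secret mismatch, the signature can be recomputed by hand. This is a sketch under stated assumptions: it uses HMAC-SHA256 over the raw request body with a hex-encoded signature, which this runbook does not confirm. Check the Cash360 integration for the actual algorithm, encoding, and header name. The secret and payload are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare expected and received signatures in constant time."""
    return hmac.compare_digest(sign(secret, body), received_signature)


secret = b"example-shared-secret"                       # hypothetical value
body = b'{"transactionId": "tx-1", "status": "PAID"}'   # hypothetical payload
good = sign(secret, body)

print(is_valid(secret, body, good))            # True
print(is_valid(b"other-secret", body, good))   # False: secrets differ
```

The signature covers the exact bytes received, so a proxy or framework that re-serializes the JSON before validation produces the same symptom as a wrong secret.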
C. Billing cycle errors
- Check billing cycle logs: search for BillingService in Loki
- Common issue: contracts without valid bank accounts; these are logged and skipped
- Verify idempotency: re-running billing for the same period should not create duplicates
5. Scenario 4: Authentication Issues
Symptoms
- Users cannot log in
- JWT validation errors in logs
- Brute-force lockout
Diagnosis
# Check auth errors in logs
# Loki: {namespace="membership-prod"} | json | message =~ ".*auth.*|.*login.*|.*JWT.*"
# Check Redis (brute-force counters); --scan iterates incrementally and avoids blocking Redis the way KEYS can
kubectl exec -n membership-prod deployment/membership-redis -- redis-cli --scan --pattern "brute-force:*"
Common Causes and Fixes
A. JWT key mismatch
- Public/private key pair must match between instances
- Regenerate if needed (tokens signed with the old key stop validating, so users must log in again): openssl genrsa -out private.pem 2048 && openssl rsa -in private.pem -pubout -out public.pem
- Update sealed secret and restart pods
B. Token expiry
- Access tokens expire after 15 minutes; refresh tokens after 7 days
- If users report frequent logouts: check clock skew between nodes (kubectl exec ... -- date)
- Verify NTP sync on all nodes
C. Brute-force lockout
- After 5 failed attempts, account is locked for 15 minutes
- To unlock manually: delete the Redis key brute-force:login:<email>
- To clear all lockouts: redis-cli --scan --pattern "brute-force:*" | xargs -r redis-cli DEL (xargs -r skips the DEL call when no keys match)
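The lockout rule can be modeled in a few lines to reason about what a user sees. This is an in-memory stand-in for the Redis counters, not the application code. It assumes each failure restarts the 15-minute window, which the runbook does not state; the class and method names are hypothetical.

```python
MAX_ATTEMPTS = 5          # from this runbook
LOCK_SECONDS = 15 * 60    # from this runbook


class LockoutTracker:
    """In-memory stand-in for Redis keys shaped like brute-force:login:<email>."""

    def __init__(self):
        self._counters = {}   # key -> (failure count, expiry timestamp)

    @staticmethod
    def key(email):
        return f"brute-force:login:{email}"

    def record_failure(self, email, now):
        count, expiry = self._counters.get(self.key(email), (0, 0.0))
        if now >= expiry:
            count = 0   # previous window expired, start counting again
        self._counters[self.key(email)] = (count + 1, now + LOCK_SECONDS)

    def is_locked(self, email, now):
        count, expiry = self._counters.get(self.key(email), (0, 0.0))
        return count >= MAX_ATTEMPTS and now < expiry

    def unlock(self, email):
        """Equivalent of deleting the Redis key by hand."""
        self._counters.pop(self.key(email), None)


tracker = LockoutTracker()
for attempt in range(5):
    tracker.record_failure("user@example.com", now=attempt)
print(tracker.is_locked("user@example.com", now=10))     # True
print(tracker.is_locked("user@example.com", now=1000))   # False: lock expired
```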
6. Scenario 5: Email Delivery Failures
Symptoms
- Verification emails not received
- Notification queue growing in RabbitMQ
Diagnosis
# Check RabbitMQ queue depth
kubectl exec -n membership-prod deployment/membership-rabbitmq -- \
rabbitmqctl list_queues name messages consumers
# Check SMTP errors in logs
# Loki: {namespace="membership-prod"} | json | logger =~ ".*EmailSender.*" | level = "ERROR"
Common Causes and Fixes
A. SMTP server unreachable
- Verify SMTP_HOST, SMTP_PORT, SMTP_USERNAME, SMTP_PASSWORD
- Test connectivity: telnet <SMTP_HOST> 587
- Check firewall/network policies allow outbound SMTP
B. Template rendering errors
- Check Thymeleaf template syntax in CommunicationTemplate entities
- Missing template variables cause rendering exceptions
- Verify template locale fallback: entity locale -> en -> default template
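The fallback chain is easier to verify with a concrete lookup. This is a minimal sketch, assuming templates can be addressed by a (template key, locale) pair; the dictionary layout, the function name, and the sample templates are hypothetical.

```python
from typing import Optional


def resolve_template(templates: dict, template_key: str, entity_locale: str) -> Optional[str]:
    """Walk the fallback chain: entity locale -> en -> default template."""
    for locale in (entity_locale, "en", "default"):
        body = templates.get((template_key, locale))
        if body is not None:
            return body
    return None   # nothing configured: rendering would fail for this template


templates = {
    ("welcome", "en"): "Welcome to the club!",
    ("welcome", "default"): "Welcome!",
    ("invoice", "default"): "Your invoice is attached.",
}

print(resolve_template(templates, "welcome", "de"))   # falls back to the en template
print(resolve_template(templates, "invoice", "de"))   # falls back to the default template
print(resolve_template(templates, "reminder", "de"))  # None
```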
C. Rate limiting
- Bulk messages are rate-limited to 100/minute per entity (Bucket4j)
- If queue grows: this is expected behavior; messages will drain at the rate limit
- Check: SELECT count(*) FROM communication WHERE status = 'PENDING';
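To judge whether a growing queue is still within expected behavior, estimate the drain time from the pending count. This is simple arithmetic on the 100/minute limit above and assumes no new messages arrive for that entity while the queue drains.

```python
import math

RATE_PER_MINUTE = 100  # bulk rate limit per entity, from this runbook


def drain_minutes(pending: int, rate_per_minute: int = RATE_PER_MINUTE) -> int:
    """Minutes needed to send `pending` messages at the rate limit."""
    return math.ceil(pending / rate_per_minute)


print(drain_minutes(2500))  # 25
print(drain_minutes(101))   # 2
```

If the pending count stays above the estimate for much longer than the computed time, look at the SMTP and template causes instead.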
7. Scenario 6: Check-in Failures
Symptoms
- Members cannot check in via QR/NFC/BLE
- Check-in endpoint returns 403 or 422
Diagnosis
# Check recent check-in errors
# Loki: {namespace="membership-prod"} | json | logger =~ ".*checkin.*" | level = "ERROR"
# Check access zone configuration
# SQL: SELECT * FROM access_zone WHERE id_entity = <entity_id>;
Common Causes and Fixes
A. Invalid QR code
- QR codes are immutable once generated; if a member's credential changes, a new QR is issued
- Verify QR payload matches credential in database
- Check QR code expiry if time-limited tokens are used
B. Access zone misconfiguration
- Verify access rules exist for the member's contract type and the target zone
- Check day-of-week and time-of-day restrictions in AccessRule
- Verify zone is active: SELECT active FROM access_zone WHERE id = <zone_id>;
C. Anti-passback violation
- Member must check out before checking in again at the same zone
- Check last check-in: SELECT * FROM check_in WHERE id_member = <id> ORDER BY check_in_time DESC LIMIT 5;
- Override: manually create a checkout record if needed
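The anti-passback rule amounts to checking whether the latest visit to the zone is still open. This is a sketch under assumptions: it models a checkout as a check_out_time on the visit record, a field name that is hypothetical and may not match the check_in table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckIn:
    zone_id: int
    check_in_time: float
    check_out_time: Optional[float] = None   # None means the member is still inside


def violates_anti_passback(history: list, zone_id: int) -> bool:
    """True if the member's latest visit to `zone_id` has no checkout yet."""
    visits = [c for c in history if c.zone_id == zone_id]
    if not visits:
        return False
    latest = max(visits, key=lambda c: c.check_in_time)
    return latest.check_out_time is None


history = [CheckIn(zone_id=1, check_in_time=100.0, check_out_time=200.0),
           CheckIn(zone_id=1, check_in_time=300.0)]
print(violates_anti_passback(history, zone_id=1))   # True: last visit never checked out
print(violates_anti_passback(history, zone_id=2))   # False: no visits to this zone

history[-1].check_out_time = 400.0                  # the manual override
print(violates_anti_passback(history, zone_id=1))   # False
```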
8. Scenario 7: Billing Discrepancies
Symptoms
- Members report incorrect invoices
- Duplicate charges or missing invoices
Diagnosis
# Check billing logs for the specific member
# Loki: {namespace="membership-prod"} | json | memberId = "<member_id>" | logger =~ ".*billing.*"
# Check transaction history
# SQL: SELECT * FROM transaction WHERE id_member = <id> ORDER BY created_at DESC;
Common Causes and Fixes
A. Duplicate billing
- Billing service uses idempotency keys: billing:{entityId}:{contractId}:{period}
- If duplicate exists: investigate Redis idempotency cache expiry
- Fix: create a storno (credit note) for the duplicate via /api/transactions/{id}/storno
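The sketch below shows why an expired idempotency cache leads to duplicate invoices. It is an illustration, not the billing service's code: a Python set stands in for the Redis cache, and the function names, the IDs, and the period format 2024-05 are hypothetical.

```python
def billing_key(entity_id: int, contract_id: int, period: str) -> str:
    """Build the idempotency key in the format billing:{entityId}:{contractId}:{period}."""
    return f"billing:{entity_id}:{contract_id}:{period}"


def run_billing(contracts, period, seen: set) -> list:
    """Create one invoice per contract and period; skip keys already processed."""
    created = []
    for entity_id, contract_id in contracts:
        key = billing_key(entity_id, contract_id, period)
        if key in seen:
            continue          # already billed for this period
        seen.add(key)
        created.append(key)
    return created


seen = set()                                     # stand-in for the Redis idempotency cache
contracts = [(7, 101), (7, 102)]
print(run_billing(contracts, "2024-05", seen))   # two invoices created
print(run_billing(contracts, "2024-05", seen))   # [] on re-run: keys still cached
seen.clear()                                     # simulates cache expiry or a Redis flush
print(run_billing(contracts, "2024-05", seen))   # both invoices created again
```

When investigating a duplicate, compare the timestamps of the two invoices against the key's TTL and any Redis restart in between.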
B. Missing invoice
- Check if contract was active during the billing period
- Check if billing cycle ran successfully: search logs for BillingService at the expected time
- Manual trigger: POST /api/billing/trigger with the specific entity ID
C. Storno (cancellation)
- Full storno: creates a credit note reversing the entire invoice
- Partial storno: creates a credit note for a specific amount
- Both automatically create reversal accounting entries
9. Scenario 8: Import Failures
Symptoms
- CSV import shows validation errors
- Import job stuck in PROCESSING status
Diagnosis
# Check import job status
# SQL: SELECT * FROM import_job WHERE id = <job_id>;
# Check validation errors
# SQL: SELECT error_details FROM import_job WHERE id = <job_id>;
Common Causes and Fixes
A. CSV format issues
- Expected: UTF-8 encoding, comma or semicolon delimiter
- Check encoding: file -bi <file.csv> (should show charset=utf-8); to spot a BOM, run file <file.csv>, which reports "(with BOM)" when one is present
- Validate header row matches import template mapping
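The format checks above can be run on the uploaded bytes before the file reaches the importer. This is a minimal sketch with a hypothetical function name and hypothetical column names; it picks the delimiter by counting commas and semicolons in the header row, which is a simplification.

```python
import codecs
import csv


def inspect_csv(raw: bytes) -> dict:
    """Report BOM presence, UTF-8 validity, and the delimiter of the header row."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        text = raw.decode("utf-8-sig")   # utf-8-sig strips a leading BOM if present
        valid_utf8 = True
    except UnicodeDecodeError:
        text, valid_utf8 = "", False
    header = text.splitlines()[0] if text else ""
    # Pick whichever accepted delimiter appears more often in the header row
    delimiter = ";" if header.count(";") > header.count(",") else ","
    columns = next(csv.reader([header], delimiter=delimiter)) if header else []
    return {"has_bom": has_bom, "valid_utf8": valid_utf8,
            "delimiter": delimiter, "columns": columns}


sample = codecs.BOM_UTF8 + "email;iban;first_name\na@example.com;DE00;Ann\n".encode("utf-8")
print(inspect_csv(sample))
print(inspect_csv("name\nJ\u00fcrgen\n".encode("latin-1"))["valid_utf8"])   # False
```

Compare the returned columns with the import template mapping to catch header mismatches early.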
B. Validation errors
- IBAN checksum failure: verify IBAN format (ISO 13616, Mod 97)
- Duplicate email: member with same email already exists
- Missing required fields: check template mapping for required columns
- Fix: correct the CSV, re-upload, use dry-run mode first
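To reproduce an IBAN checksum failure outside the importer, the Mod 97 check can be run directly. The sketch below covers the generic ISO 13616 structure and checksum only. It does not validate country-specific lengths, which the importer may or may not enforce.

```python
def iban_is_valid(iban: str) -> bool:
    """Validate an IBAN with the ISO 13616 Mod 97 check."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False   # must start with country code and two check digits
    rearranged = s[4:] + s[:4]                     # move country code and check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)   # A=10 ... Z=35
    return int(digits) % 97 == 1


print(iban_is_valid("DE89 3704 0044 0532 0130 00"))   # True
print(iban_is_valid("DE89 3704 0044 0532 0130 01"))   # False: one digit changed
```

Spaces are stripped before checking, so a failure on a visually correct IBAN usually points to a swapped or mistyped digit rather than formatting.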
10. Scenario 9: Out of Memory
Symptoms
- Alert: OOMKilled
- Pod restarts with exit code 137
Diagnosis
# Check pod restart reason
kubectl describe pod -n membership-prod -l app=membership-api | grep -A 5 "Last State"
# Check memory usage trend in Grafana JVM dashboard
# Check heap dump if configured
Common Causes and Fixes
A. Insufficient heap
- Default: -Xmx768m (dev), -Xmx2g (prod)
- Container memory limit must be ~30% higher than JVM max heap (for metaspace, threads, native)
- Increase: update resources.limits.memory in Helm values and JAVA_OPTS in ConfigMap
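The 30% rule translates into a concrete container limit as follows. This is a sketch with hypothetical function names; it supports only the m and g suffixes used in this runbook and rounds up to whole MiB.

```python
import math
import re

UNITS_MIB = {"m": 1, "g": 1024}   # JVM -Xmx suffixes, expressed in MiB


def xmx_to_mib(xmx: str) -> int:
    """Parse a flag such as -Xmx2g or -Xmx768m into MiB."""
    match = re.fullmatch(r"-Xmx(\d+)([mMgG])", xmx)
    if match is None:
        raise ValueError(f"unsupported heap flag: {xmx}")
    return int(match.group(1)) * UNITS_MIB[match.group(2).lower()]


def container_limit_mib(xmx: str, headroom: float = 0.30) -> int:
    """Heap plus ~30% headroom for metaspace, thread stacks, and native memory."""
    return math.ceil(xmx_to_mib(xmx) * (1 + headroom))


print(container_limit_mib("-Xmx2g"))     # 2663
print(container_limit_mib("-Xmx768m"))   # 999
```

For the production default of -Xmx2g this gives a limit of roughly 2.6 GiB; a limit at or below the heap size leads to OOMKilled restarts before the JVM ever reports an OutOfMemoryError.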
B. Memory leak
- Enable: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof
- Analyze with Eclipse MAT or VisualVM
- Common causes: unclosed streams, cached collections growing unbounded, large result sets without pagination
11. Scenario 10: Database Issues
Symptoms
- Slow queries, connection timeouts, table locks
Diagnosis
# Active queries
psql -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"
# Table locks
psql -c "SELECT * FROM pg_locks WHERE NOT granted;"
# Table sizes
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
Common Causes and Fixes
A. Slow queries
- Add missing indexes based on EXPLAIN ANALYZE output
- Check for sequential scans on large tables: pg_stat_user_tables.seq_scan
- Consider partial indexes for status-filtered queries
B. Lock contention
- Identify blocked sessions and the sessions blocking them (PostgreSQL 9.6+): SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;
- Kill blocking session if safe: SELECT pg_terminate_backend(<pid>);
- Review application code for long-running transactions
C. Autovacuum lag
- Check dead tuples: SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;
- Manual vacuum: VACUUM ANALYZE <table>;
- Tune autovacuum for high-write tables (member, transaction, check_in)
12. Escalation Matrix
| Level | Who | When | Response Time |
|---|---|---|---|
| L1 | On-call engineer | First response | < 15 minutes (critical), < 1 hour (warning) |
| L2 | Backend developer | L1 cannot resolve, code-level issue | < 1 hour |
| L3 | Engineering lead | Architecture decisions, rollback authorization | < 2 hours |
| L4 | CTO / Management | Data loss, prolonged outage, security breach | Immediately |
Emergency Contacts
| Role | Name | Phone | Email |
|---|---|---|---|
| Primary On-Call | (see rotation) | — | ops@membership-one.com |
| Engineering Lead | (TBD) | — | dev@membership-one.com |
| Hetzner Support | — | — | support@hetzner.com |
| Cash360 Support | — | — | support@my-factura.com |
Incident Classification
| Severity | Description | Example | SLA |
|---|---|---|---|
| SEV-1 | Service down, all users affected | Production pods crash loop | RTO < 1h |
| SEV-2 | Major feature broken | Payments not processing | RTO < 4h |
| SEV-3 | Minor feature degraded | Slow search, missing emails | RTO < 24h |
| SEV-4 | Cosmetic / non-urgent | Dashboard display issue | Next sprint |