IT Operations Guide — Membership One

1. Overview

This guide covers infrastructure management, backup procedures, scaling, security hardening, and disaster recovery for the Membership One platform hosted on Hetzner Cloud.

2. Infrastructure Architecture

Hetzner Cloud Components

                    Internet
                       |
              [Cloudflare DNS/WAF]
                       |
              [Hetzner Load Balancer]
                   LB11 (EUR 6/mo)
                       |
        +------+-------+-------+------+
        |      |       |       |      |
    [App-1] [App-2] [App-3] [Infra-1] [Infra-2]
     CX32    CX32    CX32    CX22     CX22
      |       |       |       |        |
      +-------+-------+       +--------+
              |                    |
    Membership Pods         Monitoring Stack
    (API + Web)            (Prometheus, Grafana,
                            Loki, Alertmanager,
                            Icinga)
              |
    +----+----+----+----+
    |    |         |    |
  [PG]  [Redis] [RMQ] [MinIO]
  CPX21  CX11   CX11  Object
                       Storage

Monthly Cost Estimate

Resource             Type                   Cost/mo
3x App nodes         CX32 (4 vCPU, 8 GB)    3 x EUR 16 = EUR 48
2x Infra nodes       CX22 (2 vCPU, 4 GB)    2 x EUR 6 = EUR 12
Managed PostgreSQL   CPX21                  EUR 15
Load Balancer        LB11                   EUR 6
Object Storage       100 GB                 EUR 3
Floating IPs         2x                     EUR 5
Snapshots/Backups    --                     EUR 5
Total                                       ~EUR 94/mo

Network Configuration

Service              Port   Protocol  Access
API (Ingress)        443    HTTPS     Public
PostgreSQL           5432   TCP       Internal (K8s network)
Redis                6379   TCP       Internal
RabbitMQ             5672   AMQP      Internal
RabbitMQ Management  15672  HTTP      Internal
MinIO                9000   HTTP      Internal
Prometheus           9090   HTTP      Internal
Grafana              3000   HTTP      Internal (+ Ingress)

3. Backup Strategy

Overview

Data Source          Method                      Schedule         Retention       Storage
PostgreSQL           pg_dump + Hetzner PITR      Daily 02:00 UTC  30 days         Hetzner Object Storage
PostgreSQL WAL       Continuous archiving        Continuous       7 days          Hetzner Object Storage
Kubernetes state     Velero                      Daily 03:00 UTC  14 days         Hetzner Object Storage
MinIO documents      Cross-location replication  Continuous       Same as source  Hetzner Object Storage (secondary location)
Sealed Secrets keys  Manual export               After rotation   Indefinite      Vaultwarden
Helm values          Git (GitLab)                On change        Indefinite      GitLab

PostgreSQL Backup

# Manual backup
kubectl exec -n membership-prod deployment/membership-db-backup -- \
  pg_dump -h <DB_HOST> -U membership -Fc membership > backup-$(date +%Y%m%d).dump

# Restore from backup
pg_restore -h <DB_HOST> -U membership -d membership -c backup-20260223.dump

# Point-in-time recovery (Hetzner managed)
# Use Hetzner Cloud Console > Databases > Restore > Select timestamp

Velero Backup

# Install Velero
velero install --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket membership-backups \
  --secret-file ./credentials-velero \
  --backup-location-config s3Url=https://s3.hetzner.cloud,region=eu-central,s3ForcePathStyle=true \
  --use-volume-snapshots=false
# Note: the AWS plugin handles any S3-compatible store; pin it to a current release.

# Manual backup
velero backup create membership-manual-$(date +%Y%m%d) \
  --include-namespaces membership-prod

# Schedule daily backup
velero schedule create membership-daily \
  --schedule="0 3 * * *" \
  --include-namespaces membership-prod \
  --ttl 336h

# Restore
velero restore create --from-backup membership-manual-20260223

Backup Verification

Run monthly restore tests:

  1. Restore PostgreSQL backup to a test database
  2. Verify row counts match production
  3. Run application health check against restored data
  4. Document results in backup verification log
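Step 2 above can be scripted. A minimal sketch, assuming per-table counts are first dumped to text files (the psql invocation in the comments uses placeholder hosts; inline sample data keeps the comparison logic self-contained here):

```shell
# In practice, produce each counts file from the respective database, e.g.:
#   psql -h <PROD_HOST> -U membership -d membership -At \
#     -c "SELECT relname || '|' || n_live_tup FROM pg_stat_user_tables ORDER BY 1;" \
#     > /tmp/prod-counts.txt
# (n_live_tup is an estimate; use SELECT count(*) per table for exact checks.)
# Sample data stands in for the two dumps below:
printf 'members|10482\ninvoices|55210\n' > /tmp/prod-counts.txt
printf 'members|10482\ninvoices|55210\n' > /tmp/restore-counts.txt

if diff -u /tmp/prod-counts.txt /tmp/restore-counts.txt > /dev/null; then
  echo "row counts match"
else
  echo "row counts differ -- investigate before signing off" >&2
fi
```

Any mismatch should be recorded in the backup verification log along with its explanation (e.g. writes after the backup timestamp).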

4. Scaling Procedures

Horizontal Scaling (HPA)

The Horizontal Pod Autoscaler is configured in production:

# values-production.yaml
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70

# Check HPA status
kubectl get hpa -n membership-prod

# Manual scale (temporary override only)
kubectl scale deployment membership-api -n membership-prod --replicas=5

# Note: while the HPA is enabled it reconciles the replica count back toward
# its target on the next sync, so a manual scale does not stick. To hold a
# fixed count, set autoscaling.enabled=false in values-production.yaml and
# redeploy via Helm.

Vertical Scaling (Node Upgrade)

To upgrade Hetzner nodes:

  1. Cordon the node: kubectl cordon <node-name>
  2. Drain the node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  3. Upgrade in Hetzner Cloud Console (e.g., CX32 -> CX42)
  4. Uncordon: kubectl uncordon <node-name>
  5. Verify pods reschedule: kubectl get pods -n membership-prod -o wide

Database Scaling

  • Read replicas: Add Hetzner managed read replicas for read-heavy queries
  • Connection pooling: PgBouncer in front of PostgreSQL if connection count exceeds 200
  • Vertical: Upgrade CPX21 -> CPX31 via Hetzner Cloud Console (brief downtime)
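If PgBouncer is introduced, a minimal configuration sketch could look like the following; the host, auth file path, and pool sizes are assumptions to adapt:

```ini
; Minimal PgBouncer sketch -- transaction pooling in front of PostgreSQL.
; <DB_HOST> and the auth_file path are placeholders.
[databases]
membership = host=<DB_HOST> port=5432 dbname=membership

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500        ; clients connect here instead of Postgres
default_pool_size = 20       ; actual server connections per database/user
```

Point the application's DB host at the PgBouncer service on port 6432. Note that transaction pooling is incompatible with session-level features such as prepared statements held across transactions.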

5. Security Hardening

Kubernetes Security

# Pod Security Standards (restricted)
apiVersion: v1
kind: Namespace
metadata:
  name: membership-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted

  • Network Policies: Restrict pod-to-pod communication; only API pods can reach DB/Redis/RabbitMQ
  • RBAC: Minimal service account permissions; no cluster-admin for application pods
  • Image Scanning: GitLab Container Scanning in CI pipeline
  • No root containers: runAsNonRoot: true, readOnlyRootFilesystem: true
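The Network Policies bullet can be expressed as a manifest. A sketch for Redis, assuming the pod labels app: redis and app: membership-api (match these to the actual Helm chart labels):

```yaml
# Allow only API pods to reach Redis on 6379; once a policy selects the
# Redis pods, all other ingress to them is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-from-api-only
  namespace: membership-prod
spec:
  podSelector:
    matchLabels:
      app: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: membership-api
      ports:
        - protocol: TCP
          port: 6379
```

Analogous policies apply to PostgreSQL and RabbitMQ where those run as in-cluster pods.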

Secret Management

Secret Type          Tool                       Rotation
Application secrets  Sealed Secrets (Bitnami)   Quarterly
Team passwords       Vaultwarden                On change
TLS certificates     Dehydrated + cert-manager  Auto (Let's Encrypt, 90-day)
JWT signing keys     Sealed Secrets             Annually
Database passwords   Sealed Secrets             Quarterly

# Rotate a sealed secret
kubectl create secret generic membership-secrets \
  --from-literal=DB_PASSWORD='<new-password>' \
  --dry-run=client -o yaml | \
  kubeseal --format yaml > sealed-secret.yaml

kubectl apply -f sealed-secret.yaml
kubectl rollout restart deployment/membership-api -n membership-prod

Firewall Rules (Hetzner Cloud Firewall)

Rule            Direction  Protocol  Port         Source                 Action
HTTPS           Inbound    TCP       443          0.0.0.0/0              Allow
SSH             Inbound    TCP       22           Office IP only         Allow
Kubernetes API  Inbound    TCP       6443         Office IP + CI runner  Allow
Node ports      Inbound    TCP       30000-32767  Internal               Allow
All other       Inbound    Any       Any          Any                    Drop
All             Outbound   Any       Any          Any                    Allow
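Since the DR procedure already provisions via Terraform, the inbound rules above can be codified with the hcloud provider. A sketch; the resource name and the office/CI CIDRs are placeholders:

```hcl
# Hetzner Cloud firewall mirroring the table above; once attached, inbound
# traffic not matched by a rule is dropped.
resource "hcloud_firewall" "membership" {
  name = "membership-prod"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443"
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = ["<OFFICE_IP>/32"]
  }

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "6443"
    source_ips = ["<OFFICE_IP>/32", "<CI_RUNNER_IP>/32"]
  }
}
```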

TLS Certificate Management

Certificates are managed by two complementary tools:

  • cert-manager (in-cluster): Provisions Let's Encrypt certificates for Ingress resources automatically
  • Dehydrated (external): Manages origin certificates via DNS-01 challenge (Cloudflare) for non-Ingress services

# Check certificate status
kubectl get certificates -n membership-prod
kubectl describe certificate membership-tls -n membership-prod

# Icinga monitors certificate expiry externally
# Alert threshold: < 14 days remaining
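For a manual spot check alongside Icinga, openssl can print the expiry date. The live invocation is shown as a comment (the hostname is a placeholder); the parsing step is demonstrated on a locally generated throwaway certificate:

```shell
# Live check (hostname is a placeholder):
#   echo | openssl s_client -connect members.example.org:443 \
#     -servername members.example.org 2>/dev/null | openssl x509 -noout -enddate
# Same parsing on a throwaway self-signed cert:
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/t.key -out /tmp/t.pem \
  -days 14 -subj "/CN=expiry-demo" 2>/dev/null
openssl x509 -noout -enddate -in /tmp/t.pem   # prints a notAfter=... line
```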

6. Disaster Recovery

Recovery Targets

Metric                         Target   Method
RTO (Recovery Time Objective)  4 hours  Velero restore + DB restore
RPO (Recovery Point Objective)  1 hour   PostgreSQL WAL archiving

DR Procedure

Scenario: Complete cluster loss

  1. Assess (T+0 to T+15m)
     - Verify the outage scope (single node vs. full cluster)
     - Notify stakeholders via incident channel

  2. Provision new cluster (T+15m to T+1h)
     - Create new Hetzner K8s cluster via Terraform/Console
     - Install prerequisites: ingress-nginx, cert-manager, sealed-secrets

  3. Restore data (T+1h to T+2h)
     - Restore PostgreSQL from latest backup (Hetzner PITR or pg_restore)
     - Restore MinIO documents from Object Storage replica
     - Restore Kubernetes state from Velero backup

  4. Deploy application (T+2h to T+3h)
     - Apply sealed secrets
     - Deploy via Helm: helm upgrade --install membership ./helm/membership -f values-production.yaml
     - Verify health endpoints

  5. Validate (T+3h to T+4h)
     - Run smoke tests
     - Verify data integrity (row counts, recent transactions)
     - Update DNS if load balancer IP changed
     - Monitor for 30 minutes

  6. Post-incident (T+4h+)
     - Write incident report (root cause, timeline, actions)
     - Update DR procedure if gaps found
     - Schedule DR drill for next quarter

DR Drills

Conduct quarterly DR drills:

  1. Restore database backup to test environment
  2. Deploy application against restored data
  3. Run full smoke test suite
  4. Measure actual RTO and RPO
  5. Document results and improvement actions
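Step 4 is simplest with timestamps captured at the start and end of the drill. A small sketch using GNU date (the timestamps are examples):

```shell
# Measured RTO = wall-clock time from "incident declared" to "service validated".
start=$(date -u -d '2026-02-23T09:00:00Z' +%s)   # drill start (example)
end=$(date -u -d '2026-02-23T11:45:00Z' +%s)     # validation complete (example)
echo "RTO: $(( (end - start) / 60 )) minutes"    # -> RTO: 165 minutes
```

RPO is measured analogously: the timestamp of the last restored transaction versus the simulated failure time.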

7. Maintenance Windows

Scheduled Maintenance

Activity                  Schedule                       Duration     Impact
OS security patches       Weekly, Sunday 04:00 UTC       15 min/node  Rolling, no downtime
Kubernetes upgrade        Monthly, 1st Sunday 03:00 UTC  30 min       Rolling, brief API blips
PostgreSQL minor upgrade  Quarterly                      5 min        Brief DB reconnect
PostgreSQL major upgrade  Annually                       1 hour       Planned downtime
Sealed Secrets rotation   Quarterly                      10 min       Pod restart
DR drill                  Quarterly                      2 hours      No production impact

Maintenance Procedure

  1. Announce maintenance window 48 hours in advance (status page + email)
  2. Create pre-maintenance backup
  3. Perform maintenance
  4. Verify health endpoints and dashboards
  5. Update status page

8. Useful Commands Reference

# --- Kubernetes ---
kubectl get pods -n membership-prod -o wide
kubectl logs -f deployment/membership-api -n membership-prod
kubectl exec -it deployment/membership-api -n membership-prod -- /bin/sh
kubectl top pods -n membership-prod
kubectl rollout restart deployment/membership-api -n membership-prod

# --- Helm ---
helm list -n membership-prod
helm history membership -n membership-prod
helm rollback membership <revision> -n membership-prod

# --- PostgreSQL ---
psql -h <host> -U membership -d membership
SELECT pg_size_pretty(pg_database_size('membership'));
SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 10;

# --- Redis ---
redis-cli -h <host> DBSIZE
redis-cli -h <host> INFO memory
redis-cli -h <host> --scan --pattern "brute-force:*"   # SCAN-based; KEYS blocks the server

# --- RabbitMQ ---
rabbitmqctl list_queues name messages consumers
rabbitmqctl list_connections