IT Operations Guide — Membership One

1. Overview

This guide covers infrastructure management, backup procedures, scaling, security hardening, and disaster recovery for the Membership One platform hosted on Hetzner Cloud.

2. Infrastructure Architecture

Hetzner Cloud Components

                    Internet
                       |
              [Cloudflare DNS/WAF]
                       |
              [Hetzner Load Balancer]
                   LB11 (EUR 6/mo)
                       |
        +------+-------+-------+------+
        |      |       |       |      |
    [App-1] [App-2] [App-3] [Infra-1] [Infra-2]
     CX32    CX32    CX32    CX22     CX22
      |       |       |       |        |
      +-------+-------+       +--------+
              |                    |
    Membership Pods         Monitoring Stack
    (API + Web)            (Prometheus, Grafana,
                            Loki, Alertmanager,
                            Icinga)
              |
    +----+----+----+----+
    |    |         |    |
  [PG]  [Redis] [RMQ] [MinIO]
  CPX21  CX11   CX11  Object
                       Storage

Monthly Cost Estimate

Resource             Type                   Cost/mo
3x App nodes         CX32 (4 vCPU, 8 GB)    3 x EUR 16 = EUR 48
2x Infra nodes       CX22 (2 vCPU, 4 GB)    2 x EUR 6 = EUR 12
Managed PostgreSQL   CPX21                  EUR 15
Load Balancer        LB11                   EUR 6
Object Storage       100 GB                 EUR 3
Floating IPs         2x                     EUR 5
Snapshots/Backups    --                     EUR 5
Total                                       ~EUR 94/mo

Network Configuration

Service              Port   Protocol  Access
API (Ingress)        443    HTTPS     Public
PostgreSQL           5432   TCP       Internal (K8s network)
Redis                6379   TCP       Internal
RabbitMQ             5672   AMQP      Internal
RabbitMQ Management  15672  HTTP      Internal
MinIO                9000   HTTP      Internal
Prometheus           9090   HTTP      Internal
Grafana              3000   HTTP      Internal (+ Ingress)

3. Backup Strategy

Overview

Data Source          Method                      Schedule         Retention       Storage
PostgreSQL           pg_dump + Hetzner PITR      Daily 02:00 UTC  30 days         Hetzner Object Storage
PostgreSQL WAL       Continuous archiving        Continuous       7 days          Hetzner Object Storage
Kubernetes state     Velero                      Daily 03:00 UTC  14 days         Hetzner Object Storage
MinIO documents      Cross-location replication  Continuous       Same as source  Hetzner Object Storage (secondary location)
Sealed Secrets keys  Manual export               After rotation   Indefinite      Vaultwarden
Helm values          Git (GitLab)                On change        Indefinite      GitLab

PostgreSQL Backup

# Manual backup
kubectl exec -n membership-prod deployment/membership-db-backup -- \
  pg_dump -h <DB_HOST> -U membership -Fc membership > backup-$(date +%Y%m%d).dump

# Restore from backup
pg_restore -h <DB_HOST> -U membership -d membership -c backup-20260223.dump

# Point-in-time recovery (Hetzner managed)
# Use Hetzner Cloud Console > Databases > Restore > Select timestamp

Velero Backup

# Install Velero
velero install --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket membership-backups \
  --secret-file ./credentials-velero \
  --backup-location-config s3Url=https://s3.hetzner.cloud,region=eu-central,s3ForcePathStyle=true \
  --use-volume-snapshots=false
# Note: the AWS plugin handles any S3-compatible store; pin it to a current release.

# Manual backup
velero backup create membership-manual-$(date +%Y%m%d) \
  --include-namespaces membership-prod

# Schedule daily backup
velero schedule create membership-daily \
  --schedule="0 3 * * *" \
  --include-namespaces membership-prod \
  --ttl 336h

# Restore
velero restore create --from-backup membership-manual-20260223

Backup Verification

Run monthly restore tests:

  1. Restore PostgreSQL backup to a test database
  2. Verify row counts match production
  3. Run application health check against restored data
  4. Document results in backup verification log
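Step 2 above can be scripted. A minimal sketch, assuming per-table counts are first dumped to text files (the psql invocation in the comments uses placeholder hosts; inline sample data keeps the comparison logic self-contained here):

```shell
# In practice, produce each counts file from the respective database, e.g.:
#   psql -h <PROD_HOST> -U membership -d membership -At \
#     -c "SELECT relname || '|' || n_live_tup FROM pg_stat_user_tables ORDER BY 1;" \
#     > /tmp/prod-counts.txt
# (n_live_tup is an estimate; use SELECT count(*) per table for exact checks.)
# Sample data stands in for the two dumps below:
printf 'members|10482\ninvoices|55210\n' > /tmp/prod-counts.txt
printf 'members|10482\ninvoices|55210\n' > /tmp/restore-counts.txt

if diff -u /tmp/prod-counts.txt /tmp/restore-counts.txt > /dev/null; then
  echo "row counts match"
else
  echo "row counts differ -- investigate before signing off" >&2
fi
```

Any mismatch should be recorded in the backup verification log along with its explanation (e.g. writes after the backup timestamp).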

4. Scaling Procedures

Horizontal Scaling (HPA)

The Horizontal Pod Autoscaler is configured in production:

# values-production.yaml
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70

# Check HPA status
kubectl get hpa -n membership-prod

# Manual scale (temporary override only)
kubectl scale deployment membership-api -n membership-prod --replicas=5

# Note: while the HPA is enabled it reconciles the replica count back toward
# its target on the next sync, so a manual scale does not stick. To hold a
# fixed count, set autoscaling.enabled=false in values-production.yaml and
# redeploy via Helm.

Vertical Scaling (Node Upgrade)

To upgrade Hetzner nodes:

  1. Cordon the node: kubectl cordon <node-name>
  2. Drain the node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  3. Upgrade in Hetzner Cloud Console (e.g., CX32 -> CX42)
  4. Uncordon: kubectl uncordon <node-name>
  5. Verify pods reschedule: kubectl get pods -n membership-prod -o wide

Database Scaling

  • Read replicas: Add Hetzner managed read replicas for read-heavy queries
  • Connection pooling: PgBouncer in front of PostgreSQL if connection count exceeds 200
  • Vertical: Upgrade CPX21 -> CPX31 via Hetzner Cloud Console (brief downtime)
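If PgBouncer is introduced, a minimal configuration sketch could look like the following; the host, auth file path, and pool sizes are assumptions to adapt:

```ini
; Minimal PgBouncer sketch -- transaction pooling in front of PostgreSQL.
; <DB_HOST> and the auth_file path are placeholders.
[databases]
membership = host=<DB_HOST> port=5432 dbname=membership

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500        ; clients connect here instead of Postgres
default_pool_size = 20       ; actual server connections per database/user
```

Point the application's DB host at the PgBouncer service on port 6432. Note that transaction pooling is incompatible with session-level features such as prepared statements held across transactions.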

5. Security Hardening

Kubernetes Security

# Pod Security Standards (restricted)
apiVersion: v1
kind: Namespace
metadata:
  name: membership-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted

  • Network Policies: Restrict pod-to-pod communication; only API pods can reach DB/Redis/RabbitMQ
  • RBAC: Minimal service account permissions; no cluster-admin for application pods
  • Image Scanning: GitLab Container Scanning in CI pipeline
  • No root containers: runAsNonRoot: true, readOnlyRootFilesystem: true
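The Network Policies bullet can be expressed as a manifest. A sketch for Redis, assuming the pod labels app: redis and app: membership-api (match these to the actual Helm chart labels):

```yaml
# Allow only API pods to reach Redis on 6379; once a policy selects the
# Redis pods, all other ingress to them is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-from-api-only
  namespace: membership-prod
spec:
  podSelector:
    matchLabels:
      app: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: membership-api
      ports:
        - protocol: TCP
          port: 6379
```

Analogous policies apply to PostgreSQL and RabbitMQ where those run as in-cluster pods.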

Secret Management

Secret Type          Tool                       Rotation
Application secrets  Sealed Secrets (Bitnami)   Quarterly
Team passwords       Vaultwarden                On change
TLS certificates     Dehydrated + cert-manager  Auto (Let's Encrypt, 90-day)
JWT signing keys     Sealed Secrets             Annually
Database passwords   Sealed Secrets             Quarterly

# Rotate a sealed secret
kubectl create secret generic membership-secrets \
  --from-literal=DB_PASSWORD='<new-password>' \
  --dry-run=client -o yaml | \
  kubeseal --format yaml > sealed-secret.yaml

kubectl apply -f sealed-secret.yaml
kubectl rollout restart deployment/membership-api -n membership-prod

Firewall Rules (Hetzner Cloud Firewall)

Rule            Direction  Protocol  Port         Source                 Action
HTTPS           Inbound    TCP       443          0.0.0.0/0              Allow
SSH             Inbound    TCP       22           Office IP only         Allow
Kubernetes API  Inbound    TCP       6443         Office IP + CI runner  Allow
Node ports      Inbound    TCP       30000-32767  Internal               Allow
All other       Inbound    Any       Any          Any                    Drop
All             Outbound   Any       Any          Any                    Allow
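Since the DR procedure already provisions via Terraform, the inbound rules above can be codified with the hcloud provider. A sketch; the resource name and the office/CI CIDRs are placeholders:

```hcl
# Hetzner Cloud firewall mirroring the table above; once attached, inbound
# traffic not matched by a rule is dropped.
resource "hcloud_firewall" "membership" {
  name = "membership-prod"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443"
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = ["<OFFICE_IP>/32"]
  }

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "6443"
    source_ips = ["<OFFICE_IP>/32", "<CI_RUNNER_IP>/32"]
  }
}
```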

TLS Certificate Management

Certificates are managed by two complementary tools:

  • cert-manager (in-cluster): Provisions Let's Encrypt certificates for Ingress resources automatically
  • Dehydrated (external): Manages origin certificates via DNS-01 challenge (Cloudflare) for non-Ingress services

# Check certificate status
kubectl get certificates -n membership-prod
kubectl describe certificate membership-tls -n membership-prod

# Icinga monitors certificate expiry externally
# Alert threshold: < 14 days remaining
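For a manual spot check alongside Icinga, openssl can print the expiry date. The live invocation is shown as a comment (the hostname is a placeholder); the parsing step is demonstrated on a locally generated throwaway certificate:

```shell
# Live check (hostname is a placeholder):
#   echo | openssl s_client -connect members.example.org:443 \
#     -servername members.example.org 2>/dev/null | openssl x509 -noout -enddate
# Same parsing on a throwaway self-signed cert:
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/t.key -out /tmp/t.pem \
  -days 14 -subj "/CN=expiry-demo" 2>/dev/null
openssl x509 -noout -enddate -in /tmp/t.pem   # prints a notAfter=... line
```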

6. Disaster Recovery

Recovery Targets

Metric                         Target   Method
RTO (Recovery Time Objective)  4 hours  Velero restore + DB restore
RPO (Recovery Point Objective)  1 hour   PostgreSQL WAL archiving

DR Procedure

Scenario: Complete cluster loss

  1. Assess (T+0 to T+15m)
     - Verify the outage scope (single node vs. full cluster)
     - Notify stakeholders via incident channel

  2. Provision new cluster (T+15m to T+1h)
     - Create new Hetzner K8s cluster via Terraform/Console
     - Install prerequisites: ingress-nginx, cert-manager, sealed-secrets

  3. Restore data (T+1h to T+2h)
     - Restore PostgreSQL from latest backup (Hetzner PITR or pg_restore)
     - Restore MinIO documents from Object Storage replica
     - Restore Kubernetes state from Velero backup

  4. Deploy application (T+2h to T+3h)
     - Apply sealed secrets
     - Deploy via Helm: helm upgrade --install membership ./helm/membership -f values-production.yaml
     - Verify health endpoints

  5. Validate (T+3h to T+4h)
     - Run smoke tests
     - Verify data integrity (row counts, recent transactions)
     - Update DNS if load balancer IP changed
     - Monitor for 30 minutes

  6. Post-incident (T+4h+)
     - Write incident report (root cause, timeline, actions)
     - Update DR procedure if gaps found
     - Schedule DR drill for next quarter

DR Drills

Conduct quarterly DR drills:

  1. Restore database backup to test environment
  2. Deploy application against restored data
  3. Run full smoke test suite
  4. Measure actual RTO and RPO
  5. Document results and improvement actions
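Step 4 is simplest with timestamps captured at the start and end of the drill. A small sketch using GNU date (the timestamps are examples):

```shell
# Measured RTO = wall-clock time from "incident declared" to "service validated".
start=$(date -u -d '2026-02-23T09:00:00Z' +%s)   # drill start (example)
end=$(date -u -d '2026-02-23T11:45:00Z' +%s)     # validation complete (example)
echo "RTO: $(( (end - start) / 60 )) minutes"    # -> RTO: 165 minutes
```

RPO is measured analogously: the timestamp of the last restored transaction versus the simulated failure time.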

7. Maintenance Windows

Scheduled Maintenance

Activity                  Schedule                       Duration     Impact
OS security patches       Weekly, Sunday 04:00 UTC       15 min/node  Rolling, no downtime
Kubernetes upgrade        Monthly, 1st Sunday 03:00 UTC  30 min       Rolling, brief API blips
PostgreSQL minor upgrade  Quarterly                      5 min        Brief DB reconnect
PostgreSQL major upgrade  Annually                       1 hour       Planned downtime
Sealed Secrets rotation   Quarterly                      10 min       Pod restart
DR drill                  Quarterly                      2 hours      No production impact

Maintenance Procedure

  1. Announce maintenance window 48 hours in advance (status page + email)
  2. Create pre-maintenance backup
  3. Perform maintenance
  4. Verify health endpoints and dashboards
  5. Update status page

8. Useful Commands Reference

# --- Kubernetes ---
kubectl get pods -n membership-prod -o wide
kubectl logs -f deployment/membership-api -n membership-prod
kubectl exec -it deployment/membership-api -n membership-prod -- /bin/sh
kubectl top pods -n membership-prod
kubectl rollout restart deployment/membership-api -n membership-prod

# --- Helm ---
helm list -n membership-prod
helm history membership -n membership-prod
helm rollback membership <revision> -n membership-prod

# --- PostgreSQL ---
psql -h <host> -U membership -d membership
SELECT pg_size_pretty(pg_database_size('membership'));
SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 10;

# --- Redis ---
redis-cli -h <host> DBSIZE
redis-cli -h <host> INFO memory
redis-cli -h <host> --scan --pattern "brute-force:*"   # SCAN-based; KEYS blocks the server

# --- RabbitMQ ---
rabbitmqctl list_queues name messages consumers
rabbitmqctl list_connections