IT Operations Guide — Membership One
1. Overview
This guide covers infrastructure management, backup procedures, scaling, security hardening, and disaster recovery for the Membership One platform hosted on Hetzner Cloud.
2. Infrastructure Architecture
Hetzner Cloud Components
Internet
|
[Cloudflare DNS/WAF]
|
[Hetzner Load Balancer]
LB11 (EUR 6/mo)
|
+------+-------+-------+------+
| | | | |
[App-1] [App-2] [App-3] [Infra-1] [Infra-2]
CX32 CX32 CX32 CX22 CX22
| | | | |
+-------+-------+ +--------+
| |
Membership Pods Monitoring Stack
(API + Web) (Prometheus, Grafana,
Loki, Alertmanager,
Icinga)
|
+----+----+----+----+
| | | |
[PG] [Redis] [RMQ] [MinIO]
CPX21 CX11 CX11 Object
Storage
Monthly Cost Estimate
| Resource | Type | Cost/mo |
|---|---|---|
| 3x App nodes | CX32 (4 vCPU, 8 GB) | 3 x EUR 16 = EUR 48 |
| 2x Infra nodes | CX22 (2 vCPU, 4 GB) | 2 x EUR 6 = EUR 12 |
| Managed PostgreSQL | CPX21 | EUR 15 |
| Load Balancer | LB11 | EUR 6 |
| Object Storage | 100 GB | EUR 3 |
| Floating IPs | 2x | EUR 5 |
| Snapshots/Backups | — | EUR 5 |
| Total | ~EUR 94/mo |
Network Configuration
| Service | Port | Protocol | Access |
|---|---|---|---|
| API (Ingress) | 443 | HTTPS | Public |
| PostgreSQL | 5432 | TCP | Internal (K8s network) |
| Redis | 6379 | TCP | Internal |
| RabbitMQ | 5672 | AMQP | Internal |
| RabbitMQ Management | 15672 | HTTP | Internal |
| MinIO | 9000 | HTTP | Internal |
| Prometheus | 9090 | HTTP | Internal |
| Grafana | 3000 | HTTP | Internal (+ Ingress) |
3. Backup Strategy
Overview
| Data Source | Method | Schedule | Retention | Storage |
|---|---|---|---|---|
| PostgreSQL | pg_dump + Hetzner PITR |
Daily 02:00 UTC | 30 days | Hetzner Object Storage |
| PostgreSQL WAL | Continuous archiving | Continuous | 7 days | Hetzner Object Storage |
| Kubernetes state | Velero | Daily 03:00 UTC | 14 days | Hetzner Object Storage |
| MinIO documents | Cross-location replication | Continuous | Same as source | Hetzner Object Storage (secondary location) |
| Sealed Secrets keys | Manual export | After rotation | Indefinite | Vaultwarden |
| Helm values | Git (GitLab) | On change | Indefinite | GitLab |
PostgreSQL Backup
# Manual backup
kubectl exec -n membership-prod deployment/membership-db-backup -- \
pg_dump -h <DB_HOST> -U membership -Fc membership > backup-$(date +%Y%m%d).dump
# Restore from backup
pg_restore -h <DB_HOST> -U membership -d membership -c backup-20260223.dump
# Point-in-time recovery (Hetzner managed)
# Use Hetzner Cloud Console > Databases > Restore > Select timestamp
Velero Backup
# Install Velero
velero install --provider aws \
--bucket membership-backups \
--backup-location-config s3Url=https://s3.hetzner.cloud,region=eu-central \
--use-volume-snapshots=false
# Manual backup
velero backup create membership-manual-$(date +%Y%m%d) \
--include-namespaces membership-prod
# Schedule daily backup
velero schedule create membership-daily \
--schedule="0 3 * * *" \
--include-namespaces membership-prod \
--ttl 336h
# Restore
velero restore create --from-backup membership-manual-20260223
Backup Verification
Run monthly restore tests:
- Restore PostgreSQL backup to a test database
- Verify row counts match production
- Run application health check against restored data
- Document results in backup verification log
4. Scaling Procedures
Horizontal Scaling (HPA)
The Horizontal Pod Autoscaler is configured in production:
# values-production.yaml
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 6
targetCPUUtilizationPercentage: 70
# Check HPA status
kubectl get hpa -n membership-prod
# Manual scale (override HPA temporarily)
kubectl scale deployment membership-api -n membership-prod --replicas=5
# Reset to HPA control
kubectl annotate deployment membership-api -n membership-prod \
kubectl.kubernetes.io/last-applied-configuration-
Vertical Scaling (Node Upgrade)
To upgrade Hetzner nodes:
- Cordon the node:
kubectl cordon <node-name> - Drain the node:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data - Upgrade in Hetzner Cloud Console (e.g., CX32 -> CX42)
- Uncordon:
kubectl uncordon <node-name> - Verify pods reschedule:
kubectl get pods -n membership-prod -o wide
Database Scaling
- Read replicas: Add Hetzner managed read replicas for read-heavy queries
- Connection pooling: PgBouncer in front of PostgreSQL if connection count exceeds 200
- Vertical: Upgrade CPX21 -> CPX31 via Hetzner Cloud Console (brief downtime)
5. Security Hardening
Kubernetes Security
# Pod Security Standards (restricted)
apiVersion: v1
kind: Namespace
metadata:
name: membership-prod
labels:
pod-security.kubernetes.io/enforce: restricted
- Network Policies: Restrict pod-to-pod communication; only API pods can reach DB/Redis/RabbitMQ
- RBAC: Minimal service account permissions; no cluster-admin for application pods
- Image Scanning: GitLab Container Scanning in CI pipeline
- No root containers:
runAsNonRoot: true,readOnlyRootFilesystem: true
Secret Management
| Secret Type | Tool | Rotation |
|---|---|---|
| Application secrets | Sealed Secrets (Bitnami) | Quarterly |
| Team passwords | Vaultwarden | On change |
| TLS certificates | Dehydrated + cert-manager (auto) | Auto (Let's Encrypt, 90-day) |
| JWT signing keys | Sealed Secrets | Annually |
| Database passwords | Sealed Secrets | Quarterly |
# Rotate a sealed secret
kubectl create secret generic membership-secrets \
--from-literal=DB_PASSWORD='<new-password>' \
--dry-run=client -o yaml | \
kubeseal --format yaml > sealed-secret.yaml
kubectl apply -f sealed-secret.yaml
kubectl rollout restart deployment/membership-api -n membership-prod
Firewall Rules (Hetzner Cloud Firewall)
| Rule | Direction | Protocol | Port | Source | Action |
|---|---|---|---|---|---|
| HTTPS | Inbound | TCP | 443 | 0.0.0.0/0 | Allow |
| SSH | Inbound | TCP | 22 | Office IP only | Allow |
| Kubernetes API | Inbound | TCP | 6443 | Office IP + CI runner | Allow |
| Node ports | Inbound | TCP | 30000-32767 | Internal | Allow |
| All other | Inbound | — | — | — | Drop |
| All | Outbound | — | — | — | Allow |
TLS Certificate Management
Certificates are managed by two complementary tools:
- cert-manager (in-cluster): Provisions Let's Encrypt certificates for Ingress resources automatically
- Dehydrated (external): Manages origin certificates via DNS-01 challenge (Cloudflare) for non-Ingress services
# Check certificate status
kubectl get certificates -n membership-prod
kubectl describe certificate membership-tls -n membership-prod
# Icinga monitors certificate expiry externally
# Alert threshold: < 14 days remaining
6. Disaster Recovery
Recovery Targets
| Metric | Target | Method |
|---|---|---|
| RTO (Recovery Time Objective) | 4 hours | Velero restore + DB restore |
| RPO (Recovery Point Objective) | 1 hour | PostgreSQL WAL archiving |
DR Procedure
Scenario: Complete cluster loss
-
Assess (T+0 to T+15m) - Verify the outage scope (single node vs. full cluster) - Notify stakeholders via incident channel
-
Provision new cluster (T+15m to T+1h) - Create new Hetzner K8s cluster via Terraform/Console - Install prerequisites: ingress-nginx, cert-manager, sealed-secrets
-
Restore data (T+1h to T+2h) - Restore PostgreSQL from latest backup (Hetzner PITR or
pg_restore) - Restore MinIO documents from Object Storage replica - Restore Kubernetes state from Velero backup -
Deploy application (T+2h to T+3h) - Apply sealed secrets - Deploy via Helm:
helm upgrade --install membership ./helm/membership -f values-production.yaml- Verify health endpoints -
Validate (T+3h to T+4h) - Run smoke tests - Verify data integrity (row counts, recent transactions) - Update DNS if load balancer IP changed - Monitor for 30 minutes
-
Post-incident (T+4h+) - Write incident report (root cause, timeline, actions) - Update DR procedure if gaps found - Schedule DR drill for next quarter
DR Drills
Conduct quarterly DR drills:
- Restore database backup to test environment
- Deploy application against restored data
- Run full smoke test suite
- Measure actual RTO and RPO
- Document results and improvement actions
7. Maintenance Windows
Scheduled Maintenance
| Activity | Schedule | Duration | Impact |
|---|---|---|---|
| OS security patches | Weekly, Sunday 04:00 UTC | 15 min/node | Rolling, no downtime |
| Kubernetes upgrade | Monthly, 1st Sunday 03:00 UTC | 30 min | Rolling, brief API blips |
| PostgreSQL minor upgrade | Quarterly | 5 min | Brief DB reconnect |
| PostgreSQL major upgrade | Annually | 1 hour | Planned downtime |
| Sealed Secrets rotation | Quarterly | 10 min | Pod restart |
| DR drill | Quarterly | 2 hours | No production impact |
Maintenance Procedure
- Announce maintenance window 48 hours in advance (status page + email)
- Create pre-maintenance backup
- Perform maintenance
- Verify health endpoints and dashboards
- Update status page
8. Useful Commands Reference
# --- Kubernetes ---
kubectl get pods -n membership-prod -o wide
kubectl logs -f deployment/membership-api -n membership-prod
kubectl exec -it deployment/membership-api -n membership-prod -- /bin/sh
kubectl top pods -n membership-prod
kubectl rollout restart deployment/membership-api -n membership-prod
# --- Helm ---
helm list -n membership-prod
helm history membership -n membership-prod
helm rollback membership <revision> -n membership-prod
# --- PostgreSQL ---
psql -h <host> -U membership -d membership
SELECT pg_size_pretty(pg_database_size('membership'));
SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 10;
# --- Redis ---
redis-cli -h <host> DBSIZE
redis-cli -h <host> INFO memory
redis-cli -h <host> KEYS "brute-force:*"
# --- RabbitMQ ---
rabbitmqctl list_queues name messages consumers
rabbitmqctl list_connections