Hetzner Cloud Infrastructure — Operations Runbook

1. Server Overview

Property      Value
Server        Hetzner AX62 Dedicated
CPU           AMD Ryzen 9 7950X3D (16C/32T)
RAM           128 GB DDR5 ECC
Storage       2x 1 TB NVMe SSD (RAID 1)
OS            Ubuntu 24.04 LTS
Domain        membership-one.com
Monthly Cost  ~83 EUR (server + storage box)

Disk Layout

Mount            Size    Purpose
/                50 GB   OS + system
/var/lib/docker  500 GB  Docker volumes
/home            200 GB  Developer home dirs
/backup          200 GB  Local backup staging
Swap             16 GB

Access

# Admin access
ssh admin@membership-one.com

# Developer access (per user)
ssh dev-ulrich@membership-one.com

2. Service Map

URLs

Service            URL                                     Internal Port
GitLab             https://gitlab.membership-one.com       8929
Wiki.js            https://wiki.membership-one.com         3000
Keycloak           https://auth.membership-one.com         8080
Vaultwarden        https://vault.membership-one.com        80
Grafana            https://grafana.membership-one.com      3000
Uptime Kuma        https://status.membership-one.com       3001
Production         https://app.membership-one.com          8081
Test               https://test.membership-one.com         8082
Integration        https://integration.membership-one.com  8083
Registry           https://registry.membership-one.com     5050
Traefik Dashboard  https://traefik.membership-one.com      8080

Docker Compose Files

All compose files located at /opt/hetzner/:

File                            Services                                            Network
docker-compose.proxy.yml        Traefik                                             proxy (bridges all)
docker-compose.mgmt.yml         GitLab, Keycloak, Wiki.js, Vaultwarden, CI Runners  mgmt
docker-compose.integration.yml  App + PG + Redis + RMQ + MinIO                      app-integration
docker-compose.test.yml         App + PG + Redis + RMQ + MinIO                      app-test
docker-compose.production.yml   App + PG + Redis + RMQ + MinIO                      app-production
docker-compose.monitoring.yml   Prometheus, Grafana, Loki, Uptime Kuma              monitoring

Start/Stop Order

Start (dependencies first):

cd /opt/hetzner
docker compose -f docker-compose.proxy.yml up -d
docker compose -f docker-compose.mgmt.yml up -d
docker compose -f docker-compose.monitoring.yml up -d
docker compose -f docker-compose.production.yml up -d
docker compose -f docker-compose.test.yml up -d
docker compose -f docker-compose.integration.yml up -d

Stop (reverse order):

cd /opt/hetzner
docker compose -f docker-compose.integration.yml down
docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.production.yml down
docker compose -f docker-compose.monitoring.yml down
docker compose -f docker-compose.mgmt.yml down
docker compose -f docker-compose.proxy.yml down
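
The two sequences above can be wrapped in one helper so the order is defined in a single place. This is a minimal sketch, not an existing script: the function names stack_order and stack_ctl are hypothetical, and it assumes the compose files keep the names listed in this section.

```shell
#!/usr/bin/env bash
# Sketch only: stack_order and stack_ctl are hypothetical helper names.
COMPOSE_DIR=/opt/hetzner

# Print the stack names in start order, or reversed when $1 is "stop".
stack_order() {
  local stacks="proxy mgmt monitoring production test integration"
  local reversed="" s
  if [ "$1" = "stop" ]; then
    # Prepend each name so the last stack started is the first one stopped.
    for s in $stacks; do
      reversed="$s${reversed:+ $reversed}"
    done
    stacks="$reversed"
  fi
  echo "$stacks"
}

# Run "up -d" (start) or "down" (stop) for every stack, in the right order.
stack_ctl() {
  local action="$1" s file
  for s in $(stack_order "$action"); do
    file="$COMPOSE_DIR/docker-compose.$s.yml"
    if [ "$action" = "stop" ]; then
      docker compose -f "$file" down || return 1
    else
      docker compose -f "$file" up -d || return 1
    fi
  done
}
```

Running stack_ctl start or stack_ctl stop then replaces the six individual commands. The helper aborts on the first failing stack so a broken proxy does not go unnoticed.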

3. Daily Operations

Health Checks

# Quick health check — all environments
curl -sf https://app.membership-one.com/api/actuator/health | jq .
curl -sf https://test.membership-one.com/api/actuator/health | jq .
curl -sf https://integration.membership-one.com/api/actuator/health | jq .

# GitLab health
curl -sf https://gitlab.membership-one.com/-/health | head -1

# Docker container status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | sort

# Disk usage
df -h / /var/lib/docker /home /backup

# Docker volume sizes
docker system df -v
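
The three environment checks can also be run as one loop that returns a non-zero exit code when any endpoint is unhealthy, which is convenient for cron jobs or CI. A minimal sketch, assuming the health endpoints answer with an HTTP error status when the application is down (Spring Boot's default for a DOWN status is 503); check_envs is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: check_envs is a hypothetical helper name.

# Probe each environment's actuator health endpoint, print one line per host,
# and return non-zero if any probe fails (curl -f fails on HTTP >= 400).
check_envs() {
  local failed=0 host
  for host in app test integration; do
    if curl -sf -o /dev/null --max-time 10 \
        "https://$host.membership-one.com/api/actuator/health"; then
      echo "$host: OK"
    else
      echo "$host: FAILED"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling check_envs prints a three-line summary; its exit code can gate a follow-up step such as a notification.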

Log Viewing

# Application logs (production)
docker logs -f --tail 100 app-production

# GitLab logs
docker logs -f --tail 50 gitlab

# Keycloak logs
docker logs -f --tail 50 keycloak

# Production logs from the last hour, with timestamps
docker logs --since 1h --timestamps app-production

# Via Grafana/Loki
# Navigate to: grafana.membership-one.com → Explore → Loki
# Query: {container_name="app-production"} |= "ERROR"

Database Access

# Production database
docker exec -it prod-db psql -U membership -d membership_prod

# Test database
docker exec -it test-db psql -U membership -d membership_test

# Integration database
docker exec -it int-db psql -U membership -d membership_int

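# Useful queries (run inside psql; comments sit on their own lines because
# backslash commands treat the rest of the line as arguments)
-- List tables
\dt
-- List tables with sizes
\dt+
-- Count members
SELECT count(*) FROM member;
-- Database size
SELECT pg_size_pretty(pg_database_size('membership_prod'));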

4. Deployment Procedures

Automated Deployment (via GitLab CI/CD)

Deployments are triggered automatically by the CI/CD pipeline:

Branch     Environment  Trigger
develop    Integration  Auto on push
release/*  Test         Manual gate in GitLab
main       Production   Manual gate in GitLab

To trigger a manual deployment:

  1. Go to GitLab → CI/CD → Pipelines
  2. Find the pipeline for the target branch
  3. Click the play button on the deploy-test or deploy-production job

Manual Deployment

# Pull latest image and restart (e.g., production)
cd /opt/hetzner
docker compose -f docker-compose.production.yml pull app-production
docker compose -f docker-compose.production.yml up -d app-production

# Verify health after deploy
sleep 30
curl -sf http://localhost:8081/api/actuator/health | jq .
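
A fixed 30-second sleep can be too short for a slow start and needlessly long for a fast one. One alternative is to poll the health endpoint until it answers. The helper below is a sketch under that assumption; wait_for is a hypothetical name, and the retry count and delay in the usage note are illustrative values.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for is a hypothetical helper name.

# Retry a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_for 12 5 curl -sf -o /dev/null http://localhost:8081/api/actuator/health waits up to about a minute and fails with exit code 1 if production never becomes healthy.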

Rollback

# Find previous image digest
docker images registry.membership-one.com/membership-one/backend --digests

# Roll back to specific tag
docker compose -f docker-compose.production.yml down app-production
# Edit .env.production to set IMAGE_TAG to the previous version
docker compose -f docker-compose.production.yml up -d app-production
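
The manual edit of .env.production can be scripted to avoid typos during an incident. A minimal sketch, assuming the file contains a line of the form IMAGE_TAG=<value>; set_image_tag is a hypothetical name and the version in the usage note is made up.

```shell
#!/usr/bin/env bash
# Sketch only: set_image_tag is a hypothetical helper name.

# Rewrite the IMAGE_TAG=... line in an env file, keeping a .bak copy.
# Fails if the file has no IMAGE_TAG line, so a wrong path cannot pass silently.
set_image_tag() {
  local file="$1" tag="$2"
  if ! grep -q '^IMAGE_TAG=' "$file"; then
    echo "no IMAGE_TAG line in $file" >&2
    return 1
  fi
  # "|" is a safe sed delimiter here because image tags cannot contain it.
  sed -i.bak "s|^IMAGE_TAG=.*|IMAGE_TAG=$tag|" "$file"
}
```

For example, set_image_tag .env.production 1.4.1 pins the previous release; the .bak copy records which tag was active before the rollback.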

Database Migration (Flyway)

Flyway migrations run automatically on application startup. If a migration fails:

# Check migration status
docker exec -it prod-db psql -U membership -d membership_prod \
  -c "SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 10;"

# If a migration is stuck in FAILED state (repair):
# 1. Fix the migration SQL file
# 2. Connect to the database and clean up:
docker exec -it prod-db psql -U membership -d membership_prod \
  -c "DELETE FROM flyway_schema_history WHERE success = false;"
# 3. Restart the application
docker compose -f docker-compose.production.yml restart app-production

5. Backup & Restore

Backup Schedule

Property   Value
Tool       Restic
Target     Hetzner Storage Box BX11 (sftp)
Schedule   Daily at 02:00 CET
Retention  7 daily, 4 weekly, 3 monthly
Cron       0 2 * * * /opt/hetzner/scripts/backup.sh
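
The retention policy in the table maps directly onto restic's forget flags. The sketch below shows that mapping only; it is not the contents of backup.sh, and apply_retention is a hypothetical name. The repository path is the one used elsewhere in this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: this is not the contents of backup.sh; apply_retention is a
# hypothetical helper name.
REPO="sftp:storage-box:/backups"

# Drop snapshots outside the 7 daily / 4 weekly / 3 monthly policy and
# reclaim the space they used in the repository.
apply_retention() {
  restic -r "$REPO" forget \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 3 \
    --prune
}
```

Adding --dry-run to the forget command lists what would be removed without deleting anything, which is a reasonable first step before changing the policy.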

Manual Backup

# Run backup manually
/opt/hetzner/scripts/backup.sh

# Check backup status
restic -r sftp:storage-box:/backups snapshots

# Check backup integrity
restic -r sftp:storage-box:/backups check

# List files in latest snapshot
restic -r sftp:storage-box:/backups ls latest

Restore Procedure

# Interactive restore
/opt/hetzner/scripts/restore.sh

# Or manual steps:

# 1. List snapshots
restic -r sftp:storage-box:/backups snapshots

# 2. Restore specific snapshot
restic -r sftp:storage-box:/backups restore <SNAPSHOT_ID> --target /tmp/restore/

# 3. Stop services (stop rather than down, so the prod-db container still exists for step 4)
cd /opt/hetzner
docker compose -f docker-compose.production.yml stop

# 4. Restore database
docker start prod-db
docker exec -i prod-db psql -U membership -d membership_prod < /tmp/restore/backup/dumps/prod-db.sql

# 5. Restart services
docker compose -f docker-compose.production.yml up -d

Database-Only Backup

# Dump production database
docker exec prod-db pg_dump -U membership membership_prod > /backup/dumps/prod-$(date +%Y%m%d).sql

# Dump all databases
docker exec prod-db pg_dumpall -U membership > /backup/dumps/all-$(date +%Y%m%d).sql
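
Dated dumps accumulate in /backup/dumps, which shares the 200 GB staging volume. A cleanup step along the following lines keeps it bounded. This is a sketch: prune_dumps is a hypothetical name and the 7-day window in the usage note is an assumption, not a documented policy.

```shell
#!/usr/bin/env bash
# Sketch only: prune_dumps is a hypothetical helper name.

# Delete *.sql dumps directly inside directory $1 whose age exceeds $2 days.
# Each removed path is printed so the cleanup is visible in cron output.
prune_dumps() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -name '*.sql' -mtime +"$days" -print -delete
}
```

For example, prune_dumps /backup/dumps 7 removes week-old dumps and leaves other files in the directory untouched.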

6. SSL/TLS Certificate Management

Certificates are managed automatically by Traefik + Let's Encrypt via Cloudflare DNS-01 challenge.

Verify Certificates

# Check certificate expiry for a domain
echo | openssl s_client -servername app.membership-one.com -connect app.membership-one.com:443 2>/dev/null | openssl x509 -noout -dates

# Show the TLS settings of every router via the Traefik API
curl -s http://localhost:8080/api/http/routers | jq '.[].tls'
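
The expiry date printed by openssl can be turned into a number of remaining days, which is easier to compare against the 14-day and 3-day thresholds used for alerting. A minimal sketch, assuming GNU date (the default on Ubuntu); days_until and cert_days_left are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: days_until and cert_days_left are hypothetical helper names.
# Requires GNU date for the -d option.

# Whole days from epoch $2 (default: now) until the date string in $1.
days_until() {
  local end now
  end="$(date -d "$1" +%s)" || return 1
  now="${2:-$(date +%s)}"
  echo $(( (end - now) / 86400 ))
}

# Days until the certificate served for host $1 on port 443 expires.
cert_days_left() {
  local not_after
  not_after="$(echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)"
  days_until "$not_after"
}
```

For example, cert_days_left app.membership-one.com prints a single integer that can be checked in a script or a cron job.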

Certificate Renewal Issues

If certificates fail to renew:

  1. Check Traefik logs:

     docker logs traefik | grep -i "acme\|cert\|letsencrypt"

  2. Verify Cloudflare API token:

     # Token must have Zone:DNS:Edit permission
     grep CF_DNS_API_TOKEN /opt/hetzner/.env

  3. Force renewal:

     # Remove acme.json and restart Traefik
     docker exec traefik rm /letsencrypt/acme.json
     docker restart traefik


7. Monitoring & Alerting

Dashboards

Dashboard           URL                         Purpose
Grafana - JVM       grafana.membership-one.com  Heap, GC, threads per environment
Grafana - Docker    grafana.membership-one.com  Container CPU, memory, restarts
Grafana - Server    grafana.membership-one.com  Host CPU, RAM, disk, network
Grafana - Business  grafana.membership-one.com  Active users, billing stats
Uptime Kuma         status.membership-one.com   External availability checks

Key Metrics to Watch

Metric                        Warning    Critical   Action
Disk usage (/var/lib/docker)  >70%       >85%       Clean up images, prune volumes
RAM usage                     >80%       >90%       Check for memory leaks, reduce limits
Container restarts            >3/hour    >10/hour   Check logs, investigate crash loops
JVM heap usage                >75%       >90%       Increase -Xmx or investigate leaks
DB connections                >80% pool  >95% pool  Check for connection leaks
HTTP 5xx rate                 >1%        >5%        Check application logs
Certificate expiry            <14 days   <3 days    Check Traefik/Cloudflare config
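
For a quick check from the shell, the percentage thresholds in this table can be applied with a small classifier. The sketch below is illustrative and does not replace the Grafana alerts; level_for and disk_pct are hypothetical names, and disk_pct relies on the --output option of GNU df.

```shell
#!/usr/bin/env bash
# Sketch only: level_for and disk_pct are hypothetical helper names.

# Classify an integer percentage against warning and critical thresholds.
# Usage: level_for VALUE WARN CRIT  (prints ok, warning, or critical)
level_for() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -gt "$crit" ]; then
    echo critical
  elif [ "$value" -gt "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Current usage of the filesystem holding path $1, as a bare integer.
disk_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}
```

For example, level_for "$(disk_pct /var/lib/docker)" 70 85 applies the disk thresholds from the first row of the table.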

Prometheus Queries

# CPU usage per container
rate(container_cpu_usage_seconds_total[5m])

# Memory usage per container
container_memory_usage_bytes

# HTTP request rate (Spring Boot)
rate(http_server_requests_seconds_count[5m])

# HTTP error rate (share of 5xx responses; sum() so both sides carry matching labels)
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count[5m]))

# JVM heap usage
jvm_memory_used_bytes{area="heap"}
/ jvm_memory_max_bytes{area="heap"}

# Database connection pool
hikaricp_connections_active

8. User Management

Create Developer Account

/opt/hetzner/scripts/create-developer.sh <username> <ssh-public-key-file>

This creates:

  • Linux user dev-<username> with SSH key auth
  • Docker group membership
  • Project directory structure
  • tmux configuration
  • Resource limits (4 CPU, 8 GB RAM, 50 GB disk)

Remove Developer Account

# Disable account (keep data); -L only locks the password, -e 1 expires the
# account so SSH key logins are refused as well
usermod -L -e 1 dev-<username>

# Or fully remove
userdel -r dev-<username>

Keycloak User Management

  1. Navigate to https://auth.membership-one.com
  2. Login with admin credentials (from Vaultwarden)
  3. Select realm membership-one
  4. Users → Add user → Set password → Assign to group(s)

Resource Limits per User

# Check current limits
systemctl show user-<UID>.slice | grep -E "CPU|Memory"

# Modify limits (via systemd slice)
systemctl set-property user-<UID>.slice CPUQuota=400%     # 4 cores
systemctl set-property user-<UID>.slice MemoryMax=8G

9. Troubleshooting

Container Won't Start

# Check logs
docker logs <container-name> 2>&1 | tail -50

# Check events
docker events --since 1h --filter container=<container-name>

# Check resources
docker stats --no-stream

# Force recreate
docker compose -f <compose-file> up -d --force-recreate <service>

Database Connection Issues

# Test connectivity
docker exec app-production bash -c "pg_isready -h prod-db -U membership"

# Check connection pool
curl -s http://localhost:8081/api/actuator/metrics/hikaricp.connections.active | jq .

# Check PostgreSQL connections
docker exec prod-db psql -U membership -d membership_prod \
  -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

Out of Disk Space

# Quick cleanup
docker system prune -af --volumes   # WARNING: removes unused volumes!

# Safer cleanup (images only)
docker image prune -af

# Check large files
ncdu /var/lib/docker

# Check Docker volumes (size per volume mountpoint)
docker volume ls -q | while read -r v; do
  mp="$(docker volume inspect --format '{{.Mountpoint}}' "$v")"
  echo "$mp: $(du -sh "$mp" 2>/dev/null | cut -f1)"
done

GitLab Issues

# GitLab status
docker exec gitlab gitlab-ctl status

# Reconfigure after config changes
docker exec gitlab gitlab-ctl reconfigure

# GitLab Rails console (advanced)
docker exec -it gitlab gitlab-rails console

# Reset GitLab admin password
docker exec -it gitlab gitlab-rake "gitlab:password:reset[root]"

Keycloak Issues

# Keycloak logs
docker logs keycloak | grep -i error

# Export realm (for backup)
docker exec keycloak /opt/keycloak/bin/kc.sh export \
  --dir /opt/keycloak/data/export --realm membership-one

# Import realm
docker exec keycloak /opt/keycloak/bin/kc.sh import \
  --dir /opt/keycloak/data/import

High Memory Usage

# Top memory consumers (tab-separated output, sorted by memory percentage)
docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" | sort -t$'\t' -k3 -rn

# Check OOM kills
dmesg | grep -i "out of memory\|oom"

# Check swap usage
free -h
swapon --show

# If GitLab is consuming too much RAM:
# Edit GITLAB_OMNIBUS_CONFIG in docker-compose.mgmt.yml:
# - Reduce Puma workers: puma['worker_processes'] = 2
# - Reduce Sidekiq concurrency: sidekiq['concurrency'] = 5
# - Disable unused features

10. Security

Firewall Rules

# View current rules
ufw status verbose

# Expected rules:
# 22/tcp     ALLOW     (SSH)
# 80/tcp     ALLOW     (HTTP → redirect to HTTPS)
# 443/tcp    ALLOW     (HTTPS)
# 51820/udp  ALLOW     (WireGuard VPN)

SSH Hardening

  • Password authentication: disabled
  • Root login: disabled
  • Only SSH key authentication permitted
  • Config: /etc/ssh/sshd_config

Security Updates

# Check for updates
apt update && apt list --upgradable

# Apply security updates only
unattended-upgrade --dry-run  # preview
unattended-upgrade             # apply

# Full system update (with reboot planning)
apt update && apt upgrade -y
# Reboot if kernel was updated:
# Check: ls /var/run/reboot-required

WireGuard VPN

# VPN status
wg show

# Add peer
wg set wg0 peer <PUBLIC_KEY> allowed-ips <IP>/32

# Generate client key pair (private key in /tmp/client.key, public key in /tmp/client.pub)
wg genkey | tee /tmp/client.key | wg pubkey > /tmp/client.pub
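
The key pair alone is not a usable client configuration. A client file in wg-quick format could be rendered as sketched below. Everything passed in is an assumption to adapt: render_client_conf is a hypothetical name, and the tunnel address, routed subnet, and endpoint host in the usage note are placeholders (only UDP port 51820 comes from the firewall rules in this runbook).

```shell
#!/usr/bin/env bash
# Sketch only: render_client_conf is a hypothetical helper name.

# Print a client-side wg-quick configuration.
# Args: client private key, client address (CIDR), server public key,
#       server endpoint (host:port), subnet to route through the tunnel
render_client_conf() {
  cat <<EOF
[Interface]
PrivateKey = $1
Address = $2

[Peer]
PublicKey = $3
Endpoint = $4
AllowedIPs = $5
PersistentKeepalive = 25
EOF
}
```

For example, render_client_conf "$(cat /tmp/client.key)" 10.8.0.2/32 <SERVER_PUBLIC_KEY> membership-one.com:51820 10.8.0.0/24 prints a config that can be saved on the client as wg0.conf. Delete the key files from /tmp once the config has been handed over.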

11. Disaster Recovery

RTO / RPO Targets

Metric                          Target
RPO (Recovery Point Objective)  24 hours (daily backups)
RTO (Recovery Time Objective)   4 hours

Full Server Recovery Procedure

  1. Order new Hetzner AX62 (or request server reset via Hetzner Robot)
  2. Run bootstrap script: /opt/hetzner/scripts/bootstrap.sh
  3. Restore from backup: /opt/hetzner/scripts/restore.sh
  4. Verify DNS: Ensure Cloudflare DNS points to new server IP
  5. Verify all services: Check each URL in the service map
  6. Verify backups: Ensure new server's backup cron is active

Partial Recovery

# Restore only a specific database
restic -r sftp:storage-box:/backups restore latest \
  --target /tmp/restore --include /backup/dumps/prod-db.sql

docker exec -i prod-db psql -U membership -d membership_prod < /tmp/restore/backup/dumps/prod-db.sql

12. Maintenance Windows

Task                      Frequency  Window
OS security updates       Weekly     Sunday 03:00–04:00 CET
Docker image updates      Monthly    First Sunday 03:00–05:00 CET
PostgreSQL minor updates  Quarterly  Planned maintenance window
Full backup restore test  Monthly    Saturday 10:00–12:00 CET
SSL certificate check     Weekly     Automated (Uptime Kuma)
Disk usage review         Weekly     Automated (Grafana alert)
Log rotation cleanup      Daily      Automated (Docker log rotation)

Pre-Maintenance Checklist

  • Notify team via communication channel
  • Ensure latest backup is available
  • Note current container image tags/digests
  • Check Uptime Kuma for baseline availability
  • Schedule Uptime Kuma maintenance window (suppress alerts)

Post-Maintenance Checklist

  • All containers running (docker ps)
  • All health endpoints responding
  • Grafana dashboards showing metrics
  • Uptime Kuma checks passing
  • Backup cron still scheduled
  • End Uptime Kuma maintenance window

13. Contact & Escalation

Level  Contact               Scope
L1     On-call developer     Application issues, restarts
L2     Infrastructure admin  Server, Docker, networking
L3     Hetzner Support       Hardware failures, network outages

Hetzner Support

  • Robot Console: https://robot.hetzner.com
  • Support: https://docs.hetzner.com/general/others/support/
  • Status Page: https://status.hetzner.com

Useful Hetzner Robot Tasks

  • KVM console (remote access when SSH is down)
  • Hardware RAID monitoring
  • Server reboot / rescue mode
  • Reverse DNS configuration