Hetzner Cloud Infrastructure — Operations Runbook
1. Server Overview
| Property | Value |
|---|---|
| Server | Hetzner AX62 Dedicated |
| CPU | AMD Ryzen 9 7950X3D (16C/32T) |
| RAM | 128 GB DDR5 ECC |
| Storage | 2x 1 TB NVMe SSD (RAID 1) |
| OS | Ubuntu 24.04 LTS |
| Domain | membership-one.com |
| Monthly Cost | ~83 EUR (server + storage box) |
Disk Layout
| Mount | Size | Purpose |
|---|---|---|
| / | 50 GB | OS + system |
| /var/lib/docker | 500 GB | Docker volumes |
| /home | 200 GB | Developer home dirs |
| /backup | 200 GB | Local backup staging |
| Swap | 16 GB | — |
Access
# Admin access
ssh admin@membership-one.com
# Developer access (per user)
ssh dev-ulrich@membership-one.com
2. Service Map
URLs
| Service | URL | Internal Port |
|---|---|---|
| GitLab | https://gitlab.membership-one.com | 8929 |
| Wiki.js | https://wiki.membership-one.com | 3000 |
| Keycloak | https://auth.membership-one.com | 8080 |
| Vaultwarden | https://vault.membership-one.com | 80 |
| Grafana | https://grafana.membership-one.com | 3000 |
| Uptime Kuma | https://status.membership-one.com | 3001 |
| Production | https://app.membership-one.com | 8081 |
| Test | https://test.membership-one.com | 8082 |
| Integration | https://integration.membership-one.com | 8083 |
| Registry | https://registry.membership-one.com | 5050 |
| Traefik Dashboard | https://traefik.membership-one.com | 8080 |
Docker Compose Files
All compose files located at /opt/hetzner/:
| File | Services | Network |
|---|---|---|
| docker-compose.proxy.yml | Traefik | proxy (bridges all) |
| docker-compose.mgmt.yml | GitLab, Keycloak, Wiki.js, Vaultwarden, CI Runners | mgmt |
| docker-compose.integration.yml | App + PG + Redis + RMQ + MinIO | app-integration |
| docker-compose.test.yml | App + PG + Redis + RMQ + MinIO | app-test |
| docker-compose.production.yml | App + PG + Redis + RMQ + MinIO | app-production |
| docker-compose.monitoring.yml | Prometheus, Grafana, Loki, Uptime Kuma | monitoring |
Start/Stop Order
Start (dependencies first):
cd /opt/hetzner
docker compose -f docker-compose.proxy.yml up -d
docker compose -f docker-compose.mgmt.yml up -d
docker compose -f docker-compose.monitoring.yml up -d
docker compose -f docker-compose.production.yml up -d
docker compose -f docker-compose.test.yml up -d
docker compose -f docker-compose.integration.yml up -d
Stop (reverse order):
cd /opt/hetzner
docker compose -f docker-compose.integration.yml down
docker compose -f docker-compose.test.yml down
docker compose -f docker-compose.production.yml down
docker compose -f docker-compose.monitoring.yml down
docker compose -f docker-compose.mgmt.yml down
docker compose -f docker-compose.proxy.yml down
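The ordering above can be kept in one place so that start and stop never drift apart. The following is a minimal sketch, assuming the compose file names from the table above; the function names (`stack_order`, `stack_order_reversed`, `start_all`, `stop_all`) and the `COMPOSE_DIR` variable are hypothetical and not part of the existing scripts.

```shell
#!/usr/bin/env bash
# Sketch: start/stop all stacks in dependency order.
# Assumption: compose files live in /opt/hetzner and follow the
# docker-compose.<stack>.yml naming shown in the table above.
COMPOSE_DIR="${COMPOSE_DIR:-/opt/hetzner}"

# Print stack names in start order, one per line.
stack_order() {
  printf '%s\n' proxy mgmt monitoring production test integration
}

# Print stack names in stop order (exact reverse of the start order).
stack_order_reversed() {
  stack_order | awk '{ a[NR] = $0 } END { for (i = NR; i > 0; i--) print a[i] }'
}

# Bring every stack up; abort on the first failure.
start_all() {
  local stack
  for stack in $(stack_order); do
    docker compose -f "$COMPOSE_DIR/docker-compose.$stack.yml" up -d || return 1
  done
}

# Take every stack down in reverse order; abort on the first failure.
stop_all() {
  local stack
  for stack in $(stack_order_reversed); do
    docker compose -f "$COMPOSE_DIR/docker-compose.$stack.yml" down || return 1
  done
}
```

After sourcing the file, `start_all` and `stop_all` run the same commands as the manual lists above.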
3. Daily Operations
Health Checks
# Quick health check — all environments
curl -sf https://app.membership-one.com/api/actuator/health | jq .
curl -sf https://test.membership-one.com/api/actuator/health | jq .
curl -sf https://integration.membership-one.com/api/actuator/health | jq .
# GitLab health
curl -sf https://gitlab.membership-one.com/-/health | head -1
# Docker container status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | sort
# Disk usage
df -h / /var/lib/docker /home /backup
# Docker volume sizes
docker system df -v
Log Viewing
# Application logs (production)
docker logs -f --tail 100 app-production
# GitLab logs
docker logs -f --tail 50 gitlab
# Keycloak logs
docker logs -f --tail 50 keycloak
# Last hour of production logs, with timestamps
docker logs --since 1h --timestamps app-production
# Via Grafana/Loki
# Navigate to: grafana.membership-one.com → Explore → Loki
# Query: {container_name="app-production"} |= "ERROR"
Database Access
# Production database
docker exec -it prod-db psql -U membership -d membership_prod
# Test database
docker exec -it test-db psql -U membership -d membership_test
# Integration database
docker exec -it int-db psql -U membership -d membership_int
# Useful queries
\dt -- List tables
\dt+ -- List tables with sizes
SELECT count(*) FROM member; -- Count members
SELECT pg_size_pretty(pg_database_size('membership_prod')); -- DB size
4. Deployment Procedures
Automated Deployment (via GitLab CI/CD)
Deployments are triggered automatically by the CI/CD pipeline:
| Branch | Environment | Trigger |
|---|---|---|
| develop | Integration | Auto on push |
| release/* | Test | Manual gate in GitLab |
| main | Production | Manual gate in GitLab |
To trigger a manual deployment:
1. Go to GitLab → CI/CD → Pipelines
2. Find the pipeline for the target branch
3. Click the play button on the deploy-test or deploy-production job
Manual Deployment
# Pull latest image and restart (e.g., production)
cd /opt/hetzner
docker compose -f docker-compose.production.yml pull app-production
docker compose -f docker-compose.production.yml up -d app-production
# Verify health after deploy
sleep 30
curl -sf http://localhost:8081/api/actuator/health | jq .
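A fixed `sleep 30` can be too short for a slow start and needlessly long for a fast one. As an alternative, the health endpoint can be polled until it responds. This is a minimal sketch; `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are assumptions to tune.

```shell
# Sketch: run a command repeatedly until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Success: stop polling immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only sleep if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Example: `wait_until 12 5 curl -sf http://localhost:8081/api/actuator/health` polls every 5 seconds for up to about a minute. `curl -f` exits non-zero on HTTP error responses, so the loop continues until the endpoint answers successfully.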
Rollback
# Find previous image digest
docker images registry.membership-one.com/membership-one/backend --digests
# Roll back to specific tag
docker compose -f docker-compose.production.yml down app-production
# Edit .env.production to set IMAGE_TAG to the previous version
docker compose -f docker-compose.production.yml up -d app-production
Database Migration (Flyway)
Flyway migrations run automatically on application startup. If a migration fails:
# Check migration status
docker exec -it prod-db psql -U membership -d membership_prod \
-c "SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 10;"
# If a migration is stuck in FAILED state (repair):
# 1. Fix the migration SQL file
# 2. Connect to the database and clean up:
docker exec -it prod-db psql -U membership -d membership_prod \
-c "DELETE FROM flyway_schema_history WHERE success = false;"
# 3. Restart the application
docker compose -f docker-compose.production.yml restart app-production
5. Backup & Restore
Backup Schedule
| Property | Value |
|---|---|
| Tool | Restic |
| Target | Hetzner Storage Box BX11 (sftp) |
| Schedule | Daily at 02:00 CET |
| Retention | 7 daily, 4 weekly, 3 monthly |
| Cron | 0 2 * * * /opt/hetzner/scripts/backup.sh |
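The retention policy in the table maps directly onto `restic forget` flags. The sketch below shows one way to express it; whether `backup.sh` already does this is not stated here, and `apply_retention` and `RESTIC_REPO` are hypothetical names.

```shell
# Sketch: apply the documented retention (7 daily, 4 weekly, 3 monthly).
# Assumption: the repository URL is the one used elsewhere in this runbook.
RESTIC_REPO="${RESTIC_REPO:-sftp:storage-box:/backups}"

# Extra arguments (e.g. --dry-run) are passed through to restic.
apply_retention() {
  restic -r "$RESTIC_REPO" forget \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 3 \
    --prune "$@"
}
```

Running `apply_retention --dry-run` first lists which snapshots would be removed without deleting anything.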
Manual Backup
# Run backup manually
/opt/hetzner/scripts/backup.sh
# Check backup status
restic -r sftp:storage-box:/backups snapshots
# Check backup integrity
restic -r sftp:storage-box:/backups check
# List files in latest snapshot
restic -r sftp:storage-box:/backups ls latest
Restore Procedure
# Interactive restore
/opt/hetzner/scripts/restore.sh
# Or manual steps:
# 1. List snapshots
restic -r sftp:storage-box:/backups snapshots
# 2. Restore specific snapshot
restic -r sftp:storage-box:/backups restore <SNAPSHOT_ID> --target /tmp/restore/
# 3. Stop services (stop rather than down, so the prod-db container still exists for step 4)
cd /opt/hetzner
docker compose -f docker-compose.production.yml stop
# 4. Restore database
docker start prod-db
docker exec -i prod-db psql -U membership -d membership_prod < /tmp/restore/backup/dumps/prod-db.sql
# 5. Restart services
docker compose -f docker-compose.production.yml up -d
Database-Only Backup
# Dump production database
docker exec prod-db pg_dump -U membership membership_prod > /backup/dumps/prod-$(date +%Y%m%d).sql
# Dump all databases
docker exec prod-db pg_dumpall -U membership > /backup/dumps/all-$(date +%Y%m%d).sql
6. SSL/TLS Certificate Management
Certificates are managed automatically by Traefik + Let's Encrypt via Cloudflare DNS-01 challenge.
Verify Certificates
# Check certificate expiry for a domain
echo | openssl s_client -servername app.membership-one.com -connect app.membership-one.com:443 2>/dev/null | openssl x509 -noout -dates
# Check all certificates via Traefik
curl -s http://localhost:8080/api/http/routers | jq '.[].tls'
Certificate Renewal Issues
If certificates fail to renew:
- Check Traefik logs:
  docker logs traefik 2>&1 | grep -i "acme\|cert\|letsencrypt"
- Verify Cloudflare API token:
  # Token must have Zone:DNS:Edit permission
  grep CF_DNS_API_TOKEN /opt/hetzner/.env
- Force renewal:
  # Remove acme.json and restart Traefik
  docker exec traefik rm /letsencrypt/acme.json
  docker restart traefik
7. Monitoring & Alerting
Dashboards
| Dashboard | URL | Purpose |
|---|---|---|
| Grafana — JVM | grafana.membership-one.com | Heap, GC, threads per environment |
| Grafana — Docker | grafana.membership-one.com | Container CPU, memory, restarts |
| Grafana — Server | grafana.membership-one.com | Host CPU, RAM, disk, network |
| Grafana — Business | grafana.membership-one.com | Active users, billing stats |
| Uptime Kuma | status.membership-one.com | External availability checks |
Key Metrics to Watch
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Disk usage (/var/lib/docker) | >70% | >85% | Clean up images, prune volumes |
| RAM usage | >80% | >90% | Check for memory leaks, reduce limits |
| Container restarts | >3/hour | >10/hour | Check logs, investigate crash loops |
| JVM heap usage | >75% | >90% | Increase -Xmx or investigate leaks |
| DB connections | >80% pool | >95% pool | Check for connection leaks |
| HTTP 5xx rate | >1% | >5% | Check application logs |
| Certificate expiry | <14 days | <3 days | Check Traefik/Cloudflare config |
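For a quick check from the shell, the "greater than" thresholds in the table can be evaluated as follows. This is a minimal sketch; `classify_usage` and `disk_usage_percent` are hypothetical helpers, not part of the Grafana alerting setup.

```shell
# Sketch: classify an integer percentage against warning/critical thresholds.
# Usage: classify_usage <percent> <warning> <critical>
classify_usage() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -gt "$crit" ]; then
    echo critical
  elif [ "$value" -gt "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Current usage of a mount point as an integer percentage (POSIX df output).
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}
```

Example: `classify_usage "$(disk_usage_percent /var/lib/docker)" 70 85` prints `ok`, `warning`, or `critical` for the Docker volume mount.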
Prometheus Queries
# CPU usage per container
rate(container_cpu_usage_seconds_total[5m])
# Memory usage per container
container_memory_usage_bytes
# HTTP request rate (Spring Boot)
rate(http_server_requests_seconds_count[5m])
# HTTP error rate
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
/ rate(http_server_requests_seconds_count[5m])
# JVM heap usage
jvm_memory_used_bytes{area="heap"}
/ jvm_memory_max_bytes{area="heap"}
# Database connection pool
hikaricp_connections_active
8. User Management
Create Developer Account
/opt/hetzner/scripts/create-developer.sh <username> <ssh-public-key-file>
This creates:
- Linux user dev-<username> with SSH key auth
- Docker group membership
- Project directory structure
- tmux configuration
- Resource limits (4 CPU, 8 GB RAM, 50 GB disk)
Remove Developer Account
# Disable account (keep data)
usermod -L dev-<username>
# Or fully remove
userdel -r dev-<username>
Keycloak User Management
- Navigate to https://auth.membership-one.com
- Login with admin credentials (from Vaultwarden)
- Select realm membership-one
- Users → Add user → Set password → Assign to group(s)
Resource Limits per User
# Check current limits
systemctl show user-<UID>.slice | grep -E "CPU|Memory"
# Modify limits (via systemd slice)
systemctl set-property user-<UID>.slice CPUQuota=400% # 4 cores
systemctl set-property user-<UID>.slice MemoryMax=8G
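The `<UID>` placeholder in the commands above is the numeric user ID, which `id -u` can look up. A minimal sketch, where `slice_for_user` is a hypothetical helper:

```shell
# Sketch: print the systemd slice name for a given login name.
# Usage: slice_for_user <username>
slice_for_user() {
  local uid
  # id -u fails for unknown users; propagate that failure.
  uid="$(id -u "$1")" || return 1
  echo "user-$uid.slice"
}
```

Example: `systemctl show "$(slice_for_user dev-ulrich)" | grep -E "CPU|Memory"`.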
9. Troubleshooting
Container Won't Start
# Check logs
docker logs <container-name> 2>&1 | tail -50
# Check events
docker events --since 1h --filter container=<container-name>
# Check resources
docker stats --no-stream
# Force recreate
docker compose -f <compose-file> up -d --force-recreate <service>
Database Connection Issues
# Test connectivity
docker exec app-production bash -c "pg_isready -h prod-db -U membership"
# Check connection pool
curl -s http://localhost:8081/api/actuator/metrics/hikaricp.connections.active | jq .
# Check PostgreSQL connections
docker exec prod-db psql -U membership -d membership_prod \
-c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
Out of Disk Space
# Quick cleanup
docker system prune -af --volumes # WARNING: removes unused volumes!
# Safer cleanup (images only)
docker image prune -af
# Check large files
ncdu /var/lib/docker
# Check Docker volumes
docker volume ls -q | while read -r v; do
  mp="$(docker volume inspect --format '{{.Mountpoint}}' "$v")"
  echo "$mp: $(du -sh "$mp" 2>/dev/null | cut -f1)"
done
GitLab Issues
# GitLab status
docker exec gitlab gitlab-ctl status
# Reconfigure after config changes
docker exec gitlab gitlab-ctl reconfigure
# GitLab Rails console (advanced)
docker exec -it gitlab gitlab-rails console
# Reset GitLab admin password
docker exec -it gitlab gitlab-rake "gitlab:password:reset[root]"
Keycloak Issues
# Keycloak logs
docker logs keycloak 2>&1 | grep -i error
# Export realm (for backup)
docker exec keycloak /opt/keycloak/bin/kc.sh export \
--dir /opt/keycloak/data/export --realm membership-one
# Import realm
docker exec keycloak /opt/keycloak/bin/kc.sh import \
--dir /opt/keycloak/data/import
High Memory Usage
# Top memory consumers (sorted by memory percentage, highest first)
docker stats --no-stream --format "{{.MemPerc}} {{.Name}} {{.MemUsage}}" | sort -rn
# Check OOM kills
dmesg | grep -i "out of memory\|oom"
# Check swap usage
free -h
swapon --show
# If GitLab is consuming too much RAM:
# Edit GITLAB_OMNIBUS_CONFIG in docker-compose.mgmt.yml:
# - Reduce Puma workers: puma['worker_processes'] = 2
# - Reduce Sidekiq concurrency: sidekiq['concurrency'] = 5
# - Disable unused features
10. Security
Firewall Rules
# View current rules
ufw status verbose
# Expected rules:
# 22/tcp ALLOW (SSH)
# 80/tcp ALLOW (HTTP → redirect to HTTPS)
# 443/tcp ALLOW (HTTPS)
# 51820/udp ALLOW (WireGuard VPN)
SSH Hardening
- Password authentication: disabled
- Root login: disabled
- Only SSH key authentication permitted
- Config: /etc/ssh/sshd_config
Security Updates
# Check for updates
apt update && apt list --upgradable
# Apply security updates only
unattended-upgrade --dry-run # preview
unattended-upgrade # apply
# Full system update (with reboot planning)
apt update && apt upgrade -y
# Reboot if kernel was updated:
# Check: ls /var/run/reboot-required
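The reboot check can be scripted so it fits into a maintenance routine. A minimal sketch, where `reboot_required` is a hypothetical helper and the marker path is the one named above:

```shell
# Sketch: succeed if the reboot marker file exists.
# Usage: reboot_required [marker_file]
reboot_required() {
  [ -f "${1:-/var/run/reboot-required}" ]
}
```

Example: `reboot_required && echo "Reboot pending; schedule it in the next maintenance window"`.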
WireGuard VPN
# VPN status
wg show
# Add peer
wg set wg0 peer <PUBLIC_KEY> allowed-ips <IP>/32
# Generate client key pair (private key in /tmp/client.key, public key in /tmp/client.pub)
wg genkey | tee /tmp/client.key | wg pubkey > /tmp/client.pub
11. Disaster Recovery
RTO / RPO Targets
| Metric | Target |
|---|---|
| RPO (Recovery Point Objective) | 24 hours (daily backups) |
| RTO (Recovery Time Objective) | 4 hours |
Full Server Recovery Procedure
- Order new Hetzner AX62 (or request server reset via Hetzner Robot)
- Run bootstrap script: /opt/hetzner/scripts/bootstrap.sh
- Restore from backup: /opt/hetzner/scripts/restore.sh
- Verify DNS: Ensure Cloudflare DNS points to new server IP
- Verify all services: Check each URL in the service map
- Verify backups: Ensure new server's backup cron is active
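The "verify all services" step can be sketched as a loop over the hostnames in the service map. This only checks that DNS, TLS, and the proxy respond; it does not validate application health. `service_hosts`, `verify_services`, and the `DOMAIN` variable are hypothetical names.

```shell
# Sketch: check that every hostname from the service map answers over HTTPS.
DOMAIN="${DOMAIN:-membership-one.com}"

# Print one hostname per line, taken from the service map in section 2.
service_hosts() {
  local sub
  for sub in gitlab wiki auth vault grafana status app test integration registry traefik; do
    echo "$sub.$DOMAIN"
  done
}

# Report OK/FAIL per host; return non-zero if any host is unreachable.
# curl without -f succeeds on any HTTP response, including 401/404,
# so services behind a login still count as reachable.
verify_services() {
  local host failed=0
  for host in $(service_hosts); do
    if curl -s -o /dev/null --max-time 10 "https://$host"; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `verify_services` after the DNS change prints one line per service and exits non-zero if any host fails.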
Partial Recovery
# Restore only a specific database
restic -r sftp:storage-box:/backups restore latest \
--target /tmp/restore --include /backup/dumps/prod-db.sql
docker exec -i prod-db psql -U membership -d membership_prod < /tmp/restore/backup/dumps/prod-db.sql
12. Maintenance Windows
Recommended Schedule
| Task | Frequency | Window |
|---|---|---|
| OS security updates | Weekly | Sunday 03:00–04:00 CET |
| Docker image updates | Monthly | First Sunday 03:00–05:00 CET |
| PostgreSQL minor updates | Quarterly | Planned maintenance window |
| Full backup restore test | Monthly | Saturday 10:00–12:00 CET |
| SSL certificate check | Weekly | Automated (Uptime Kuma) |
| Disk usage review | Weekly | Automated (Grafana alert) |
| Log rotation cleanup | Daily | Automated (Docker log rotation) |
Pre-Maintenance Checklist
- Notify team via communication channel
- Ensure latest backup is available
- Note current container image tags/digests
- Check Uptime Kuma for baseline availability
- Schedule Uptime Kuma maintenance window (suppress alerts)
Post-Maintenance Checklist
- All containers running (docker ps)
- All health endpoints responding
- Grafana dashboards showing metrics
- Uptime Kuma checks passing
- Backup cron still scheduled
- End Uptime Kuma maintenance window
13. Contact & Escalation
| Level | Contact | Scope |
|---|---|---|
| L1 | On-call developer | Application issues, restarts |
| L2 | Infrastructure admin | Server, Docker, networking |
| L3 | Hetzner Support | Hardware failures, network outages |
Hetzner Support
- Robot Console: https://robot.hetzner.com
- Support: https://docs.hetzner.com/general/others/support/
- Status Page: https://status.hetzner.com
Useful Hetzner Robot Tasks
- KVM console (remote access when SSH is down)
- Hardware RAID monitoring
- Server reboot / rescue mode
- Reverse DNS configuration