# Infrastructure and Deployment

## Overview
The Membership platform is designed for cloud-native deployment on Hetzner Cloud (Kubernetes) while maintaining the option for on-premises installation. The infrastructure supports a small team (4 people) by maximizing automation: infrastructure as code, CI/CD pipelines, automated testing, and zero-downtime deployments. The initial target market is Germany and Austria, with the architecture ready for multi-region European expansion.
## Deployment Architecture

### Cloud-Native (Primary)

#### Container Architecture

Each component runs in its own Docker container:

| Container | Base Image | Purpose | Replicas |
|---|---|---|---|
| `membership-api` | `eclipse-temurin:25-jre-alpine` | REST API server | 2+ (HPA) |
| `membership-worker` | `eclipse-temurin:25-jre-alpine` | Background jobs (billing, import, sync) | 1-2 |
| `membership-scheduler` | `eclipse-temurin:25-jre-alpine` | Cron jobs (billing cycle, status sync) | 1 (leader election) |
| `membership-migrate` | `flyway/flyway:latest` | Database migrations (init container) | 1 (run-to-completion) |
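The migrate-then-serve pattern from the table can be sketched as a Kubernetes Deployment in which `membership-migrate` runs as an init container before the API starts. Names, image tags, and the `membership-db` Secret are illustrative, not the project's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: membership-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: membership-api
  template:
    metadata:
      labels:
        app: membership-api
    spec:
      initContainers:
        - name: membership-migrate          # run-to-completion before the API starts
          image: flyway/flyway:latest
          args: ["migrate"]
          envFrom:
            - secretRef:
                name: membership-db         # hypothetical Secret with FLYWAY_URL/USER/PASSWORD
      containers:
        - name: membership-api
          image: registry.example.com/membership-api:1.0.0   # illustrative registry/tag
          ports:
            - containerPort: 8080
```

Because init containers must exit successfully before the main container starts, a failed migration blocks the rollout instead of serving traffic against a mismatched schema.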
### On-Premises Option
For organizations that require data sovereignty or have existing infrastructure:
| Component | On-Premises Alternative |
|---|---|
| Kubernetes | Docker Compose or single-node K3s |
| Managed PostgreSQL | Self-hosted PostgreSQL 18 |
| Object Storage | MinIO (S3-compatible) |
| Secrets Manager | HashiCorp Vault |
| CDN | nginx with caching |
| Monitoring | Self-hosted Prometheus + Grafana stack |
A single Docker Compose file is provided for small deployments (< 500 members, single organization). For larger deployments, the Kubernetes manifests are the recommended path.
## Hetzner Cloud Specifications
The primary production deployment uses Hetzner Cloud, selected for its German data center locations, competitive pricing, and GDPR compliance (data stays in Germany).
| Component | Hetzner Product | Specification | Monthly Cost |
|---|---|---|---|
| App Nodes (3x) | CX32 | 4 vCPU, 8 GB RAM, 80 GB NVMe | 3 x EUR 15.59 = EUR 46.77 |
| Infra Nodes (2x) | CX22 | 2 vCPU, 4 GB RAM, 40 GB NVMe | 2 x EUR 5.39 = EUR 10.78 |
| Database | Managed PostgreSQL (CPX21) | 3 vCPU, 4 GB RAM, 80 GB NVMe, PITR | EUR 17.85 |
| Load Balancer | LB11 | 25 concurrent connections, TLS termination | EUR 6.41 |
| Object Storage | Hetzner S3 | ~100 GB (documents, backups, uploads) | ~EUR 3.00 |
| Floating IPs | 2x IPv4 | Ingress + Admin access | 2 x EUR 5.05 = EUR 10.10 |
| Cloud Network | Private VLAN | 10.0.0.0/16 (3 subnets) | EUR 0.00 |
| Total Hetzner | | | ~EUR 94.91 |
| Cloudflare | Free Plan | DNS, CDN, WAF, DDoS protection | EUR 0.00 |
| Domains | .com + .de | 2 domains | ~EUR 3.00 |
| Total Infrastructure | | | ~EUR 100/month |
Locations:

- Primary: Nuremberg (nbg1) -- all production workloads
- DR: Falkenstein (fsn1) -- backup replication, failover target
- RTO: 4 hours | RPO: 1 hour
## CI/CD Pipeline

### Pipeline Architecture

### Pipeline Stages
| Stage | Tool | Duration Target | Gate |
|---|---|---|---|
| Compile | Maven | < 2 min | Compilation success |
| Unit Tests | JUnit 5 + JaCoCo | < 5 min | All pass, coverage > 80% |
| Code Quality | SonarQube | < 3 min | No new Critical/Blocker issues |
| Security Scan | Trivy (container) + Snyk (dependencies) | < 3 min | No Critical CVEs |
| Integration Tests | Testcontainers (PostgreSQL, Redis, RabbitMQ) | < 10 min | All pass |
| Build Image | Docker BuildKit | < 3 min | Image builds successfully |
| Deploy Dev | Kubernetes (kubectl/Helm) | < 2 min | Pods healthy |
| Smoke Tests | REST Assured | < 2 min | Health + key endpoints respond |
| Deploy Staging | Kubernetes (Helm) | < 2 min | Pods healthy |
| E2E Tests | Playwright (Flutter Web) | < 15 min | All pass |
| Deploy Production | Kubernetes (Helm, canary) | < 5 min | Canary metrics healthy |
Total pipeline target: < 50 minutes from push to production.
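The chapter does not name the CI system; since GitLab appears among this document's self-hosted services, the stage layout might be sketched as a `.gitlab-ci.yml` skeleton (stage names and the job body are illustrative, job definitions omitted):

```yaml
stages:
  - compile
  - unit-test
  - quality
  - security-scan
  - integration-test
  - build-image
  - deploy-dev
  - smoke-test
  - deploy-staging
  - e2e-test
  - deploy-production

deploy-production:
  stage: deploy-production
  script:
    - helm upgrade --install membership ./charts/membership
  environment: production
  when: manual            # production deploys require manual approval
  only:
    - main
```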
### Deployment Strategy

- Development: Direct deployment on every merge to the `develop` branch
- Staging: Automatic deployment on every merge to the `main` branch
- Production: Canary deployment (10% traffic for 15 minutes, automatic rollback on error rate > 1%)
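The canary policy can also be expressed declaratively. This sketch assumes Flagger as the canary controller (the chapter does not name one), mapping "error rate > 1%" onto a 99% floor for Flagger's built-in request-success-rate metric:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: membership-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: membership-api
  service:
    port: 8080
  analysis:
    interval: 1m         # evaluate metrics every minute during the canary window
    threshold: 3         # roll back after 3 failed metric checks
    maxWeight: 10        # cap canary traffic at 10%
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99        # success rate >= 99%, i.e. error rate < 1%
        interval: 1m
```

How long the canary holds at 10% before promotion depends on the controller's analysis settings; the 15-minute window would be tuned there.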
### Rollback

- Automated: If a canary deployment shows elevated error rates or latency, Kubernetes automatically rolls back to the previous version
- Manual: `helm rollback membership <revision>` for instant rollback to any previous release
- Database: Flyway migrations are forward-only; rollback scripts are pre-written for each migration and tested in CI
### Docker Compose (Local Development)

```yaml
# docker-compose.yml (for local development)
services:
  postgres:
    image: postgres:18-alpine
    environment:
      POSTGRES_DB: membership
      POSTGRES_USER: membership
      POSTGRES_PASSWORD: ${DB_PASSWORD:-devpass}
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./db/init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U membership"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  rabbitmq:
    image: rabbitmq:4-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RMQ_USER:-guest}
      RABBITMQ_DEFAULT_PASS: ${RMQ_PASS:-guest}

volumes:
  postgres-data:
```
## Monitoring and Observability

### Metrics (Prometheus + Grafana)
| Category | Metrics |
|---|---|
| Application | Request rate, error rate, latency (p50/p95/p99), active sessions, JVM heap, GC pauses |
| Business | Active members, new registrations/day, check-ins/hour, billing success rate, open debt total |
| Infrastructure | CPU, memory, disk I/O, network, pod restarts, database connections |
| External | Cash360 API latency, Cash360 error rate, email delivery rate |
### Infrastructure Monitoring (Icinga)
In addition to Prometheus/Grafana for application and business metrics, Icinga provides classical infrastructure monitoring with a focus on external service checks and SSL/TLS certificate oversight:
| Check | Frequency | Alert Threshold | Action |
|---|---|---|---|
| SSL/TLS certificate expiry (all domains) | Every 6 hours | < 14 days remaining | P2 — trigger Dehydrated renewal |
| SSL certificate chain validity | Daily | Invalid chain or weak cipher | P1 — investigate immediately |
| External HTTP endpoint availability | Every 1 min | 3 consecutive failures | P1 — DNS failover, investigate |
| DNS resolution (membership-one.com) | Every 5 min | Resolution failure | P1 — check Cloudflare |
| SMTP delivery (outbound email) | Every 15 min | Delivery failure | P2 — check SMTP provider |
| Cash360 API reachability | Every 2 min | 3 consecutive failures | P1 — queue locally, escalate |
| Hetzner API status | Every 5 min | API errors | P2 — check Hetzner status page |
| PostgreSQL replication lag (to fsn1) | Every 5 min | Lag > 60 seconds | P2 — investigate network/load |
Icinga deployment: Self-hosted on Hetzner infra node (Docker container). Checks run from outside the Kubernetes cluster to provide an independent monitoring perspective. Integrates with Grafana for unified dashboards and with Telegram/email for alerting.
Monitoring layers: Prometheus handles application metrics and pod-level health (inside the cluster). Icinga handles infrastructure checks and external endpoint monitoring (outside the cluster). Together they provide defense-in-depth observability.
### TLS Certificate Management (Dehydrated)
All HTTPS/TLS certificates for origin servers (Hetzner Load Balancer, internal services) are managed via Dehydrated, a lightweight ACME shell client for Let's Encrypt:
| Domain | Certificate Source | Renewal Method |
|---|---|---|
| `*.membership-one.com` (edge) | Cloudflare (automatic) | Managed by Cloudflare |
| `*.membership-one.com` (origin) | Let's Encrypt via Dehydrated | ACME DNS-01 challenge (Cloudflare DNS API) |
| `*.membership.internal` (internal) | Let's Encrypt via Dehydrated | ACME DNS-01 challenge |
| GitLab, Vaultwarden, monitoring UIs | Let's Encrypt via Dehydrated | ACME HTTP-01 or DNS-01 |
Dehydrated configuration:

- Runs as a cron job (daily check; renews 30 days before expiry)
- DNS-01 challenge via a Cloudflare API hook (supports wildcard certificates)
- Certificates deployed to the Hetzner Load Balancer via a Hetzner API post-hook
- Kubernetes Ingress certificates updated via a kubectl post-hook
- All renewals logged and monitored by Icinga (certificate expiry check)
- Failure alerts sent via email and Telegram
Why Dehydrated over cert-manager: Dehydrated is simpler to operate for a small team, works both inside and outside Kubernetes, and the team already has operational experience with it. cert-manager remains a viable alternative for pure Kubernetes environments.
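As one concrete shape for the renewal cron job, the daily Dehydrated run could be a Kubernetes CronJob; the chapter's setup may equally be a plain crontab on a VM. The image and Secret name below are assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dehydrated-renew
spec:
  schedule: "17 3 * * *"            # daily check; Dehydrated skips certs with >30 days left
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dehydrated
              image: registry.example.com/dehydrated:latest   # hypothetical image bundling dehydrated + hooks
              args: ["--cron", "--challenge", "dns-01"]
              envFrom:
                - secretRef:
                    name: cloudflare-api-token                # hypothetical Secret for the DNS-01 hook
```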
### Alerting Rules
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High error rate | HTTP 5xx > 5% for 5 min | Critical | Page on-call |
| API latency | p95 > 2s for 10 min | Warning | Investigate |
| Database connections | Pool utilization > 80% | Warning | Scale or investigate |
| Billing failure rate | > 10% failures in billing cycle | Critical | Page on-call |
| Cash360 unreachable | No successful response for 5 min | Critical | Queue locally, page |
| Disk usage | > 85% on any volume | Warning | Expand or clean up |
| Certificate expiry | < 14 days to expiry (Icinga check) | Warning | Dehydrated auto-renews; investigate if renewal fails |
| Pod restart loop | > 3 restarts in 10 min | Critical | Investigate |
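The first two alerts could be encoded as Prometheus alerting rules roughly as follows; the metric names follow common Spring Boot/Micrometer conventions and are assumptions about this codebase:

```yaml
groups:
  - name: membership-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx above 5% for 5 minutes"
      - alert: ApiLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_requests_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 2s for 10 minutes"
```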
### Logging
- Structured JSON logging via SLF4J + Logback
- Log aggregation via Loki (Grafana stack) or ELK (Elasticsearch, Logstash, Kibana)
- Correlation IDs: Every request gets a unique `X-Request-Id` header, propagated through all service calls and logged on every log line
- PII redaction: An automatic log filter masks email addresses, IBANs, and phone numbers in log output
- Retention: Application logs are retained for 30 days; audit logs for 7 years
### Health Checks
The application exposes Spring Actuator health endpoints:
| Endpoint | Purpose |
|---|---|
| `/actuator/health/liveness` | Is the process alive? (Kubernetes liveness probe) |
| `/actuator/health/readiness` | Can the process handle requests? (Kubernetes readiness probe) |
| `/actuator/health` | Full health status (DB, Redis, RabbitMQ, Cash360 connectivity) |
| `/actuator/prometheus` | Metrics scrape endpoint for Prometheus |
| `/actuator/info` | Application version, build time, git commit |
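On the Kubernetes side, the two probe endpoints map onto the pod spec roughly as follows (thresholds are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

Spring Boot exposes the liveness/readiness health groups automatically when it detects a Kubernetes environment; otherwise they can be switched on with `management.endpoint.health.probes.enabled=true`.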
## Database Infrastructure

### PostgreSQL Configuration
| Parameter | Development | Staging | Production |
|---|---|---|---|
| Version | 18 (Alpine) | 18 (Managed) | 18 (Managed, HA) |
| Instance | Docker container | Single node | Primary + read replica |
| Storage | Local volume | 50 GB SSD | 200 GB SSD, auto-expand |
| Max connections | 20 | 50 | 200 |
| Backups | None | Daily | Continuous WAL archiving (PITR) |
| Encryption | None | At rest (AES-256) | At rest + in transit |
### Flyway Migrations
Database schema changes are managed exclusively through Flyway migrations:
- Location: `src/main/resources/db/migration/`
- Naming: `V{version}__{description}.sql` (e.g., `V001__initial_schema.sql`)
- Execution: Automatically on application startup (dev/staging) or via init container (production)
- Validation: `flyway.validateOnMigrate=true` ensures schema consistency
- Rollback scripts: Every migration `V{n}__*.sql` has a corresponding `U{n}__*.sql` undo script, tested in CI (`U__` is Flyway's undo-script prefix; the `R__` prefix is reserved by Flyway for repeatable migrations and must not be used for rollbacks)
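Expressed as Spring Boot configuration, the settings above look roughly like this (keys are standard `spring.flyway.*` properties; values are illustrative):

```yaml
spring:
  flyway:
    enabled: true                        # migrate automatically on startup (dev/staging)
    locations: classpath:db/migration    # matches src/main/resources/db/migration/
    validate-on-migrate: true            # fail fast if applied migrations drift from the files
```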
### Redis Configuration
| Use Case | Configuration |
|---|---|
| Session storage | TTL = 30 minutes, serialization = JSON |
| Rate limiting | Sliding window counters, TTL = 1 minute |
| Caching | Entity settings: TTL = 5 minutes. Member data: not cached (GDPR) |
| Distributed locks | Redisson for billing cycle leader election |
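A sketch of the table as Spring Boot configuration, assuming Spring Session and Spring Cache back onto this Redis (the property keys are standard; keeping member data out of the cache is enforced in code, not here):

```yaml
spring:
  session:
    timeout: 30m               # session TTL from the table
  cache:
    type: redis
    redis:
      time-to-live: 5m         # entity-settings cache; member data is never cached (GDPR)
```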
## Environment Configuration

### Development Environments
| Environment | Purpose | URL | Deployment |
|---|---|---|---|
| Local | Developer machine | `localhost:8080` | Docker Compose |
| Dev | Integration testing | `dev.membership.internal` | Auto-deploy on push to `develop` |
| Staging | Pre-production validation | `staging.membership.internal` | Auto-deploy on push to `main` |
| Production | Live system | `api.membership-one.com` | Canary deployment, manual approval |
### Configuration Management
Configuration follows the 12-factor app methodology: all environment-specific values are injected via environment variables, never hardcoded.
| Configuration Source | Priority | Use Case |
|---|---|---|
| Environment variables | Highest | Secrets, database URLs, external API keys |
| Kubernetes ConfigMaps | Medium | Feature flags, non-secret configuration |
| `application.yml` | Lowest | Default values, structure |
Secret management: Application secrets (database passwords, API keys, JWT private keys, encryption keys) are stored as Kubernetes Secrets, encrypted at rest in the Git repository via SOPS (alternatively Bitnami Sealed Secrets), and mounted as environment variables into pods at startup. Team-accessed credentials (admin panels, third-party portals, infrastructure access) are stored in Vaultwarden (a self-hosted, Bitwarden-compatible password manager). For on-premises deployments, HashiCorp Vault serves both purposes. See Chapter 13 for the full credential management model.
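In pod terms, the precedence model means secrets and non-secret configuration arrive as environment variables while `application.yml` ships inside the image. A sketch with illustrative resource names:

```yaml
containers:
  - name: membership-api
    image: registry.example.com/membership-api:1.0.0   # illustrative
    envFrom:
      - configMapRef:
          name: membership-config        # feature flags, non-secret configuration
      - secretRef:
          name: membership-secrets       # DB password, API keys, JWT private keys
    env:
      - name: SPRING_PROFILES_ACTIVE
        value: production
```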
## CDN and Static Assets

### CDN Strategy
| Content Type | CDN Caching | TTL |
|---|---|---|
| Flutter web assets (JS, CSS, WASM) | Yes | 1 year (versioned filenames) |
| Member profile photos | Yes | 1 hour (cache-busted on update) |
| Generated PDFs (contracts, invoices) | No | Not cached (authenticated access) |
| Entity branding assets (logos, themes) | Yes | 1 day |
| API responses | No | Dynamic, not cacheable |
### Offline Mode
Mobile apps support offline operation for critical workflows:
| Workflow | Offline Support | Sync Strategy |
|---|---|---|
| Check-in | Full (cached member list) | Background sync when online |
| Attendance marking | Full (local storage) | Queue + sync on reconnect |
| Member lookup | Partial (basic profile data) | Stale data with "offline" indicator |
| Payment recording | Queue only (no processing) | Process on reconnect |
| Course booking | Not available offline | Requires server validation |
## Scalability

### Horizontal Scaling
| Component | Scaling Trigger | Min | Max |
|---|---|---|---|
| API pods | CPU > 70% or requests > 100/s/pod | 2 | 10 |
| Worker pods | Queue depth > 100 messages | 1 | 3 |
| Database read replicas | Read query latency > 100ms | 0 | 2 |
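The API-pod row can be written as a standard `autoscaling/v2` HorizontalPodAutoscaler. The requests-per-second trigger would require a custom metrics adapter, so only the CPU target is shown here:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: membership-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: membership-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # matches the CPU > 70% trigger in the table
```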
### Multi-Region Readiness
The initial deployment targets Hetzner Cloud Nuremberg (nbg1) for Germany/Austria, with disaster recovery in Falkenstein (fsn1). The architecture supports multi-region expansion:
- Data residency: PostgreSQL replication to regional instances. Tenant data pinned to a specific region.
- CDN: Edge locations across Europe for static assets.
- DNS routing: GeoDNS to route users to nearest API cluster.
- Compliance: Per-region configuration for tax rules, payment methods, and data protection requirements.
Regional expansion is not required for v1.0, but the architecture avoids decisions that would foreclose it (e.g., no region-specific hardcoding, all timestamps stored in UTC, locale-aware formatting at the presentation layer).