Infrastructure and Deployment

Overview

The Membership platform is designed for cloud-native deployment on Hetzner Cloud (Kubernetes) while maintaining the option for on-premises installation. The infrastructure supports a small team (4 people) by maximizing automation: infrastructure as code, CI/CD pipelines, automated testing, and zero-downtime deployments. The initial target market is Germany and Austria, with the architecture ready for multi-region European expansion.


Deployment Architecture

Cloud-Native (Primary)

graph TB
    subgraph "Internet"
        USERS[Users / Mobile Apps / Web]
    end
    subgraph "Edge"
        CDN[CDN - Cloudflare / CloudFront]
        WAF[WAF - Web Application Firewall]
    end
    subgraph "Kubernetes Cluster"
        ING[Ingress Controller - nginx]
        subgraph "Application Pods"
            API1[API Pod 1]
            API2[API Pod 2]
            API3[API Pod N]
            WORKER[Worker Pod - Billing / Import]
            SCHEDULER[Scheduler Pod - Cron Jobs]
        end
        subgraph "Infrastructure Pods"
            REDIS[Redis - Cache / Sessions]
            MQ[RabbitMQ - Messaging]
        end
    end
    subgraph "Managed Services"
        DB[(PostgreSQL - Managed)]
        S3[(Object Storage - Hetzner S3)]
        VAULT[Secrets - Sealed Secrets]
        VW[Vaultwarden - Team Credentials]
        MON[Monitoring - Prometheus/Grafana + Icinga]
        LOG[Logging - Loki/ELK]
    end
    USERS --> CDN --> WAF --> ING
    ING --> API1
    ING --> API2
    ING --> API3
    API1 --> DB
    API1 --> REDIS
    API1 --> MQ
    API1 --> S3
    API1 --> VAULT
    WORKER --> DB
    WORKER --> MQ
    SCHEDULER --> DB
    SCHEDULER --> MQ
    API1 --> MON
    API1 --> LOG

Container Architecture

Each component runs in its own Docker container:

| Container | Base Image | Purpose | Replicas |
| --- | --- | --- | --- |
| membership-api | eclipse-temurin:25-jre-alpine | REST API server | 2+ (HPA) |
| membership-worker | eclipse-temurin:25-jre-alpine | Background jobs (billing, import, sync) | 1-2 |
| membership-scheduler | eclipse-temurin:25-jre-alpine | Cron jobs (billing cycle, status sync) | 1 (leader election) |
| membership-migrate | flyway/flyway:latest | Database migrations (init container) | 1 (run-to-completion) |
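
The run-to-completion migration container can be wired in as a Kubernetes init container ahead of the API container. A minimal sketch, assuming a `db-credentials` Secret that carries the Flyway connection settings (all names and image tags here are illustrative, not the project's actual manifests):

```yaml
# Sketch: run Flyway migrations to completion before the API starts.
# The "db-credentials" Secret and image tags are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: membership-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: membership-api
  template:
    metadata:
      labels:
        app: membership-api
    spec:
      initContainers:
        - name: membership-migrate
          image: flyway/flyway:latest
          args: ["migrate"]            # pod starts the API only after this exits 0
          envFrom:
            - secretRef:
                name: db-credentials   # FLYWAY_URL, FLYWAY_USER, FLYWAY_PASSWORD
      containers:
        - name: membership-api
          image: registry.example.com/membership-api:latest
          ports:
            - containerPort: 8080
```

Because init containers block pod startup, a failed migration keeps the old API pods serving traffic instead of rolling out a broken schema.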

On-Premises Option

For organizations that require data sovereignty or have existing infrastructure:

| Component | On-Premises Alternative |
| --- | --- |
| Kubernetes | Docker Compose or single-node K3s |
| Managed PostgreSQL | Self-hosted PostgreSQL 18 |
| Object Storage | MinIO (S3-compatible) |
| Secrets Manager | HashiCorp Vault |
| CDN | nginx with caching |
| Monitoring | Self-hosted Prometheus + Grafana stack |

A single Docker Compose file is provided for small deployments (< 500 members, single organization). For larger deployments, the Kubernetes manifests are the recommended path.

Hetzner Cloud Specifications

The primary production deployment uses Hetzner Cloud, selected for its German data center locations, competitive pricing, and GDPR compliance (data stays in Germany).

| Component | Hetzner Product | Specification | Monthly Cost |
| --- | --- | --- | --- |
| App Nodes (3x) | CX32 | 4 vCPU, 8 GB RAM, 80 GB NVMe | 3 x EUR 15.59 = EUR 46.77 |
| Infra Nodes (2x) | CX22 | 2 vCPU, 4 GB RAM, 40 GB NVMe | 2 x EUR 5.39 = EUR 10.78 |
| Database | Managed PostgreSQL (CPX21) | 3 vCPU, 4 GB RAM, 80 GB NVMe, PITR | EUR 17.85 |
| Load Balancer | LB11 | Up to 25 targets, TLS termination | EUR 6.41 |
| Object Storage | Hetzner S3 | ~100 GB (documents, backups, uploads) | ~EUR 3.00 |
| Floating IPs | 2x IPv4 | Ingress + Admin access | 2 x EUR 5.05 = EUR 10.10 |
| Cloud Network | Private VLAN | 10.0.0.0/16 (3 subnets) | EUR 0.00 |
| Total Hetzner | | | ~EUR 94.91 |
| Cloudflare | Free Plan | DNS, CDN, WAF, DDoS protection | EUR 0.00 |
| Domains | .com + .de | 2 domains | ~EUR 3.00 |
| Total Infrastructure | | | ~EUR 100/month |

Locations:

  • Primary: Nuremberg (nbg1) -- all production workloads
  • DR: Falkenstein (fsn1) -- backup replication, failover target
  • RTO: 4 hours | RPO: 1 hour


CI/CD Pipeline

Pipeline Architecture

graph LR
    DEV[Developer Push] --> GIT[GitLab]
    GIT --> CI[CI Pipeline]
    subgraph "CI Pipeline"
        BUILD[Build + Unit Tests]
        LINT[Code Quality - SonarQube]
        SEC[Security Scan - Trivy/Snyk]
        INT[Integration Tests]
        IMG[Build Docker Image]
    end
    CI --> REG[Container Registry]
    REG --> CD[CD Pipeline]
    subgraph "CD Pipeline"
        DEV_ENV[Deploy to Dev]
        SMOKE[Smoke Tests]
        STG[Deploy to Staging]
        E2E[E2E Tests]
        PROD[Deploy to Production]
    end
    CD --> PROD

Pipeline Stages

| Stage | Tool | Duration Target | Gate |
| --- | --- | --- | --- |
| Compile | Maven | < 2 min | Compilation success |
| Unit Tests | JUnit 5 + JaCoCo | < 5 min | All pass, coverage > 80% |
| Code Quality | SonarQube | < 3 min | No new Critical/Blocker issues |
| Security Scan | Trivy (container) + Snyk (dependencies) | < 3 min | No Critical CVEs |
| Integration Tests | Testcontainers (PostgreSQL, Redis, RabbitMQ) | < 10 min | All pass |
| Build Image | Docker BuildKit | < 3 min | Image builds successfully |
| Deploy Dev | Kubernetes (kubectl/Helm) | < 2 min | Pods healthy |
| Smoke Tests | REST Assured | < 2 min | Health + key endpoints respond |
| Deploy Staging | Kubernetes (Helm) | < 2 min | Pods healthy |
| E2E Tests | Playwright (Flutter Web) | < 15 min | All pass |
| Deploy Production | Kubernetes (Helm, canary) | < 5 min | Canary metrics healthy |

Total pipeline target: < 50 minutes from push to production.
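
A sketch of how these stages could map onto a .gitlab-ci.yml (job names, images, and script details are illustrative, not the project's actual pipeline):

```yaml
# Sketch of a GitLab CI layout matching the stage table above.
# Images, profiles, and chart paths are assumptions.
stages: [build, quality, integration, package, deploy]

build:
  stage: build
  image: maven:3.9            # JDK version pinned in the real pipeline
  script:
    - mvn -B verify           # compile + unit tests with JaCoCo coverage gate

security_scan:
  stage: quality
  script:
    - trivy image "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

integration_tests:
  stage: integration
  script:
    - mvn -B verify -Pintegration   # Testcontainers: PostgreSQL, Redis, RabbitMQ

package:
  stage: package
  script:
    - docker buildx build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" --push .

deploy_dev:
  stage: deploy
  environment: dev
  only: [develop]
  script:
    - helm upgrade --install membership ./chart --set image.tag="$CI_COMMIT_SHA"
```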

Deployment Strategy

  • Development: Direct deployment on every merge to develop branch
  • Staging: Automatic deployment on every merge to main branch
  • Production: Canary deployment (10% traffic for 15 minutes, automatic rollback on error rate > 1%)
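
With the nginx ingress controller from the architecture diagram, the 10% canary split can be sketched as a second Ingress carrying canary annotations (host, service, and Ingress names are assumptions; a rollout controller such as Argo Rollouts or Flagger could instead manage the weight shift and automatic rollback):

```yaml
# Sketch: route 10% of traffic to the canary release via ingress-nginx.
# Names are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: membership-api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% of requests
spec:
  ingressClassName: nginx
  rules:
    - host: api.membership-one.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: membership-api-canary
                port:
                  number: 8080
```

Deleting the canary Ingress (or setting canary-weight to 0) returns all traffic to the stable release.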

Rollback

  • Automated: If the canary shows elevated error rates or latency, the rollout automation reverts traffic to the previous version
  • Manual: helm rollback membership <revision> for instant rollback to any previous release
  • Database: Flyway migrations are forward-only. Rollback scripts are pre-written for each migration and tested in CI.

Docker Compose (Local Development)

# docker-compose.yml (for local development)
services:
  postgres:
    image: postgres:18-alpine
    environment:
      POSTGRES_DB: membership
      POSTGRES_USER: membership
      POSTGRES_PASSWORD: ${DB_PASSWORD:-devpass}
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./db/init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U membership"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  rabbitmq:
    image: rabbitmq:4-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RMQ_USER:-guest}
      RABBITMQ_DEFAULT_PASS: ${RMQ_PASS:-guest}

volumes:
  postgres-data:

Monitoring and Observability

Metrics (Prometheus + Grafana)

| Category | Metrics |
| --- | --- |
| Application | Request rate, error rate, latency (p50/p95/p99), active sessions, JVM heap, GC pauses |
| Business | Active members, new registrations/day, check-ins/hour, billing success rate, open debt total |
| Infrastructure | CPU, memory, disk I/O, network, pod restarts, database connections |
| External | Cash360 API latency, Cash360 error rate, email delivery rate |

Infrastructure Monitoring (Icinga)

In addition to Prometheus/Grafana for application and business metrics, Icinga provides classical infrastructure monitoring with a focus on external service checks and SSL/TLS certificate oversight:

| Check | Frequency | Alert Threshold | Action |
| --- | --- | --- | --- |
| SSL/TLS certificate expiry (all domains) | Every 6 hours | < 14 days remaining | P2 -- trigger Dehydrated renewal |
| SSL certificate chain validity | Daily | Invalid chain or weak cipher | P1 -- investigate immediately |
| External HTTP endpoint availability | Every 1 min | 3 consecutive failures | P1 -- DNS failover, investigate |
| DNS resolution (membership-one.com) | Every 5 min | Resolution failure | P1 -- check Cloudflare |
| SMTP delivery (outbound email) | Every 15 min | Delivery failure | P2 -- check SMTP provider |
| Cash360 API reachability | Every 2 min | 3 consecutive failures | P1 -- queue locally, escalate |
| Hetzner API status | Every 5 min | API errors | P2 -- check Hetzner status page |
| PostgreSQL replication lag (to fsn1) | Every 5 min | Lag > 60 seconds | P2 -- investigate network/load |

Icinga deployment: Self-hosted on Hetzner infra node (Docker container). Checks run from outside the Kubernetes cluster to provide an independent monitoring perspective. Integrates with Grafana for unified dashboards and with Telegram/email for alerting.

Monitoring layers: Prometheus handles application metrics and pod-level health (inside the cluster). Icinga handles infrastructure checks and external endpoint monitoring (outside the cluster). Together they provide defense-in-depth observability.

TLS Certificate Management (Dehydrated)

All HTTPS/TLS certificates for origin servers (Hetzner Load Balancer, internal services) are managed via Dehydrated, a lightweight ACME shell client for Let's Encrypt:

| Domain | Certificate Source | Renewal Method |
| --- | --- | --- |
| *.membership-one.com (edge) | Cloudflare (automatic) | Managed by Cloudflare |
| *.membership-one.com (origin) | Let's Encrypt via Dehydrated | ACME DNS-01 challenge (Cloudflare DNS API) |
| *.membership.internal (internal) | Let's Encrypt via Dehydrated | ACME DNS-01 challenge |
| GitLab, Vaultwarden, monitoring UIs | Let's Encrypt via Dehydrated | ACME HTTP-01 or DNS-01 |

Dehydrated configuration:

  • Runs as a cron job (daily check, renews 30 days before expiry)
  • DNS-01 challenge via Cloudflare API hook (supports wildcard certificates)
  • Certificates deployed to the Hetzner Load Balancer via Hetzner API post-hook
  • Kubernetes Ingress certificates updated via kubectl post-hook
  • All renewals logged and monitored by Icinga (certificate expiry check)
  • Failure alerts sent via email and Telegram

Why Dehydrated over cert-manager: Dehydrated is simpler to operate for a small team, works both inside and outside Kubernetes, and the team already has operational experience with it. cert-manager remains a viable alternative for pure Kubernetes environments.

Alerting Rules

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| High error rate | HTTP 5xx > 5% for 5 min | Critical | Page on-call |
| API latency | p95 > 2s for 10 min | Warning | Investigate |
| Database connections | Pool utilization > 80% | Warning | Scale or investigate |
| Billing failure rate | > 10% failures in billing cycle | Critical | Page on-call |
| Cash360 unreachable | No successful response for 5 min | Critical | Queue locally, page |
| Disk usage | > 85% on any volume | Warning | Expand or clean up |
| Certificate expiry | < 14 days to expiry (Icinga check) | Warning | Dehydrated auto-renews; investigate if renewal fails |
| Pod restart loop | > 3 restarts in 10 min | Critical | Investigate |
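
As one concrete example, the first row can be written as a Prometheus alerting rule. The metric name below is the Micrometer default for Spring Boot HTTP servers; treating it (and its labels) as present in this deployment is an assumption:

```yaml
# Sketch: "High error rate" alert as a Prometheus rule.
# http_server_requests_seconds_count is the Micrometer default metric;
# label names may differ in the real deployment.
groups:
  - name: membership-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate above 5% for 5 minutes"
```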

Logging

  • Structured JSON logging via SLF4J + Logback
  • Log aggregation via Loki (Grafana stack) or ELK (Elasticsearch, Logstash, Kibana)
  • Correlation IDs: Every request gets a unique X-Request-Id header, propagated through all service calls and logged in every log line
  • PII redaction: Automatic log filter that masks email addresses, IBANs, and phone numbers in log output
  • Retention: Application logs retained for 30 days, audit logs retained for 7 years

Health Checks

The application exposes Spring Actuator health endpoints:

| Endpoint | Purpose |
| --- | --- |
| /actuator/health/liveness | Is the process alive? (Kubernetes liveness probe) |
| /actuator/health/readiness | Can the process handle requests? (Kubernetes readiness probe) |
| /actuator/health | Full health status (DB, Redis, RabbitMQ, Cash360 connectivity) |
| /actuator/prometheus | Metrics scrape endpoint for Prometheus |
| /actuator/info | Application version, build time, git commit |
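
A minimal application.yml exposing these endpoints could look like the following sketch; these are standard Spring Boot Actuator properties, though the project's actual file may differ:

```yaml
# Sketch: enable Kubernetes probes and expose health/info/prometheus.
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      probes:
        enabled: true          # adds /actuator/health/liveness and /readiness
      show-details: when-authorized
```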

Database Infrastructure

PostgreSQL Configuration

| Parameter | Development | Staging | Production |
| --- | --- | --- | --- |
| Version | 18 (Alpine) | 18 (Managed) | 18 (Managed, HA) |
| Instance | Docker container | Single node | Primary + read replica |
| Storage | Local volume | 50 GB SSD | 200 GB SSD, auto-expand |
| Max connections | 20 | 50 | 200 |
| Backups | None | Daily | Continuous WAL archiving (PITR) |
| Encryption | None | At rest (AES-256) | At rest + in transit |

Flyway Migrations

Database schema changes are managed exclusively through Flyway migrations:

  • Location: src/main/resources/db/migration/
  • Naming: V{version}__{description}.sql (e.g., V001__initial_schema.sql)
  • Execution: Automatically on application startup (dev/staging) or via init container (production)
  • Validation: flyway.validateOnMigrate=true ensures schema consistency
  • Rollback scripts: Every migration V{n}__*.sql has a corresponding U{n}__*.sql undo script (Flyway's undo naming; the R__ prefix is reserved for repeatable migrations), tested in CI
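
In a Spring Boot setup these conventions map onto standard spring.flyway properties. A sketch with illustrative values:

```yaml
# Sketch: Flyway settings in application.yml (dev/staging startup migration).
spring:
  flyway:
    enabled: true
    locations: classpath:db/migration
    validate-on-migrate: true    # fail fast if an applied migration was altered
    baseline-on-migrate: false   # schema is always created from V001 onward
```

In production this block is disabled and the init container runs the same migrations instead.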

Redis Configuration

| Use Case | Configuration |
| --- | --- |
| Session storage | TTL = 30 minutes, serialization = JSON |
| Rate limiting | Sliding window counters, TTL = 1 minute |
| Caching | Entity settings: TTL = 5 minutes. Member data: not cached (GDPR) |
| Distributed locks | Redisson for billing cycle leader election |
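
The session and cache rows correspond roughly to the following Spring Boot properties (standard Spring Session / Spring Data Redis keys; a sketch, not the project's actual configuration):

```yaml
# Sketch: Redis-backed sessions and cache TTLs matching the table above.
spring:
  session:
    timeout: 30m                 # session TTL
    redis:
      namespace: membership:sessions
  data:
    redis:
      host: ${REDIS_HOST:localhost}
      port: 6379
  cache:
    type: redis
    redis:
      time-to-live: 5m           # entity settings only; member data is never cached
```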

Environment Configuration

Development Environments

| Environment | Purpose | URL | Deployment |
| --- | --- | --- | --- |
| Local | Developer machine | localhost:8080 | Docker Compose |
| Dev | Integration testing | dev.membership.internal | Auto-deploy on push to develop |
| Staging | Pre-production validation | staging.membership.internal | Auto-deploy on push to main |
| Production | Live system | api.membership-one.com | Canary deployment, manual approval |

Configuration Management

Configuration follows the 12-factor app methodology: all environment-specific values are injected via environment variables, never hardcoded.

| Configuration Source | Priority | Use Case |
| --- | --- | --- |
| Environment variables | Highest | Secrets, database URLs, external API keys |
| Kubernetes ConfigMaps | Medium | Feature flags, non-secret configuration |
| application.yml | Lowest | Default values, structure |

Secret management: Application secrets (database passwords, API keys, JWT private keys, encryption keys) are stored as Kubernetes Secrets, managed via Bitnami Sealed Secrets (the encrypted SealedSecret resources are safe to commit to Git; the in-cluster controller decrypts them), and mounted as environment variables into pods at startup. Team-accessed credentials (admin panels, third-party portals, infrastructure access) are stored in Vaultwarden (self-hosted Bitwarden-compatible password manager). For on-premises deployments, HashiCorp Vault serves both purposes. See Chapter 13 for the full credential management model.
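
The priority order above maps onto Kubernetes in a straightforward way. A sketch with illustrative names (the real ConfigMap and Secret names may differ):

```yaml
# Sketch: non-secret configuration via ConfigMap; secrets arrive as a
# Kubernetes Secret unsealed by the Sealed Secrets controller.
apiVersion: v1
kind: ConfigMap
metadata:
  name: membership-config
data:
  FEATURE_COURSE_BOOKING: "true"   # feature flag (medium priority)
  LOG_LEVEL: "INFO"
# In the Deployment's pod spec, both sources are injected as environment
# variables, which override application.yml defaults:
#
#   containers:
#     - name: membership-api
#       envFrom:
#         - configMapRef:
#             name: membership-config
#         - secretRef:
#             name: membership-secrets   # decrypted SealedSecret
```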


CDN and Static Assets

CDN Strategy

| Content Type | CDN | Caching TTL |
| --- | --- | --- |
| Flutter web assets (JS, CSS, WASM) | Yes | 1 year (versioned filenames) |
| Member profile photos | Yes | 1 hour (cache-busted on update) |
| Generated PDFs (contracts, invoices) | No | Not cached (authenticated access) |
| Entity branding assets (logos, themes) | Yes | 1 day |
| API responses | No | Dynamic, not cacheable |

Offline Mode

Mobile apps support offline operation for critical workflows:

| Workflow | Offline Support | Sync Strategy |
| --- | --- | --- |
| Check-in | Full (cached member list) | Background sync when online |
| Attendance marking | Full (local storage) | Queue + sync on reconnect |
| Member lookup | Partial (basic profile data) | Stale data with "offline" indicator |
| Payment recording | Queue only (no processing) | Process on reconnect |
| Course booking | Not available offline | Requires server validation |

Scalability

Horizontal Scaling

| Component | Scaling Trigger | Min | Max |
| --- | --- | --- | --- |
| API pods | CPU > 70% or requests > 100/s/pod | 2 | 10 |
| Worker pods | Queue depth > 100 messages | 1 | 3 |
| Database read replicas | Read query latency > 100ms | 0 | 2 |
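
The API-pod row translates directly into a standard autoscaling/v2 HorizontalPodAutoscaler; the CPU trigger maps one-to-one, while the requests-per-second trigger would additionally need a custom metrics adapter. A sketch with illustrative names:

```yaml
# Sketch: HPA for the API pods (CPU trigger only; the requests/s trigger
# requires a custom/external metrics pipeline not shown here).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: membership-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: membership-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # matches the "CPU > 70%" trigger above
```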

Multi-Region Readiness

The initial deployment targets Hetzner Cloud Nuremberg (nbg1) for Germany/Austria, with disaster recovery in Falkenstein (fsn1). The architecture supports multi-region expansion:

  • Data residency: PostgreSQL replication to regional instances. Tenant data pinned to a specific region.
  • CDN: Edge locations across Europe for static assets.
  • DNS routing: GeoDNS to route users to nearest API cluster.
  • Compliance: Per-region configuration for tax rules, payment methods, and data protection requirements.

Regional expansion is not required for v1.0 but the architecture avoids decisions that would prevent it (e.g., no region-specific hardcoding, all timestamps in UTC, locale-aware formatting at presentation layer).