Infrastructure and Deployment

Overview

The Membership platform is designed for cloud-native deployment on Hetzner Cloud (Kubernetes) while maintaining the option for on-premises installation. The infrastructure supports a small team (4 people) by maximizing automation: infrastructure as code, CI/CD pipelines, automated testing, and zero-downtime deployments. The initial target market is Germany and Austria, with the architecture ready for multi-region European expansion.


Deployment Architecture

Cloud-Native (Primary)

graph TB
    subgraph "Internet"
        USERS[Users / Mobile Apps / Web]
    end
    subgraph "Edge"
        CDN[CDN - Cloudflare / CloudFront]
        WAF[WAF - Web Application Firewall]
    end
    subgraph "Kubernetes Cluster"
        ING[Ingress Controller - nginx]
        subgraph "Application Pods"
            API1[API Pod 1]
            API2[API Pod 2]
            API3[API Pod N]
            WORKER[Worker Pod - Billing / Import]
            SCHEDULER[Scheduler Pod - Cron Jobs]
        end
        subgraph "Infrastructure Pods"
            REDIS[Redis - Cache / Sessions]
            MQ[RabbitMQ - Messaging]
        end
    end
    subgraph "Managed Services"
        DB[(PostgreSQL - Managed)]
        S3[(Object Storage - Hetzner S3)]
        VAULT[Secrets - Sealed Secrets]
        VW[Vaultwarden - Team Credentials]
        MON[Monitoring - Prometheus/Grafana + Icinga]
        LOG[Logging - Loki/ELK]
    end
    USERS --> CDN --> WAF --> ING
    ING --> API1
    ING --> API2
    ING --> API3
    API1 --> DB
    API1 --> REDIS
    API1 --> MQ
    API1 --> S3
    API1 --> VAULT
    WORKER --> DB
    WORKER --> MQ
    SCHEDULER --> DB
    SCHEDULER --> MQ
    API1 --> MON
    API1 --> LOG

Container Architecture

Each component runs in its own Docker container:

| Container | Base Image | Purpose | Replicas |
| --- | --- | --- | --- |
| membership-api | eclipse-temurin:25-jre-alpine | REST API server | 2+ (HPA) |
| membership-worker | eclipse-temurin:25-jre-alpine | Background jobs (billing, import, sync) | 1-2 |
| membership-scheduler | eclipse-temurin:25-jre-alpine | Cron jobs (billing cycle, status sync) | 1 (leader election) |
| membership-migrate | flyway/flyway:latest | Database migrations (init container) | 1 (run-to-completion) |
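
The run-to-completion migration container can be wired in as a Kubernetes init container ahead of the API container. A minimal sketch, assuming a `db-credentials` Secret that carries the Flyway connection settings (all names and image tags here are illustrative, not the project's actual manifests):

```yaml
# Sketch: run Flyway migrations to completion before the API starts.
# The "db-credentials" Secret and image tags are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: membership-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: membership-api
  template:
    metadata:
      labels:
        app: membership-api
    spec:
      initContainers:
        - name: membership-migrate
          image: flyway/flyway:latest
          args: ["migrate"]            # pod starts the API only after this exits 0
          envFrom:
            - secretRef:
                name: db-credentials   # FLYWAY_URL, FLYWAY_USER, FLYWAY_PASSWORD
      containers:
        - name: membership-api
          image: registry.example.com/membership-api:latest
          ports:
            - containerPort: 8080
```

Because init containers block pod startup, a failed migration keeps the old API pods serving traffic instead of rolling out a broken schema.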

On-Premises Option

For organizations that require data sovereignty or have existing infrastructure:

| Component | On-Premises Alternative |
| --- | --- |
| Kubernetes | Docker Compose or single-node K3s |
| Managed PostgreSQL | Self-hosted PostgreSQL 18 |
| Object Storage | MinIO (S3-compatible) |
| Secrets Manager | HashiCorp Vault |
| CDN | nginx with caching |
| Monitoring | Self-hosted Prometheus + Grafana stack |

A single Docker Compose file is provided for small deployments (< 500 members, single organization). For larger deployments, the Kubernetes manifests are the recommended path.

Hetzner Cloud Specifications

The primary production deployment uses Hetzner Cloud, selected for its German data center locations, competitive pricing, and GDPR compliance (data stays in Germany).

| Component | Hetzner Product | Specification | Monthly Cost |
| --- | --- | --- | --- |
| App Nodes (3x) | CX32 | 4 vCPU, 8 GB RAM, 80 GB NVMe | 3 x EUR 15.59 = EUR 46.77 |
| Infra Nodes (2x) | CX22 | 2 vCPU, 4 GB RAM, 40 GB NVMe | 2 x EUR 5.39 = EUR 10.78 |
| Database | Managed PostgreSQL (CPX21) | 3 vCPU, 4 GB RAM, 80 GB NVMe, PITR | EUR 17.85 |
| Load Balancer | LB11 | Up to 25 targets, TLS termination | EUR 6.41 |
| Object Storage | Hetzner S3 | ~100 GB (documents, backups, uploads) | ~EUR 3.00 |
| Floating IPs | 2x IPv4 | Ingress + Admin access | 2 x EUR 5.05 = EUR 10.10 |
| Cloud Network | Private VLAN | 10.0.0.0/16 (3 subnets) | EUR 0.00 |
| Total Hetzner | | | ~EUR 94.91 |
| Cloudflare | Free Plan | DNS, CDN, WAF, DDoS protection | EUR 0.00 |
| Domains | .com + .de | 2 domains | ~EUR 3.00 |
| Total Infrastructure | | | ~EUR 100/month |

Locations:

  • Primary: Nuremberg (nbg1) -- all production workloads
  • DR: Falkenstein (fsn1) -- backup replication, failover target
  • RTO: 4 hours | RPO: 1 hour


CI/CD Pipeline

Pipeline Architecture

graph LR
    DEV[Developer Push] --> GIT[GitLab]
    GIT --> CI[CI Pipeline]
    subgraph "CI Pipeline"
        BUILD[Build + Unit Tests]
        LINT[Code Quality - SonarQube]
        SEC[Security Scan - Trivy/Snyk]
        INT[Integration Tests]
        IMG[Build Docker Image]
    end
    CI --> REG[Container Registry]
    REG --> CD[CD Pipeline]
    subgraph "CD Pipeline"
        DEV_ENV[Deploy to Dev]
        SMOKE[Smoke Tests]
        STG[Deploy to Staging]
        E2E[E2E Tests]
        PROD[Deploy to Production]
    end
    CD --> PROD

Pipeline Stages

| Stage | Tool | Duration Target | Gate |
| --- | --- | --- | --- |
| Compile | Maven | < 2 min | Compilation success |
| Unit Tests | JUnit 5 + JaCoCo | < 5 min | All pass, coverage > 80% |
| Code Quality | SonarQube | < 3 min | No new Critical/Blocker issues |
| Security Scan | Trivy (container) + Snyk (dependencies) | < 3 min | No Critical CVEs |
| Integration Tests | Testcontainers (PostgreSQL, Redis, RabbitMQ) | < 10 min | All pass |
| Build Image | Docker BuildKit | < 3 min | Image builds successfully |
| Deploy Dev | Kubernetes (kubectl/Helm) | < 2 min | Pods healthy |
| Smoke Tests | REST Assured | < 2 min | Health + key endpoints respond |
| Deploy Staging | Kubernetes (Helm) | < 2 min | Pods healthy |
| E2E Tests | Playwright (Flutter Web) | < 15 min | All pass |
| Deploy Production | Kubernetes (Helm, canary) | < 5 min | Canary metrics healthy |

Total pipeline target: < 50 minutes from push to production.
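
A sketch of how these stages could map onto a .gitlab-ci.yml (job names, images, and script details are illustrative, not the project's actual pipeline):

```yaml
# Sketch of a GitLab CI layout matching the stage table above.
# Images, profiles, and chart paths are assumptions.
stages: [build, quality, integration, package, deploy]

build:
  stage: build
  image: maven:3.9            # JDK version pinned in the real pipeline
  script:
    - mvn -B verify           # compile + unit tests with JaCoCo coverage gate

security_scan:
  stage: quality
  script:
    - trivy image "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

integration_tests:
  stage: integration
  script:
    - mvn -B verify -Pintegration   # Testcontainers: PostgreSQL, Redis, RabbitMQ

package:
  stage: package
  script:
    - docker buildx build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" --push .

deploy_dev:
  stage: deploy
  environment: dev
  only: [develop]
  script:
    - helm upgrade --install membership ./chart --set image.tag="$CI_COMMIT_SHA"
```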

Deployment Strategy

  • Development: Direct deployment on every merge to develop branch
  • Staging: Automatic deployment on every merge to main branch
  • Production: Canary deployment (10% traffic for 15 minutes, automatic rollback on error rate > 1%)
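
With the nginx ingress controller from the architecture diagram, the 10% canary split can be sketched as a second Ingress carrying canary annotations (host, service, and Ingress names are assumptions; a rollout controller such as Argo Rollouts or Flagger could instead manage the weight shift and automatic rollback):

```yaml
# Sketch: route 10% of traffic to the canary release via ingress-nginx.
# Names are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: membership-api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% of requests
spec:
  ingressClassName: nginx
  rules:
    - host: api.membership-one.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: membership-api-canary
                port:
                  number: 8080
```

Deleting the canary Ingress (or setting canary-weight to 0) returns all traffic to the stable release.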

Rollback

  • Automated: If the canary shows elevated error rates or latency, the rollout automation reverts traffic to the previous version
  • Manual: helm rollback membership <revision> for instant rollback to any previous release
  • Database: Flyway migrations are forward-only. Rollback scripts are pre-written for each migration and tested in CI.

Docker Compose (Local Development)

# docker-compose.yml (for local development)
services:
  postgres:
    image: postgres:18-alpine
    environment:
      POSTGRES_DB: membership
      POSTGRES_USER: membership
      POSTGRES_PASSWORD: ${DB_PASSWORD:-devpass}
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./db/init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U membership"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  rabbitmq:
    image: rabbitmq:4-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RMQ_USER:-guest}
      RABBITMQ_DEFAULT_PASS: ${RMQ_PASS:-guest}

volumes:
  postgres-data:

Monitoring and Observability

Metrics (Prometheus + Grafana)

| Category | Metrics |
| --- | --- |
| Application | Request rate, error rate, latency (p50/p95/p99), active sessions, JVM heap, GC pauses |
| Business | Active members, new registrations/day, check-ins/hour, billing success rate, open debt total |
| Infrastructure | CPU, memory, disk I/O, network, pod restarts, database connections |
| External | Cash360 API latency, Cash360 error rate, email delivery rate |

Infrastructure Monitoring (Icinga)

In addition to Prometheus/Grafana for application and business metrics, Icinga provides classical infrastructure monitoring with a focus on external service checks and SSL/TLS certificate oversight:

| Check | Frequency | Alert Threshold | Action |
| --- | --- | --- | --- |
| SSL/TLS certificate expiry (all domains) | Every 6 hours | < 14 days remaining | P2 -- trigger Dehydrated renewal |
| SSL certificate chain validity | Daily | Invalid chain or weak cipher | P1 -- investigate immediately |
| External HTTP endpoint availability | Every 1 min | 3 consecutive failures | P1 -- DNS failover, investigate |
| DNS resolution (membership-one.com) | Every 5 min | Resolution failure | P1 -- check Cloudflare |
| SMTP delivery (outbound email) | Every 15 min | Delivery failure | P2 -- check SMTP provider |
| Cash360 API reachability | Every 2 min | 3 consecutive failures | P1 -- queue locally, escalate |
| Hetzner API status | Every 5 min | API errors | P2 -- check Hetzner status page |
| PostgreSQL replication lag (to fsn1) | Every 5 min | Lag > 60 seconds | P2 -- investigate network/load |

Icinga deployment: Self-hosted on Hetzner infra node (Docker container). Checks run from outside the Kubernetes cluster to provide an independent monitoring perspective. Integrates with Grafana for unified dashboards and with Telegram/email for alerting.

Monitoring layers: Prometheus handles application metrics and pod-level health (inside the cluster). Icinga handles infrastructure checks and external endpoint monitoring (outside the cluster). Together they provide defense-in-depth observability.

TLS Certificate Management (Dehydrated)

All HTTPS/TLS certificates for origin servers (Hetzner Load Balancer, internal services) are managed via Dehydrated, a lightweight ACME shell client for Let's Encrypt:

| Domain | Certificate Source | Renewal Method |
| --- | --- | --- |
| *.membership-one.com (edge) | Cloudflare (automatic) | Managed by Cloudflare |
| *.membership-one.com (origin) | Let's Encrypt via Dehydrated | ACME DNS-01 challenge (Cloudflare DNS API) |
| *.membership.internal (internal) | Let's Encrypt via Dehydrated | ACME DNS-01 challenge |
| GitLab, Vaultwarden, monitoring UIs | Let's Encrypt via Dehydrated | ACME HTTP-01 or DNS-01 |

Dehydrated configuration:

  • Runs as a cron job (daily check, renews 30 days before expiry)
  • DNS-01 challenge via Cloudflare API hook (supports wildcard certificates)
  • Certificates deployed to the Hetzner Load Balancer via Hetzner API post-hook
  • Kubernetes Ingress certificates updated via kubectl post-hook
  • All renewals logged and monitored by Icinga (certificate expiry check)
  • Failure alerts sent via email and Telegram

Why Dehydrated over cert-manager: Dehydrated is simpler to operate for a small team, works both inside and outside Kubernetes, and the team already has operational experience with it. cert-manager remains a viable alternative for pure Kubernetes environments.

Alerting Rules

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| High error rate | HTTP 5xx > 5% for 5 min | Critical | Page on-call |
| API latency | p95 > 2s for 10 min | Warning | Investigate |
| Database connections | Pool utilization > 80% | Warning | Scale or investigate |
| Billing failure rate | > 10% failures in billing cycle | Critical | Page on-call |
| Cash360 unreachable | No successful response for 5 min | Critical | Queue locally, page |
| Disk usage | > 85% on any volume | Warning | Expand or clean up |
| Certificate expiry | < 14 days to expiry (Icinga check) | Warning | Dehydrated auto-renews; investigate if renewal fails |
| Pod restart loop | > 3 restarts in 10 min | Critical | Investigate |
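
As one concrete example, the first row can be written as a Prometheus alerting rule. The metric name below is the Micrometer default for Spring Boot HTTP servers; treating it (and its labels) as present in this deployment is an assumption:

```yaml
# Sketch: "High error rate" alert as a Prometheus rule.
# http_server_requests_seconds_count is the Micrometer default metric;
# label names may differ in the real deployment.
groups:
  - name: membership-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate above 5% for 5 minutes"
```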

Logging

  • Structured JSON logging via SLF4J + Logback
  • Log aggregation via Loki (Grafana stack) or ELK (Elasticsearch, Logstash, Kibana)
  • Correlation IDs: Every request gets a unique X-Request-Id header, propagated through all service calls and logged in every log line
  • PII redaction: Automatic log filter that masks email addresses, IBANs, and phone numbers in log output
  • Retention: Application logs retained for 30 days, audit logs retained for 7 years

Health Checks

The application exposes Spring Actuator health endpoints:

| Endpoint | Purpose |
| --- | --- |
| /actuator/health/liveness | Is the process alive? (Kubernetes liveness probe) |
| /actuator/health/readiness | Can the process handle requests? (Kubernetes readiness probe) |
| /actuator/health | Full health status (DB, Redis, RabbitMQ, Cash360 connectivity) |
| /actuator/prometheus | Metrics scrape endpoint for Prometheus |
| /actuator/info | Application version, build time, git commit |
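
A minimal application.yml exposing these endpoints could look like the following sketch; these are standard Spring Boot Actuator properties, though the project's actual file may differ:

```yaml
# Sketch: enable Kubernetes probes and expose health/info/prometheus.
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      probes:
        enabled: true          # adds /actuator/health/liveness and /readiness
      show-details: when-authorized
```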

Database Infrastructure

PostgreSQL Configuration

| Parameter | Development | Staging | Production |
| --- | --- | --- | --- |
| Version | 18 (Alpine) | 18 (Managed) | 18 (Managed, HA) |
| Instance | Docker container | Single node | Primary + read replica |
| Storage | Local volume | 50 GB SSD | 200 GB SSD, auto-expand |
| Max connections | 20 | 50 | 200 |
| Backups | None | Daily | Continuous WAL archiving (PITR) |
| Encryption | None | At rest (AES-256) | At rest + in transit |

Flyway Migrations

Database schema changes are managed exclusively through Flyway migrations:

  • Location: src/main/resources/db/migration/
  • Naming: V{version}__{description}.sql (e.g., V001__initial_schema.sql)
  • Execution: Automatically on application startup (dev/staging) or via init container (production)
  • Validation: flyway.validateOnMigrate=true ensures schema consistency
  • Rollback scripts: Every migration V{n}__*.sql has a corresponding U{n}__*.sql undo script (Flyway's undo naming; the R__ prefix is reserved for repeatable migrations), tested in CI
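
In a Spring Boot setup these conventions map onto standard spring.flyway properties. A sketch with illustrative values:

```yaml
# Sketch: Flyway settings in application.yml (dev/staging startup migration).
spring:
  flyway:
    enabled: true
    locations: classpath:db/migration
    validate-on-migrate: true    # fail fast if an applied migration was altered
    baseline-on-migrate: false   # schema is always created from V001 onward
```

In production this block is disabled and the init container runs the same migrations instead.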

Redis Configuration

| Use Case | Configuration |
| --- | --- |
| Session storage | TTL = 30 minutes, serialization = JSON |
| Rate limiting | Sliding window counters, TTL = 1 minute |
| Caching | Entity settings: TTL = 5 minutes. Member data: not cached (GDPR) |
| Distributed locks | Redisson for billing cycle leader election |
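
The session and cache rows correspond roughly to the following Spring Boot properties (standard Spring Session / Spring Data Redis keys; a sketch, not the project's actual configuration):

```yaml
# Sketch: Redis-backed sessions and cache TTLs matching the table above.
spring:
  session:
    timeout: 30m                 # session TTL
    redis:
      namespace: membership:sessions
  data:
    redis:
      host: ${REDIS_HOST:localhost}
      port: 6379
  cache:
    type: redis
    redis:
      time-to-live: 5m           # entity settings only; member data is never cached
```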

Environment Configuration

Development Environments

| Environment | Purpose | URL | Deployment |
| --- | --- | --- | --- |
| Local | Developer machine | localhost:8080 | Docker Compose |
| Dev | Integration testing | dev.membership.internal | Auto-deploy on push to develop |
| Staging | Pre-production validation | staging.membership.internal | Auto-deploy on push to main |
| Production | Live system | api.membership-one.com | Canary deployment, manual approval |

Configuration Management

Configuration follows the 12-factor app methodology: all environment-specific values are injected via environment variables, never hardcoded.

| Configuration Source | Priority | Use Case |
| --- | --- | --- |
| Environment variables | Highest | Secrets, database URLs, external API keys |
| Kubernetes ConfigMaps | Medium | Feature flags, non-secret configuration |
| application.yml | Lowest | Default values, structure |

Secret management: Application secrets (database passwords, API keys, JWT private keys, encryption keys) are stored as Kubernetes Secrets, managed via Bitnami Sealed Secrets (the encrypted SealedSecret resources are safe to commit to Git; the in-cluster controller decrypts them), and mounted as environment variables into pods at startup. Team-accessed credentials (admin panels, third-party portals, infrastructure access) are stored in Vaultwarden (self-hosted Bitwarden-compatible password manager). For on-premises deployments, HashiCorp Vault serves both purposes. See Chapter 13 for the full credential management model.
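
The priority order above maps onto Kubernetes in a straightforward way. A sketch with illustrative names (the real ConfigMap and Secret names may differ):

```yaml
# Sketch: non-secret configuration via ConfigMap; secrets arrive as a
# Kubernetes Secret unsealed by the Sealed Secrets controller.
apiVersion: v1
kind: ConfigMap
metadata:
  name: membership-config
data:
  FEATURE_COURSE_BOOKING: "true"   # feature flag (medium priority)
  LOG_LEVEL: "INFO"
# In the Deployment's pod spec, both sources are injected as environment
# variables, which override application.yml defaults:
#
#   containers:
#     - name: membership-api
#       envFrom:
#         - configMapRef:
#             name: membership-config
#         - secretRef:
#             name: membership-secrets   # decrypted SealedSecret
```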


CDN and Static Assets

CDN Strategy

| Content Type | CDN | Caching TTL |
| --- | --- | --- |
| Flutter web assets (JS, CSS, WASM) | Yes | 1 year (versioned filenames) |
| Member profile photos | Yes | 1 hour (cache-busted on update) |
| Generated PDFs (contracts, invoices) | No | Not cached (authenticated access) |
| Entity branding assets (logos, themes) | Yes | 1 day |
| API responses | No | Dynamic, not cacheable |

Offline Mode

Mobile apps support offline operation for critical workflows:

| Workflow | Offline Support | Sync Strategy |
| --- | --- | --- |
| Check-in | Full (cached member list) | Background sync when online |
| Attendance marking | Full (local storage) | Queue + sync on reconnect |
| Member lookup | Partial (basic profile data) | Stale data with "offline" indicator |
| Payment recording | Queue only (no processing) | Process on reconnect |
| Course booking | Not available offline | Requires server validation |

Scalability

Horizontal Scaling

| Component | Scaling Trigger | Min | Max |
| --- | --- | --- | --- |
| API pods | CPU > 70% or requests > 100/s/pod | 2 | 10 |
| Worker pods | Queue depth > 100 messages | 1 | 3 |
| Database read replicas | Read query latency > 100ms | 0 | 2 |
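
The API-pod row translates directly into a standard autoscaling/v2 HorizontalPodAutoscaler; the CPU trigger maps one-to-one, while the requests-per-second trigger would additionally need a custom metrics adapter. A sketch with illustrative names:

```yaml
# Sketch: HPA for the API pods (CPU trigger only; the requests/s trigger
# requires a custom/external metrics pipeline not shown here).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: membership-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: membership-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # matches the "CPU > 70%" trigger above
```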

Multi-Region Readiness

The initial deployment targets Hetzner Cloud Nuremberg (nbg1) for Germany/Austria, with disaster recovery in Falkenstein (fsn1). The architecture supports multi-region expansion:

  • Data residency: PostgreSQL replication to regional instances. Tenant data pinned to a specific region.
  • CDN: Edge locations across Europe for static assets.
  • DNS routing: GeoDNS to route users to nearest API cluster.
  • Compliance: Per-region configuration for tax rules, payment methods, and data protection requirements.

Regional expansion is not required for v1.0 but the architecture avoids decisions that would prevent it (e.g., no region-specific hardcoding, all timestamps in UTC, locale-aware formatting at presentation layer).