Flink High Availability Cluster Deployment
Overview
This project uses the Apache Flink Kubernetes Operator to deploy a high-availability Flink cluster with persistent storage and automatic failover.
Component Architecture
- JobManager: 2 replicas with high availability configuration
- TaskManager: 3 replicas for distributed processing
- High Availability: Kubernetes-based HA with persistent storage
- Checkpointing: Persistent checkpoint and savepoint storage (a minimal deployment manifest is sketched below)
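The moving parts above are declared in a single FlinkDeployment custom resource that the operator reconciles. The following is a minimal sketch only, assuming Flink 1.17, standalone mode, and the resource figures listed under Configuration Details; the image tag, ServiceAccount name, and mount path are illustrative rather than copied from the actual manifests.

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: ha-flink-cluster-v2
  namespace: freeleaps-data-platform
spec:
  image: flink:1.17                        # illustrative image tag
  flinkVersion: v1_17
  mode: standalone                         # assumed: fixed TaskManager replica counts
  serviceAccount: flink                    # assumed ServiceAccount name
  flinkConfiguration:
    high-availability.type: kubernetes     # older Flink releases use the high-availability key
    high-availability.storageDir: file:///opt/flink/ha-data
  jobManager:
    replicas: 2
    resource:
      cpu: 0.5
      memory: "1024m"
  taskManager:
    replicas: 3
    resource:
      cpu: 0.5
      memory: "2048m"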
File Descriptions
1. flink-operator-v2.yaml
Flink Kubernetes Operator deployment configuration:
- Operator deployment in the flink-system namespace
- RBAC configuration for cluster-wide permissions
- Health checks and resource limits
- Enhanced CRD definitions with additional printer columns
2. flink-crd.yaml
Custom Resource Definitions for Flink:
- FlinkDeployment CRD
- FlinkSessionJob CRD
- Required for Flink Operator to function
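For orientation, a heavily trimmed sketch of the shape of such a CRD follows; the real definitions shipped with the operator are much larger, and the printer columns and status paths shown here are assumptions for illustration only.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: flinkdeployments.flink.apache.org
spec:
  group: flink.apache.org
  scope: Namespaced
  names:
    kind: FlinkDeployment
    singular: flinkdeployment
    plural: flinkdeployments
  versions:
    - name: v1beta1
      served: true
      storage: true
      additionalPrinterColumns:
        - name: JobStatus                  # assumed column
          type: string
          jsonPath: .status.jobStatus.state
        - name: LifecycleState             # assumed column
          type: string
          jsonPath: .status.lifecycleState
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
      subresources:
        status: {}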
3. ha-flink-cluster-v2.yaml
Production-ready HA Flink cluster configuration:
- 2 JobManager replicas with HA enabled
- 3 TaskManager replicas with anti-affinity rules
- Persistent storage for HA data, checkpoints, and savepoints
- Memory and CPU resource allocation
- Exponential delay restart strategy
- Proper volume mounts and storage configuration
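The persistence and restart behaviour in this list comes down to a handful of flinkConfiguration entries on the FlinkDeployment. A sketch of the relevant keys, assuming the persistent volumes are mounted under /opt/flink; the exponential-delay backoff values are illustrative, not values read from the file:

  flinkConfiguration:
    state.backend: filesystem                                  # 'hashmap' on newer Flink releases
    state.checkpoints.dir: file:///opt/flink/checkpoints
    state.savepoints.dir: file:///opt/flink/savepoints
    high-availability.storageDir: file:///opt/flink/ha-data
    restart-strategy: exponential-delay
    restart-strategy.exponential-delay.initial-backoff: 10s
    restart-strategy.exponential-delay.max-backoff: 2min
    restart-strategy.exponential-delay.backoff-multiplier: "2.0"
    restart-strategy.exponential-delay.reset-backoff-threshold: 10min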
4. simple-ha-flink-cluster.yaml
Simplified HA Flink cluster configuration:
- Uses ephemeral storage to avoid PVC binding issues
- Basic HA configuration with minimal resource requirements
- Recommended for development and testing (see the sketch below)
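One common way to avoid PVC binding issues is to back the HA and checkpoint paths with emptyDir volumes in the pod template. The sketch below shows that pattern as an assumption about how simple-ha-flink-cluster.yaml is wired, not a copy of it:

  podTemplate:
    spec:
      containers:
        - name: flink-main-container       # main container name used by the operator
          volumeMounts:
            - name: flink-ha
              mountPath: /tmp/flink/ha-data
            - name: flink-checkpoints
              mountPath: /tmp/flink/checkpoints
      volumes:
        - name: flink-ha
          emptyDir: {}                     # ephemeral: contents vanish with the pod
        - name: flink-checkpoints
          emptyDir: {}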
5. flink-storage.yaml
Storage and RBAC configuration:
- PersistentVolumeClaims for HA data, checkpoints, and savepoints
- ServiceAccount and RBAC permissions for Flink cluster
- Azure Disk storage class configuration with correct access modes
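A sketch of the kind of objects flink-storage.yaml defines, using the sizes, storage class, and access mode listed under Storage Configuration below; the PVC name and the StorageClass parameters are assumptions:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-disk-std-ssd-lrs
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS                 # assumed SKU for this class name
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-ha-data                      # assumed name; checkpoint/savepoint PVCs follow the same pattern at 20Gi
  namespace: freeleaps-data-platform
spec:
  accessModes: [ReadWriteOnce]             # Azure Disk only supports RWO
  storageClassName: azure-disk-std-ssd-lrs
  resources:
    requests:
      storage: 10Gi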
6. flink-rbac.yaml
Enhanced RBAC configuration:
- Complete permissions for Flink HA functionality
- Both namespace-level and cluster-level permissions
- Includes watch permissions for HA operations
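Flink's Kubernetes HA services store leader and job metadata in ConfigMaps, so the cluster's ServiceAccount needs ConfigMap permissions including watch. A minimal sketch, assuming a ServiceAccount named flink; the real file may grant more:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: flink
  namespace: freeleaps-data-platform
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha
  namespace: freeleaps-data-platform
rules:
  - apiGroups: [""]
    resources: [configmaps]
    verbs: [get, list, watch, create, update, patch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-ha
  namespace: freeleaps-data-platform
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flink-ha
subjects:
  - kind: ServiceAccount
    name: flink
    namespace: freeleaps-data-platform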
Deployment Steps
1. Install Flink Operator
# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml
# Verify operator installation
kubectl get pods -n flink-system
2. Create Storage Resources (Optional - for production)
# Apply storage configuration
kubectl apply -f flink-storage.yaml
# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform
3. Deploy HA Flink Cluster
# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml
# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml
# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
High Availability Features
- JobManager HA: 2 JobManager replicas with Kubernetes-based leader election
- Persistent State: Checkpoints and savepoints stored on persistent volumes
- Automatic Failover: Exponential delay restart strategy with backoff
- Pod Anti-affinity: Ensures components are distributed across different nodes
- Storage Persistence: HA data, checkpoints, and savepoints persist across restarts
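Anti-affinity is usually expressed through the FlinkDeployment pod template. A sketch, assuming the pods carry the app=flink label used by the kubectl commands elsewhere in this document:

  podTemplate:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: flink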
Network Configuration
- JobManager: Ports 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
- TaskManager: Ports 6121 (Data), 6122 (RPC), 6126 (Metrics)
- Service Type: ClusterIP for internal communication
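If these ports need to be pinned explicitly rather than left at image defaults, they map onto standard Flink options roughly as follows; this is a sketch, and the mapping of the metrics port in particular is an assumption:

  flinkConfiguration:
    rest.port: "8081"                      # JobManager Web UI / REST
    jobmanager.rpc.port: "6123"
    blob.server.port: "6124"
    taskmanager.data.port: "6121"
    taskmanager.rpc.port: "6122"
    metrics.internal.query-service.port: "6126"   # assumed mapping for the TaskManager metrics port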
Storage Configuration
- HA Data: 10Gi for high availability metadata
- Checkpoints: 20Gi for application checkpoints
- Savepoints: 20Gi for manual savepoints
- Storage Class: azure-disk-std-ssd-lrs
- Access Mode: ReadWriteOnce (Azure Disk limitation)
Monitoring and Operations
- Health Checks: Built-in readiness and liveness probes
- Web UI: Accessible through JobManager service
- Metrics: Exposed on port 8080 for Prometheus collection
- Logging: Centralized logging through Kubernetes
Configuration Details
High Availability Settings
- Type: kubernetes (native Kubernetes HA)
- Storage: Persistent volume for HA metadata
- Cluster ID: ha-flink-cluster-v2
Checkpointing Configuration
- Interval: 60 seconds
- Timeout: 10 minutes
- Min Pause: 5 seconds
- Backend: Filesystem with persistent storage
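Expressed as flinkConfiguration entries, these checkpointing settings correspond to standard Flink options; a sketch (the duration spellings in the actual file may differ):

  flinkConfiguration:
    execution.checkpointing.interval: 60s
    execution.checkpointing.timeout: 10min
    execution.checkpointing.min-pause: 5s
    state.backend: filesystem              # 'hashmap' on newer Flink releases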
Resource Allocation
- JobManager: 0.5 CPU / 1024 MB memory (HA cluster); 1.0 CPU / 1024 MB memory (simple cluster)
- TaskManager: 0.5 CPU / 2048 MB memory (HA cluster); 2.0 CPU / 2048 MB memory (simple cluster)
Troubleshooting
Common Issues and Solutions
1. PVC Binding Issues
# Check PVC status
kubectl get pvc -n freeleaps-data-platform
# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available
# Solution: Use ReadWriteOnce access mode or ephemeral storage
2. Pod CrashLoopBackOff
# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink
# Check pod logs
kubectl logs <pod-name> -n freeleaps-data-platform
# Check pod events
kubectl describe pod <pod-name> -n freeleaps-data-platform
3. ServiceAccount Issues
# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform
# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
4. Storage Path Issues
# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
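In other words, the directory URIs in flinkConfiguration must agree with the mountPath entries in the pod template. A sketch for the persistent-storage case, assuming a PVC named flink-checkpoints:

  flinkConfiguration:
    state.checkpoints.dir: file:///opt/flink/checkpoints    # must match the mountPath below
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
            - name: checkpoints
              mountPath: /opt/flink/checkpoints
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: flink-checkpoints                    # assumed PVC name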
Diagnostic Commands
# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator
# Check Flink cluster status
kubectl describe flinkdeployment <cluster-name> -n freeleaps-data-platform
# Check pod events
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'
# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc <pvc-name> -n freeleaps-data-platform
# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator
Important Notes
- Storage Limitations: Azure Disk storage class only supports ReadWriteOnce access mode
- ServiceAccount: Ensure the correct ServiceAccount is specified in cluster configuration
- Resource Requirements: Verify cluster has enough CPU/memory for all replicas
- Network Policies: May need adjustment for inter-pod communication
- Ephemeral vs Persistent: Use ephemeral storage for development/testing, persistent for production
Quick Start (Recommended for Testing)
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml
# 2. Wait for operator to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system
# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml
# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
Production Deployment
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml
# 2. Deploy storage resources
kubectl apply -f flink-storage.yaml
# 3. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml
# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink