# Azure Kubernetes Node Addition Runbook

## Overview
This runbook provides step-by-step instructions for adding new Azure Virtual Machines to an existing Kubernetes cluster installed via Kubespray.

## Prerequisites
- Access to Azure CLI with appropriate permissions
- SSH access to the new VM
- Access to the existing Kubernetes cluster
- Kubespray installation directory

## Pre-Installation Checklist

### 1. Verify New VM Details
```bash
# Get VM details from Azure
# --show-details is required: publicIps and privateIps are only returned with it
az vm show --show-details --resource-group <RESOURCE_GROUP> --name <VM_NAME> --query "{name:name,ip:publicIps,privateIp:privateIps}" -o table
```

### 2. Verify SSH Access
```bash
# Test SSH connection to the new VM
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
# You will be prompted for the password
```

### 3. Verify Network Connectivity
```bash
# From the new VM, test connectivity to the existing cluster
# (ICMP only; -c 4 stops after four packets)
ping -c 4 <EXISTING_MASTER_IP>
```

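`ping` only confirms ICMP reachability, so a TCP-level check can be a useful complement. The following is a minimal sketch, not part of the original procedure: the function names `check_port` and `check_cluster_ports` are hypothetical, and the ports in the usage note (6443 for the API server, 10250 for the kubelet) are common defaults that are assumed here and may differ in your cluster.

```shell
#!/usr/bin/env bash
# Sketch: check TCP reachability of cluster ports from the new VM.
# Function names and the port list are assumptions, not part of the runbook.

check_port() {
  # Succeeds if a TCP connection to host $1, port $2 opens within 3 seconds.
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

check_cluster_ports() {
  # $1 = host, remaining arguments = ports.
  # Prints one line per port and returns non-zero if any port is unreachable.
  local host="$1"; shift
  local port failed=0
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
      failed=1
    fi
  done
  return "$failed"
}
```

Run from the new VM, for example, as `check_cluster_ports <EXISTING_MASTER_IP> 6443 10250`. A `FAIL` line usually points at a network security group or firewall rule rather than at the VM itself.
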
## Step-by-Step Process

### Step 1: Update Ansible Inventory

1. **Navigate to Kubespray directory**
```bash
cd freeleaps-ops/3rd/kubespray
```

2. **Edit the inventory file**
```bash
vim ../cluster/ansible/manifests/inventory.ini
```

3. **Add the new node to the appropriate group**

For a worker node:
```ini
[kube_node]
# Existing nodes...
prod-usw2-k8s-freeleaps-worker-nodes-06 ansible_host=<NEW_VM_PRIVATE_IP> ansible_user=wwwadmin@mathmast.com host_name=prod-usw2-k8s-freeleaps-worker-nodes-06
```

For a master node:
```ini
[kube_control_plane]
# Existing nodes...
prod-usw2-k8s-freeleaps-master-03 ansible_host=<NEW_VM_PRIVATE_IP> ansible_user=wwwadmin@mathmast.com etcd_member_name=freeleaps-etcd-03 host_name=prod-usw2-k8s-freeleaps-master-03
```

### Step 2: Verify Inventory Configuration

1. **Check inventory syntax**
```bash
ansible-inventory -i ../cluster/ansible/manifests/inventory.ini --list
```

2. **Test connectivity to new node**
```bash
# Pings every host in kube_node; replace kube_node with <NEW_NODE_NAME> to test only the new node
ansible -i ../cluster/ansible/manifests/inventory.ini kube_node -m ping -kK
```

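`ansible-inventory --list` confirms that the file parses, but not that the new host landed in the intended group. The sketch below checks group membership directly in the INI file. The function name `host_in_group` is hypothetical, and it assumes the simple layout shown in Step 1: one host per line, with the host name as the first field.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a host is listed under a given group of an INI inventory.
# host_in_group is a hypothetical helper, not an Ansible command.

host_in_group() {
  # $1 = inventory file, $2 = group name (without brackets), $3 = host name
  awk -v group="[$2]" -v host="$3" '
    /^\[/ { in_group = ($1 == group) }     # a section header switches the current group
    in_group && $1 == host { found = 1 }   # the host name is the first field on its line
    END { exit found ? 0 : 1 }
  ' "$1"
}
```

For example, `host_in_group ../cluster/ansible/manifests/inventory.ini kube_node prod-usw2-k8s-freeleaps-worker-nodes-06 && echo "listed"` prints `listed` only when the worker entry sits under `[kube_node]`.
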
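### Step 3: Run Kubespray Scale Playbook

1. **Execute the scale playbook**
```bash
# Run from the Kubespray directory (Step 1); verify that the relative path
# to scale.yml resolves from the manifests directory in your checkout
cd ../cluster/ansible/manifests
ansible-playbook -i inventory.ini ../../3rd/kubespray/scale.yml -kK -b
```

**Note**:
- `-k` prompts for the SSH password (Ansible requires `sshpass` on the control machine for password-based SSH)
- `-K` prompts for the sudo password
- `-b` enables privilege escalation
- Kubespray's documentation describes `scale.yml` as the playbook for adding worker nodes; for a new control plane node it directs you to run `cluster.yml` instead
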
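Kubespray's node-management documentation suggests passing `--limit=<NODE_NAME>` to `scale.yml` so that the run touches only the new node. The sketch below assembles such a command and prints it for review instead of executing it. The helper name `scale_cmd` is hypothetical, and the flags other than `--limit` are simply the ones used in this step.

```shell
#!/usr/bin/env bash
# Sketch: print a scale.yml invocation restricted to one node.
# scale_cmd is a hypothetical helper; it only prints, it does not run Ansible.

scale_cmd() {
  # $1 = inventory path, $2 = path to scale.yml, $3 = new node name
  printf 'ansible-playbook -i %s %s -kK -b --limit=%s\n' "$1" "$2" "$3"
}
```

For example, `scale_cmd inventory.ini ../../3rd/kubespray/scale.yml <NEW_NODE_NAME>` prints the command to copy into the terminal. The same documentation recommends refreshing the facts cache for all nodes before a limited run; check the Kubespray version in your checkout for the exact playbook.
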
### Step 4: Verify Node Addition

1. **Check node status**
```bash
kubectl get nodes
```

2. **Verify node is ready**
```bash
kubectl describe node <NEW_NODE_NAME>
```

3. **Check node labels**
```bash
kubectl get nodes --show-labels
```

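The checks above are read by eye; for scripted verification, the node status can be extracted from the `kubectl get nodes` output. This is a minimal sketch under the assumption that the output has the usual column order (NAME first, STATUS second). The helper names `node_status` and `node_is_ready` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a node is Ready from `kubectl get nodes --no-headers` output.
# Both helpers read that output on stdin; their names are hypothetical.

node_status() {
  # Prints the STATUS column for the node named $1 (nothing if the node is absent).
  awk -v node="$1" '$1 == node { print $2 }'
}

node_is_ready() {
  # Succeeds when the status starts with "Ready",
  # which also covers "Ready,SchedulingDisabled" on a cordoned node.
  case "$(node_status "$1")" in
    Ready*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

Usage: `kubectl get nodes --no-headers | node_is_ready <NEW_NODE_NAME> && echo "node is Ready"`. To block until the node becomes Ready, `kubectl wait --for=condition=Ready node/<NEW_NODE_NAME> --timeout=300s` is an alternative that needs no parsing.
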
### Step 5: Post-Installation Verification

1. **Test pod scheduling**
```bash
# Create a test pod pinned to the new node (assumes the node's kubernetes.io/hostname label matches its name)
kubectl run test-pod --image=nginx --restart=Never --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/hostname":"<NEW_NODE_NAME>"}}}'
kubectl get pod test-pod -o wide
# Remove the test pod afterwards
kubectl delete pod test-pod
```

2. **Check node resources**
```bash
# Requires metrics-server (or another Metrics API provider) in the cluster
kubectl top nodes
```

3. **Verify node components**
```bash
kubectl get pods -n kube-system -o wide | grep <NEW_NODE_NAME>
```

## Troubleshooting

### Common Issues

#### 1. SSH Connection Failed
```bash
# Verify VM is running (powerState is only returned with --show-details)
az vm show --show-details --resource-group <RESOURCE_GROUP> --name <VM_NAME> --query "powerState"

# Check network security groups
az network nsg rule list --resource-group <RESOURCE_GROUP> --nsg-name <NSG_NAME>
```

#### 2. Ansible Connection Failed
```bash
# Test with verbose output
ansible -i ../cluster/ansible/manifests/inventory.ini kube_node -m ping -kK -vvv
```

#### 3. Node Not Ready
```bash
# Check node conditions
kubectl describe node <NEW_NODE_NAME>

# Check kubelet logs (kubelet runs as a systemd service on the node, not as a pod)
ssh -t wwwadmin@mathmast.com@<VM_PRIVATE_IP> 'sudo journalctl -u kubelet --no-pager -n 100'
```

#### 4. Pod Scheduling Issues
```bash
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check node capacity
kubectl describe node <NEW_NODE_NAME> | grep -A 10 "Capacity"
```

### Recovery Procedures

#### If Scale Playbook Fails
1. **Clean up the failed node**
```bash
kubectl delete node <NEW_NODE_NAME>
```

2. **Reset the VM**
```bash
# Restart the VM (a reboot only; it does not remove anything the playbook already installed)
az vm restart --resource-group <RESOURCE_GROUP> --name <VM_NAME>
```

3. **Retry the scale playbook**
```bash
# Run from the same directory as in Step 3
ansible-playbook -i inventory.ini ../../3rd/kubespray/scale.yml -kK -b
```

#### If Node is Stuck in NotReady State
1. **Check kubelet service**
```bash
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
sudo systemctl status kubelet
```

2. **Restart kubelet**
```bash
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
sudo systemctl restart kubelet
```

## Security Considerations

### 1. Network Security
- Ensure the new VM is in the correct subnet
- Verify network security group rules allow cluster communication
- Check firewall rules if applicable

### 2. Access Control
- Use SSH key-based authentication when possible
- Limit sudo access to necessary commands
- Monitor node access logs

### 3. Compliance
- Ensure the new node meets security requirements
- Verify all required security patches are applied
- Check compliance with organizational policies

## Monitoring and Maintenance

### 1. Node Health Monitoring
```bash
# Check the new node's status and resource usage (kubectl top requires metrics-server)
kubectl get nodes -o wide
kubectl top nodes
```

### 2. Resource Monitoring
```bash
# Monitor resource usage
kubectl describe node <NEW_NODE_NAME> | grep -A 5 "Allocated resources"
```

### 3. Log Monitoring
```bash
# Follow kubelet logs (kubelet runs as a systemd service on the node, not as a pod)
ssh -t wwwadmin@mathmast.com@<VM_PRIVATE_IP> 'sudo journalctl -u kubelet -n 100 -f'
```

## Rollback Procedures

### If Node Addition Causes Issues

1. **Cordon the node**
```bash
kubectl cordon <NEW_NODE_NAME>
```

2. **Drain the node**
```bash
kubectl drain <NEW_NODE_NAME> --ignore-daemonsets --delete-emptydir-data
```

3. **Remove the node**
```bash
kubectl delete node <NEW_NODE_NAME>
```

4. **Update inventory**
```bash
# Remove the node from inventory.ini
vim ../cluster/ansible/manifests/inventory.ini
```

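The first three rollback steps must run in order, and a failed drain should never be followed by a delete. One way to enforce that is a small wrapper that chains the commands and defaults to a dry run. This is a sketch only: `run`, `rollback_node`, and the `DRY_RUN` variable are hypothetical names, and the `kubectl` commands are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: chain cordon -> drain -> delete, stopping at the first failure.
# Set DRY_RUN=0 to execute; the default only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command, and execute it only when DRY_RUN is 0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

rollback_node() {
  # $1 = node name. Each step runs only if the previous one succeeded.
  run kubectl cordon "$1" &&
  run kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data &&
  run kubectl delete node "$1"
}
```

Calling `rollback_node <NEW_NODE_NAME>` prints the three commands for review; `DRY_RUN=0` runs them. The inventory edit in step 4 remains a manual step. Kubespray also ships a `remove-node.yml` playbook for removing nodes; consult its documentation to see whether it suits your rollback.
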
## Documentation

### Required Information
- VM name and IP address
- Resource group and subscription
- Node role (worker/master)
- Date and time of addition
- Person performing the addition

### Post-Addition Checklist
- [ ] Node appears in `kubectl get nodes`
- [ ] Node status is Ready
- [ ] Pods can be scheduled on the node
- [ ] All node components are running
- [ ] Monitoring is configured
- [ ] Documentation is updated

## Emergency Contacts

- **Infrastructure Team**: [Contact Information]
- **Kubernetes Administrators**: [Contact Information]
- **Azure Support**: [Contact Information]

---

**Last Updated**: [Date]
**Version**: 1.0
**Author**: [Name]