Alibaba Cloud ACK supports both public and internal SLB.

I. Background

  • You have an ACK cluster.
  • Nginx Ingress Controller has been successfully deployed and bound to a public-facing SLB.

Note: Kubernetes clusters created via the Alibaba Cloud Container Service console automatically deploy an Nginx Ingress Controller during initialization, which is bound to a public SLB instance by default.

II. Configuration

1. Create an Internal SLB

In the Alibaba Cloud console, create an internal SLB and bind it to your VPC.

2. Configure Nginx Ingress Controller

# my-nginx-ingress-slb-intranet.yaml
# intranet nginx ingress slb service
apiVersion: v1
kind: Service
metadata:
  # Name the service as nginx-ingress-lb-intranet.
  name: nginx-ingress-lb-intranet
  namespace: kube-system
  labels:
    app: nginx-ingress-lb-intranet
  annotations:
    # Specify the SLB instance type as internal.
    service.beta.kubernetes.io/alicloud-loadbalancer-address-type: intranet
    # Replace with your internal SLB instance ID.
    service.beta.kubernetes.io/alicloud-loadbalancer-id: <YOUR_INTRANET_SLB_ID>
    # Whether to automatically create SLB port listeners (overrides existing ones); can also be configured manually.
    #service.beta.kubernetes.io/alicloud-loadbalancer-force-override-listeners: 'false'
spec:
  type: LoadBalancer
  # The Cluster policy may route traffic to pods on other nodes
  externalTrafficPolicy: "Cluster"
  ports:
  - port: 80
    name: http
    targetPort: 80
  - port: 443
    name: https
    targetPort: 443
  selector:
    # Select pods with app=ingress-nginx
    app: ingress-nginx

Apply the service resource:
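Assuming the manifest above is saved as my-nginx-ingress-slb-intranet.yaml, it can be applied and checked with:

```shell
# Create the intranet Service in kube-system
kubectl apply -f my-nginx-ingress-slb-intranet.yaml

# Confirm the Service picked up the internal SLB address under EXTERNAL-IP
kubectl get svc nginx-ingress-lb-intranet -n kube-system
```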

Aggregating Prometheus Alert Messages Using Prometheus Alertmanager

Deploy PrometheusAlert

git clone https://github.com/feiyu563/PrometheusAlert.git
cd PrometheusAlert/example/helm/prometheusalert
# Update config/app.conf to set login user info and database configuration
helm install -n monitoring .

Create a WeChat Work Group Robot

After creating a WeChat Work group, right-click the group → “Add Group Robot”. This will generate a webhook URL for the robot. Record this URL for later use.
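With the robot's webhook URL in hand, Alertmanager can be pointed at PrometheusAlert, which formats alerts and forwards them to the group robot. A minimal sketch, assuming PrometheusAlert is reachable in-cluster as prometheus-alert-center on port 8080 (the service name and template name are assumptions; check your deployment and the PrometheusAlert docs):

```yaml
# alertmanager.yml (fragment)
receivers:
- name: wechat-work
  webhook_configs:
  # PrometheusAlert endpoint; type=wx selects the WeChat Work channel,
  # wxurl is the group robot webhook recorded above.
  - url: 'http://prometheus-alert-center:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=<YOUR_ROBOT_WEBHOOK_URL>'
route:
  receiver: wechat-work
```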

Develop a Kubernetes cluster backup strategy

Every internet company's technical team has to deal with backups, and ours is no exception. Today, I'll share my own strategies for backing up production Kubernetes clusters.

My primary goals for Kubernetes backups are to prevent:

  • Accidental deletion of a namespace within the cluster
  • Accidental misconfiguration causing resource anomalies (e.g., deployments, configmaps)
  • Accidental deletion of partial resources in the cluster
  • Loss of etcd data

Backing Up etcd

Backing up etcd guards against a cluster-level catastrophe or the loss of etcd data, either of which could render the entire cluster unusable. In such cases, only a full cluster restore can bring services back.
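A periodic etcdctl snapshot covers this case. A minimal sketch, assuming etcd v3 with TLS certificates in the default kubeadm locations (the endpoint and paths are assumptions; adjust them to your cluster):

```shell
# Take a snapshot of etcd (run on an etcd member)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d%H%M).db

# Check the snapshot's integrity and key count
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db --write-out=table
```

Run this from cron or a Kubernetes CronJob, and ship the snapshot files off-cluster.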

How to quickly set up a Greenplum cluster

Recently, our internal project has been supporting a big data initiative, which requires simulating customer scenarios on Greenplum (the older version 4.2.2.4). Below is a record of the Greenplum cluster setup process; the procedure for newer GP versions is largely identical.

Building Base Image

CentOS 6 Dockerfile:

FROM centos:6

RUN mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
RUN curl -o /etc/yum.repos.d/CentOS-Base.repo https://www.xmpan.com/Centos-6-Vault-Aliyun.repo
RUN yum -y update; yum clean all
RUN yum install -y \
    net-tools \
    ntp \
    openssh-server \
    openssh-clients \
    less \
    iproute \
    lsof \
    wget \
    ed \
    which; yum clean all
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
RUN groupadd gpadmin
RUN useradd gpadmin -g gpadmin
RUN echo gpadmin | passwd gpadmin --stdin
ENTRYPOINT ["/usr/sbin/sshd", "-D"]

Build image:
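The build command itself was not captured above. A typical invocation, assuming the Dockerfile sits in the current directory (the image tag gpdb-base:centos6 is a placeholder of my choosing):

```shell
# Build the Greenplum base image from the CentOS 6 Dockerfile
docker build -t gpdb-base:centos6 .

# Confirm the image was created
docker images gpdb-base
```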

Alibaba Cloud Shared GPU Solution Testing

I. Deploy GPU Sharing Plugin in Kubernetes

Before deployment, ensure that nvidia-driver and nvidia-docker are installed on your Kubernetes nodes, and Docker’s default runtime has been set to nvidia.

# cat /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}

1. Install gpushare-device-plugin via Helm

$ git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
$ cd gpushare-scheduler-extender/deployer/chart
$ helm install --name gpushare --namespace kube-system --set masterCount=3 gpushare-installer

2. Label GPU Nodes

$ kubectl label node sd-cluster-04 gpushare=true
$ kubectl label node sd-cluster-05 gpushare=true

3. Install kubectl-inspect-gpushare

Make sure kubectl is already installed (installation steps omitted here).
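kubectl-inspect-gpushare is distributed as a standalone binary that kubectl picks up as a plugin. A minimal sketch (the release URL and version are assumptions; check the gpushare-device-plugin releases page for the current one):

```shell
# Download the plugin binary and put it on PATH
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x kubectl-inspect-gpushare
mv kubectl-inspect-gpushare /usr/local/bin/

# Show per-node shared GPU memory allocation
kubectl inspect gpushare
```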

Deploying a High-Availability Kubernetes Cluster with kubeadm

To facilitate later verification of private deployment, a quick Kubernetes cluster setup is required in the internal network environment. Previously, for larger clusters, I typically used Kubeasz or Kubespray. For this small-scale cluster, using kubeadm will be more efficient.

Below is the recorded process for deploying with kubeadm:

Cluster Nodes:

192.168.1.206 sd-cluster-206 node
192.168.1.207 sd-cluster-207 master,etcd
192.168.1.208 sd-cluster-208 master,etcd,haproxy,keepalived
192.168.1.209 sd-cluster-209 master,etcd,haproxy,keepalived

Image Versions:

docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.18.3
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.2
docker pull registry.cn-shanghai.aliyuncs.com/gcr-k8s/flannel:v0.14.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v0.48.1

I. Basic Environment Setup

1. Install Docker and Configure Hosts

yum install -y yum-utils device-mapper-persistent-data lvm2 git
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum install docker-ce -y
systemctl start docker
systemctl enable docker
systemctl status docker

2. Configure /etc/hosts

cat >> /etc/hosts << hhhh
192.168.1.207 sd-cluster-207
192.168.1.208 sd-cluster-208
192.168.1.209 sd-cluster-209
hhhh

3. Disable Firewall and Set SELinux

systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

4. Disable Swap

Kubernetes 1.8+ requires swap to be disabled; if it is not, kubelet will fail to start by default.
Option 1: Pass --fail-swap-on=false in the kubelet startup arguments.
Option 2: Disable system swap.
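Option 2 can be sketched as follows:

```shell
# Turn swap off immediately
swapoff -a
# Comment out swap entries so the change survives a reboot
sed -i '/ swap / s/^/#/' /etc/fstab
```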

Implementing Internal DNS with Alibaba Cloud PrivateZone + Bind9 + Dnsmasq

Requirements:

  • Alibaba Cloud cluster can resolve internal domain names
  • Office network resolves internal domain names + internet access resolution

Solution:

  • For the first requirement, directly use Alibaba Cloud PrivateZone for resolution.
  • For the second requirement, configure internal domain zones in PrivateZone, then synchronize them to the office network’s bind9 server using Alibaba Cloud’s synchronization tool. Use Dnsmasq as the DNS entry point for the office network: forward public queries to public DNS servers, and forward internal domain queries to the bind9 server.

Some may wonder: why not use bind9 alone to handle all resolution? The main reason is that, in practice, bind9 shows performance problems when forwarding to multiple upstream DNS servers at once, with occasional timeouts. Dnsmasq handles this scenario noticeably better.
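The split forwarding described above maps directly onto dnsmasq's server= directive. A minimal sketch of the office-network entry point, assuming the internal zone is example.internal and the bind9 server sits at 192.168.1.53 (both values are placeholders):

```conf
# /etc/dnsmasq.conf (fragment)
# Ignore /etc/resolv.conf; use only the servers listed below
no-resolv
# Internal zone -> office bind9 (synced from PrivateZone)
server=/example.internal/192.168.1.53
# Everything else -> public DNS
server=223.5.5.5
server=114.114.114.114
```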

Getting Started with Argo Events

Previously, we introduced how to install Argo Workflow and trigger tasks. In this article, we focus on a new tool: Argo Events.

What is Argo Events?

Argo Events is an event-driven Kubernetes workflow automation framework. It supports over 20 different event sources (e.g., webhooks, S3 drops, cronjobs, message queues such as Kafka, GCP PubSub, SNS, SQS, etc.).

Features:

  • Supports events from over 20 event sources and more than 10 trigger types.
  • Enables customization of business-level constraints for workflow automation.
  • Manages everything from simple, linear, real-time workflows to complex, multi-source event scenarios.
  • Complies with the CloudEvents specification.

Components:

  • EventSource (similar to a gateway; sends messages to the event bus)
  • EventBus (the event message queue, implemented with the high-performance distributed messaging system NATS; note that NATS Streaming, the implementation used here, reached end of life in June 2023, so architectural changes such as a move to JetStream can be expected)
  • EventSensor (subscribes to the message queue, parameterizes events, and filters them)

Deploying Argo Events

Deploy argo-events:

kubectl create ns argo-events
kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.2.3/manifests/install.yaml

Deploy argo-eventbus:

kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/stable/examples/eventbus/native.yaml

RBAC Account Authorization

Create operate-workflow-sa account

Grant operate-workflow-sa permission to create Argo Workflows within the argo-events namespace; this is required for the EventSensor to create workflows automatically later on.
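The grant is typically expressed as a ServiceAccount plus a namespaced Role and RoleBinding. A minimal sketch, modeled on the Argo Events examples (the exact resource and verb lists should be checked against your Argo version):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: operate-workflow-sa
  namespace: argo-events
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
  namespace: argo-events
rules:
# Allow the Sensor to create and manage Argo Workflow resources
- apiGroups: ["argoproj.io"]
  resources: ["workflows", "workflowtemplates", "cronworkflows"]
  verbs: ["create", "get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: operate-workflow-binding
  namespace: argo-events
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: operate-workflow-role
subjects:
- kind: ServiceAccount
  name: operate-workflow-sa
  namespace: argo-events
```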