Exploring the Kubernetes Pod delete mechanism

2020-09-23


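Overview

The rise of containers has made environment deployment simple, but engineers want more than simple deployment: in large clusters they also want applications to scale automatically and to be rescheduled automatically on failure. Container scheduling and scaling work by starting new containers and deleting old ones, so during a scale-down the service inside a container can be reclaimed while some requests are still in flight. How should a software developer guard against that? In this article we start from hands-on experiments and explore how kubernetes reclaims pods.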

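Capturing the container process's exit signal

Writing the test code

Here we write a small program to capture the exit signal a container receives when it is reclaimed during scheduling. Our guess is that the stop signal is one of os.Interrupt, os.Kill, syscall.SIGUSR1, syscall.SIGUSR2, syscall.SIGTERM.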

code

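package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // signal.Notify never blocks when sending, so the channel must be buffered;
    // otherwise a signal that arrives before the receive below could be dropped.
    c := make(chan os.Signal, 1)
    // os.Kill (SIGKILL) is listed only as a candidate: the kernel never
    // delivers it to the process, so it cannot actually be caught.
    signal.Notify(c, os.Interrupt, os.Kill, syscall.SIGUSR1, syscall.SIGUSR2, syscall.SIGTERM)
    fmt.Println("启动程序") // "program started"
    s := <-c
    fmt.Println("退出信号", s) // "exit signal"
}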

Build the application and push it to the registry

Dockerfile

ARG APPLICATION=signal-go
FROM golang:1.12.5 AS build
ARG APPLICATION
WORKDIR /go/src/${APPLICATION}
COPY . /go/src/${APPLICATION}
RUN export GO111MODULE=on;GOOS=linux CGO_ENABLED=0 GOARCH=amd64 go build -mod=vendor -o /usr/bin/${APPLICATION} main.go

FROM alpine:3.8 AS release
ARG APPLICATION
ENV TZ=Asia/Shanghai
COPY --from=build /usr/bin/${APPLICATION}  /opt/${APPLICATION}

ENTRYPOINT ["/opt/signal-go"]

docker build

$ docker build -t chulinx/signal-go:v0.1 .
Sending build context to Docker daemon  283.1kB
Step 1/11 : ARG APPLICATION=signal-go
Step 2/11 : FROM golang:1.12.5 AS build
 ---> 1ef078f0da9e
Step 3/11 : ARG APPLICATION
 ---> Using cache
 ---> 50710c6b7fa1
Step 4/11 : WORKDIR /go/src/${APPLICATION}
 ---> Using cache
 ---> ec17dbabb46a
Step 5/11 : COPY . /go/src/${APPLICATION}
 ---> a5de26852fba
Step 6/11 : RUN export GO111MODULE=on;GOOS=linux CGO_ENABLED=0 GOARCH=amd64 go build -mod=vendor -o /usr/bin/${APPLICATION} main.go
 ---> Running in 367db58b580b
Removing intermediate container 367db58b580b
 ---> cff8ca46278e
Step 7/11 : FROM alpine:3.8 AS release
 ---> c8bccc0af957
Step 8/11 : ARG APPLICATION
 ---> Using cache
 ---> d0eb5d21c76f
Step 9/11 : ENV TZ=Asia/Shanghai
 ---> Using cache
 ---> 51beac203926
Step 10/11 : COPY --from=build /usr/bin/${APPLICATION}  /opt/${APPLICATION}
 ---> ed4bd9452627
Step 11/11 : ENTRYPOINT ["/opt/signal-go"]
 ---> Running in bad91de7f32f
Removing intermediate container bad91de7f32f
 ---> eab08e41aa93
Successfully built eab08e41aa93
Successfully tagged chulinx/signal-go:v0.1

$ docker push chulinx/signal-go:v0.1
The push refers to repository [docker.io/chulinx/signal-go]
91a9e9441356: Pushed
7444ea29e45e: Layer already exists
v0.1: digest: sha256:e9cee8bbee27f7584307e8548acde6d47713ae91fb9e5bc592af4b976e0fe158 size: 739

Deploy the application

We deploy a deployment with 3 replicas, and it runs successfully


$ kubectl create deployment signal-go --image chulinx/signal-go:v0.1 --replicas 3
deployment.apps/signal-go created
$ kubectl get pod  -owide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE              NOMINATED NODE   READINESS GATES
signal-go-64884df4d-7dr52   1/1     Running   0          57s   10.244.3.18   qd-k8s-rebuild2   <none>           <none>
signal-go-64884df4d-cz28n   1/1     Running   0          57s   10.244.2.25   qd-k8s-rebuild4   <none>           <none>
signal-go-64884df4d-fbrh6   1/1     Running   0          57s   10.244.1.57   qd-k8s-rebuild3   <none>           <none>

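Simulate an automatic scale-down and watch the application containers

Using scale to set the deployment's replica count to 2 deletes signal-go-64884df4d-fbrh6, and the captured signal is terminated, i.e. SIGTERM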

$ kubectl scale deployment signal-go --replicas 2
deployment.apps/signal-go scaled
$ kubetail signal
Will tail 3 logs...
signal-go-64884df4d-7dr52
signal-go-64884df4d-cz28n
signal-go-64884df4d-fbrh6

[signal-go-64884df4d-fbrh6] 退出信号 terminated

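What the container engine does when k8s deletes a container

Checking the docker service logs

The docker engine logged two events. Our guess is that the work is handed off to the lower-level containerd

docker logs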

9月 25 08:29:59 qd-k8s-rebuild3 dockerd[28519]: time="2020-09-25T08:29:59.701571791Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
9月 25 08:30:00 qd-k8s-rebuild3 dockerd[28519]: time="2020-09-25T08:30:00.013225242Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

containerd logs

9月 25 08:29:59 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:29:59.577593428Z" level=debug msg="event published" ns=moby topic=/tasks/exit type=containerd.events.TaskExit
9月 25 08:29:59 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:29:59.691533386Z" level=info msg="shim reaped" id=3d64b8cad41f367267d91f2fa195ff22d29f3f510797dbc8c5450c8294143f27
9月 25 08:29:59 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:29:59.701254673Z" level=debug msg="event published" ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
9月 25 08:29:59 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:29:59.737310161Z" level=debug msg="event published" ns=moby topic=/containers/delete type=containerd.events.ContainerDelete
9月 25 08:29:59 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:29:59.791663621Z" level=debug msg="garbage collected" d=1.116588ms
9月 25 08:29:59 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:29:59.880060485Z" level=debug msg="event published" ns=moby topic=/tasks/exit type=containerd.events.TaskExit
9月 25 08:30:00 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:30:00.003525755Z" level=info msg="shim reaped" id=928e62f142c4f09d9891dde995ea003301aa2131fd6cf233a027187ddbaee18b
9月 25 08:30:00 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:30:00.013078844Z" level=debug msg="event published" ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
9月 25 08:30:00 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:30:00.056521853Z" level=debug msg="event published" ns=moby topic=/containers/delete type=containerd.events.ContainerDelete
9月 25 08:30:00 qd-k8s-rebuild3 containerd[28505]: time="2020-09-25T08:30:00.096039564Z" level=debug msg="garbage collected" d=1.096439ms

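The program catches the exit signal

We make the program catch the exit signal without exiting immediately, and observe how long docker waits before it force-kills the container process

Write the test code and build the image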

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

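// Gracefully exit a Go daemon process
func main() {
	// Create the channel that receives exit signals; it is buffered because
	// signal.Notify does not block when sending
	c := make(chan os.Signal, 1)
	// Listen for the given signals (ctrl+c, kill)
	signal.Notify(c, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGUSR1, syscall.SIGUSR2)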
	go func() {
		for s := range c {
			switch s {
			case syscall.SIGHUP, syscall.SIGINT,  syscall.SIGQUIT:
				fmt.Println("退出", s)
				ExitFunc()
			case syscall.SIGTERM:
				fmt.Println("signal: ",s)
				fmt.Println("不退出...")
			case syscall.SIGUSR1:
				fmt.Println("usr1", s)
			case syscall.SIGUSR2:
				fmt.Println("usr2", s)
			default:
				fmt.Println("other", s)
			}
		}
	}()

	fmt.Println("进程启动...")
	sum := 0
	for {
		sum++
		fmt.Println("sum:", sum)
		time.Sleep(time.Second)
	}
}

func ExitFunc() {
	fmt.Println("开始退出...")
	fmt.Println("执行清理...")
	fmt.Println("结束退出...")
	os.Exit(0)
}

$ docker build -t chulinx/signal-go:v0.3 .
$ docker push chulinx/signal-go:v0.3

Start the v2 test application

$ kubectl create deployment signal-go-v2 --image chulinx/signal-go:v0.2 --replicas 3
$ kubectl get pod
NAME                           READY   STATUS    RESTARTS   AGE
signal-go-64884df4d-7dr52      1/1     Running   0          139m
signal-go-64884df4d-8tmvw      1/1     Running   0          22m
signal-go-64884df4d-cz28n      1/1     Running   0          139m
signal-go-v2-68f8d449b-8jl4v   1/1     Running   0          38s
signal-go-v2-68f8d449b-8qb59   1/1     Running   0          38s
signal-go-v2-68f8d449b-gs56h   1/1     Running   0          38s

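Simulate automatic container scheduling

Running kubectl scale deployment signal-go-v3 --replicas 2 manually scales the deployment down to 2 replicas. The pod does not exit immediately, and the docker log shows the following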

$ kubectl get pod -owide
signal-go-v3-7b587d5944-h4sjr   1/1     Running       0          74s    10.244.2.27   qd-k8s-rebuild4   <none>           <none>
signal-go-v3-7b587d5944-zfj9l   1/1     Running       0          74s    10.244.3.20   qd-k8s-rebuild2   <none>           <none>
signal-go-v3-7b587d5944-zswn5   1/1     Terminating   0          74s    10.244.1.66   qd-k8s-rebuild3   <none>           <none>
9月 25 10:00:33 qd-k8s-rebuild3 dockerd[28304]: time="2020-09-25T10:00:33.568021655Z" level=info msg="Container 180440b110c9640900e32ecaa2398967a9f6c7da2f58df752b27bbd9810db515 failed to exit within 30 seconds of signal 15 - using the force"

This line says that the container process did not exit within 30 seconds of being sent signal 15 (SIGTERM), so docker force-killed it

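Reading the official docker source

Roughly, the logic is this: dockerd first sends the container its StopSignal. If sending the signal returns an error, dockerd waits up to 2 seconds and, if the container is still running, calls daemon.killPossiblyDeadProcess(container, 9) to force-kill it. It then waits for the container to exit for up to the number of seconds passed into the function; once that time is exceeded, it calls daemon.Kill to stop the container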

func (daemon *Daemon) containerStop(container *containerpkg.Container, seconds int) error {
	if !container.IsRunning() {
		return nil
	}

	stopSignal := container.StopSignal()
	// 1. Send a stop signal
	if err := daemon.killPossiblyDeadProcess(container, stopSignal); err != nil {
		// While normally we might "return err" here we're not going to
		// because if we can't stop the container by this point then
		// it's probably because it's already stopped. Meaning, between
		// the time of the IsRunning() call above and now it stopped.
		// Also, since the err return will be environment specific we can't
		// look for any particular (common) error that would indicate
		// that the process is already dead vs something else going wrong.
		// So, instead we'll give it up to 2 more seconds to complete and if
		// by that time the container is still running, then the error
		// we got is probably valid and so we force kill it.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		defer cancel()

		if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
			logrus.Infof("Container failed to stop after sending signal %d to the process, force killing", stopSignal)
			if err := daemon.killPossiblyDeadProcess(container, 9); err != nil {
				return err
			}
		}
	}

	// 2. Wait for the process to exit on its own
	ctx := context.Background()
	if seconds >= 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds)*time.Second)
		defer cancel()
	}

	if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
		logrus.Infof("Container %v failed to exit within %d seconds of signal %d - using the force", container.ID, seconds, stopSignal)
		// 3. If it doesn't, then send SIGKILL
		if err := daemon.Kill(container); err != nil {
			// Wait without a timeout, ignore result.
			<-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)
			logrus.Warn(err) // Don't return error because we only care that container is stopped, not what function stopped it
		}
	}

	daemon.LogContainerEvent(container, "stop")
	return nil
}

At this point we have a rough picture of how a container is deleted: both the normal stop signal and the timeout before a forced exit are passed in.

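Next, the docker run parameters

There are two stop-related parameters: the stop signal (SIGTERM by default) and the timeout. Both are specified when the container is started, which largely confirms our guess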

$ docker run --help
      --stop-signal string             Signal to stop a container (default "SIGTERM")
      --stop-timeout int               Timeout (in seconds) to stop a container

Run a container with a stop timeout

$ docker run -itd --stop-timeout=80 nginx:latest
$ docker ps|grep nginx:latest
7f419a6a08f3        nginx:latest                            "/docker-entrypoint.…"   30 seconds ago      Up 30 seconds       80/tcp              vigorous_shirley

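Checking the container's configuration

We use jq to filter out these two stop parameters: the container's stop signal is the default, "SIGTERM", and its stop timeout is 80, which is the --stop-timeout we set when starting the container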

$ cat /var/lib/docker/containers/7f419a6a08f330436ba82b5120e66f4c56b219583d9a32f2486c9d24b8406325/config.v2.json |jq ".Config.StopTimeout"
80
$ cat /var/lib/docker/containers/7f419a6a08f330436ba82b5120e66f4c56b219583d9a32f2486c9d24b8406325/config.v2.json |jq ".Config.StopSignal"
"SIGTERM"

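Kubernetes' wrapper around the container stop timeout

terminationGracePeriodSeconds, the default timeout

k8s wraps docker's --stop-timeout as terminationGracePeriodSeconds.
The official description of the container lifecycle contains this passage: "The behavior is similar for a PreStop hook. If the hook hangs during execution, the Pod's phase stays Terminating until the Pod is killed after its terminationGracePeriodSeconds expires. If a PostStart or PreStop hook fails, it kills the container." The passage mentions terminationGracePeriodSeconds, whose default value is 30 seconds, as we can see by creating a deployment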

$ kubectl create deployment signal-go-v3 --image chulinx/signal-go:v0.3 && kubectl get deployments.apps signal-go-v3 --template={{.spec.template.spec.terminationGracePeriodSeconds}}
deployment.apps/signal-go-v3 created
30

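Verifying that terminationGracePeriodSeconds is StopTimeout

We go back to the node and check the container's Config.StopTimeout

From the docker log of the timed-out deletion above we can be fairly sure that terminationGracePeriodSeconds is StopTimeout, yet the container's config file has no .Config.StopTimeout. Why is that? Could the guess be wrong?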

$ kubectl get pod signal-go-v3-7b587d5944-9mm2k  -o jsonpath="{.status.containerStatuses[*].containerID}"
docker://943bb0a0abb0802c76973304b2620aedb2cad3841ca1e96fa06a811216178ad9
$ cat /var/lib/docker/containers/943bb0a0abb0802c76973304b2620aedb2cad3841ca1e96fa06a811216178ad9/config.v2.json |jq ".Config.StopTimeout"
null
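
The same check can be scripted. The sketch below decodes only the two fields queried with jq and uses a pointer for StopTimeout so that a JSON null stays distinguishable from 0. The sample documents are hand-written stand-ins for config.v2.json, and the names containerConfig and parseStopSettings are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerConfig models only the fields this article inspects with jq.
type containerConfig struct {
	Config struct {
		StopSignal  string
		StopTimeout *int // nil when the JSON value is null or absent
	}
}

// parseStopSettings (hypothetical) extracts the stop signal and timeout.
func parseStopSettings(raw []byte) (string, *int, error) {
	var c containerConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return "", nil, err
	}
	return c.Config.StopSignal, c.Config.StopTimeout, nil
}

func main() {
	// Hand-written samples, not real config.v2.json files.
	samples := []string{
		`{"Config":{"StopSignal":"SIGTERM","StopTimeout":80}}`,
		`{"Config":{"StopTimeout":null}}`,
	}
	for _, s := range samples {
		sig, timeout, err := parseStopSettings([]byte(s))
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		if timeout == nil {
			fmt.Printf("signal=%q timeout=<not set>\n", sig)
		} else {
			fmt.Printf("signal=%q timeout=%d\n", sig, *timeout)
		}
	}
}
```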

We change terminationGracePeriodSeconds to 80 and check the dockerd log again

$ kubectl patch deployments.apps signal-go-v3 -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":80}}}}' |kubectl get deployments.apps signal-go-v3 --template={{.spec.template.spec.terminationGracePeriodSeconds}}
80
$ kubectl get pod
NAME                            READY   STATUS    RESTARTS   AGE
signal-go-v3-7bd97dc4cf-wqr9f   1/1     Running   0          49s

Sep 27 02:40:00 minikube dockerd[2586]: time="2020-09-27T02:40:00.604412985Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/bc423238de1fb412e656dc9b25521a063f051c0fa8fdd61893d201c00bbc19f8/shim.sock" debug=false pid=19880
Sep 27 02:40:01 minikube dockerd[2586]: time="2020-09-27T02:40:01.070067862Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/8007942d737b24836b8ea8fb2b73ce32c2052db56d190822f4aade3b646ad6d1/shim.sock" debug=false pid=19939
Sep 27 02:41:20 minikube dockerd[2586]: time="2020-09-27T02:41:20.248459175Z" level=info msg="Container d8233ddb8d4425cd6c3543a65c5012e51fc226bd16a913e0a058d4b77dde0957 failed to exit within 80 seconds of signal 15 - using the force"

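The forced-kill time changed from the original 30 to the 80 we set, so we can be sure that terminationGracePeriodSeconds is the stop timeout. But kubelet did not set StopTimeout when it created the container, so where is it specified?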

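The answer from the kubelet source

kubelet's wrapper implementation of StopContainer

kubelet passes two arguments to ds.client.StopContainer: the container ID and the timeout we have been looking for. Further down, we see that kubelet receives the request parameters over gRPC and carries out the container deletion. In other words, the timeout is supplied with each stop request rather than stored in the container's config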

// StopContainer stops a running container with a grace period (i.e., timeout).
func (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {
	err := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)
	if err != nil {
		return nil, err
	}
	return &runtimeapi.StopContainerResponse{}, nil
}
// gRPC call
func _RuntimeService_StopContainer_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
	in := new(StopContainerRequest)
	if err := dec(in); err != nil {
		return nil, err
	}
	if interceptor == nil {
		return srv.(RuntimeServiceServer).StopContainer(ctx, in)
	}
	info := &grpc.UnaryServerInfo{
		Server:     srv,
		FullMethod: "/runtime.v1alpha2.RuntimeService/StopContainer",
	}
	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
		return srv.(RuntimeServiceServer).StopContainer(ctx, req.(*StopContainerRequest))
	}
	return interceptor(ctx, in, info, handler)
}

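Summary

Finally, a rough flowchart of how kubernetes deletes a pod

Author: ChuLinx
Link: http://yoursite.com/2020/09/23/探究kubernetes Pod的delete机制/
License: Unless otherwise stated, all articles on this blog are licensed under the MIT license. Please credit the source when reposting!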