
Containerd Deep Dive: CRI


Written by | Duan Quanfeng (段全锋)

Edited by | zouyee

Duan Quanfeng: software engineer, familiar with the K8s architecture and well versed in low-level container runtime internals.

Our production K8s clusters have completed the switch of container runtime from Docker to Containerd. Some colleagues are unfamiliar with the components in the K8s-to-Containerd call chain and have questions such as: what is the relationship between Containerd and runc, how do Docker and containerd differ, what does the flow look like when K8s asks Containerd to create a container, and how does runc ultimately create the container? This article walks through the entire call chain of K8s using Containerd as its runtime, with source-level analysis.

The layered architecture between the kubelet and the container runtimes is shown in the figure below.

For an introduction to the various runtimes, see the previous installment, Containerd Deep Dive: Runtime.

“Runtime Overview”

A container runtime manages the entire lifecycle of a container. Concretely, that covers how container images are built, what the image format looks like, how images are managed and distributed, how a container is run, and how the created container instances are managed.

Container runtimes have an industry standard, the OCI specification, which is split into two parts (a small code sketch of a bundle's config.json follows the list):

a. Runtime specification: describes how to run a container from a bundle. A bundle is a directory containing the container's spec file, named config.json, and a rootfs; the rootfs holds the operating-system files the container needs at runtime.

b. Image specification: defines how container images are packaged and how an image is converted into a bundle.
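
To make the bundle layout concrete, below is a minimal sketch that writes a config.json using the OCI runtime-spec Go types. It is only an illustration under assumptions: the bundle path and command are made up, and a real spec (for example one generated by runc spec) also carries Linux namespaces, capabilities, mounts, and more.

package main

import (
  "encoding/json"
  "os"
  "path/filepath"

  specs "github.com/opencontainers/runtime-spec/specs-go"
)

// writeMinimalSpec writes a stripped-down config.json into bundleDir.
// A runnable bundle would also contain a rootfs/ directory with the OS files.
func writeMinimalSpec(bundleDir string) error {
  spec := specs.Spec{
    Version: specs.Version, // OCI runtime-spec version string
    Root:    &specs.Root{Path: "rootfs", Readonly: true},
    Process: &specs.Process{
      Cwd:  "/",
      Args: []string{"/bin/sh", "-c", "echo hello from the bundle"},
      Env:  []string{"PATH=/usr/sbin:/usr/bin:/sbin:/bin"},
    },
    Hostname: "demo",
  }
  data, err := json.MarshalIndent(&spec, "", "  ")
  if err != nil {
    return err
  }
  return os.WriteFile(filepath.Join(bundleDir, "config.json"), data, 0o644)
}

func main() {
  // /tmp/demo-bundle is a hypothetical path used only for this sketch.
  if err := writeMinimalSpec("/tmp/demo-bundle"); err != nil {
    panic(err)
  }
}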


Runtimes are commonly divided into low-level and high-level runtimes. Low-level runtimes focus on how to create a container, for example runc and Kata; high-level runtimes add more upper-layer features such as image management, with Docker and containerd as the typical examples.

The K8s kubelet creates containers by calling a container runtime, but with so many runtimes it cannot adapt to each one individually, so K8s defines the CRI interface; any runtime that implements it can be used. The figures below show the call chains when K8s uses Docker and containerd respectively. With containerd, the CRI interface is implemented inside the containerd code base; with Docker, it was implemented inside the K8s code base as docker-shim (kubernetes/pkg/kubelet/dockershim/docker_service.go). That code lived in K8s for historical reasons, from the time when Docker was the de facto industry standard for containers, but as more and more runtimes implemented CRI support, maintaining docker-shim became a growing burden on the community. In the latest K8s releases this code has been removed and is now maintained by Mirantis. The figure below shows the evolution of these plugins.

“Containerd CRI Overview”

Containerd is an industry-standard container runtime. It runs as a daemon and manages the complete lifecycle of containers on a host along with their filesystems, covering image distribution and storage, container execution and supervision, and the underlying storage and networking.

Containerd serves multiple clients, such as K8s and Docker. To keep containers and images from different clients isolated, containerd has the concept of namespaces: by default Docker uses the namespace moby, while K8s uses k8s.io.

In containerd, a container object represents a container's metadata, while a Task takes that container object and turns it into a process that can run on the operating system; the Task is the runnable side of a container.
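
To make the distinction concrete, here is a small, hedged sketch using the containerd Go client: it works inside the k8s.io namespace (so it sees kubelet-created containers) and, for each container metadata object, looks up the task that represents its running process. The socket path is the common default, and the sketch only reads existing state.

package main

import (
  "context"
  "log"

  "github.com/containerd/containerd"
  "github.com/containerd/containerd/namespaces"
)

func main() {
  // Connect to the containerd daemon over its default socket.
  client, err := containerd.New("/run/containerd/containerd.sock")
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  // All calls below happen in the k8s.io namespace used by the CRI plugin.
  ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

  containers, err := client.Containers(ctx) // container = metadata objects
  if err != nil {
    log.Fatal(err)
  }
  for _, c := range containers {
    // Task = the runnable side of the container; absent if it is not running.
    task, err := c.Task(ctx, nil)
    if err != nil {
      log.Printf("container %s: no running task (%v)", c.ID(), err)
      continue
    }
    status, _ := task.Status(ctx)
    log.Printf("container %s: pid=%d status=%s", c.ID(), task.Pid(), status.Status)
  }
}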

The cri plugin inside containerd implements the K8s CRI interface, so the kubelet can use containerd directly. The CRI interface comprises two services, RuntimeService and ImageService:

// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
  // Version returns the runtime name, runtime version, and runtime API version.
  rpc Version(VersionRequest) returns (VersionResponse) {}

  // RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
  // the sandbox is in the ready state on success.
  rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
  // StopPodSandbox stops any running process that is part of the sandbox and
  // reclaims network resources (e.g., IP addresses) allocated to the sandbox.
  // If there are any running containers in the sandbox, they must be forcibly
  // terminated.
  // This call is idempotent, and must not return an error if all relevant
  // resources have already been reclaimed. kubelet will call StopPodSandbox
  // at least once before calling RemovePodSandbox. It will also attempt to
  // reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
  // multiple StopPodSandbox calls are expected.
  rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
  // RemovePodSandbox removes the sandbox. If there are any running containers
  // in the sandbox, they must be forcibly terminated and removed.
  // This call is idempotent, and must not return an error if the sandbox has
  // already been removed.
  rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
  // PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
  // present, returns an error.
  rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
  // ListPodSandbox returns a list of PodSandboxes.
  rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

  // CreateContainer creates a new container in specified PodSandbox
  rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
  // StartContainer starts the container.
  rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
  // StopContainer stops a running container with a grace period (i.e., timeout).
  // This call is idempotent, and must not return an error if the container has
  // already been stopped.
  // The runtime must forcibly kill the container after the grace period is
  // reached.
  rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
  // RemoveContainer removes the container. If the container is running, the
  // container must be forcibly removed.
  // This call is idempotent, and must not return an error if the container has
  // already been removed.
  rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
  // ListContainers lists all containers by filters.
  rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
  // ContainerStatus returns status of the container. If the container is not
  // present, returns an error.
  rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
  // UpdateContainerResources updates ContainerConfig of the container synchronously.
  // If runtime fails to transactionally update the requested resources, an error is returned.
  rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
  // ReopenContainerLog asks runtime to reopen the stdout/stderr log file
  // for the container. This is often called after the log file has been
  // rotated. If the container is not running, container runtime can choose
  // to either create a new log file and return nil, or return an error.
  // Once it returns error, new container log file MUST NOT be created.
  rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

  // ExecSync runs a command in a container synchronously.
  rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
  // Exec prepares a streaming endpoint to execute a command in the container.
  rpc Exec(ExecRequest) returns (ExecResponse) {}
  // Attach prepares a streaming endpoint to attach to a running container.
  rpc Attach(AttachRequest) returns (AttachResponse) {}
  // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
  rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

  // ContainerStats returns stats of the container. If the container does not
  // exist, the call returns an error.
  rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
  // ListContainerStats returns stats of all running containers.
  rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}

  // PodSandboxStats returns stats of the pod sandbox. If the pod sandbox does not
  // exist, the call returns an error.
  rpc PodSandboxStats(PodSandboxStatsRequest) returns (PodSandboxStatsResponse) {}
  // ListPodSandboxStats returns stats of the pod sandboxes matching a filter.
  rpc ListPodSandboxStats(ListPodSandboxStatsRequest) returns (ListPodSandboxStatsResponse) {}

  // UpdateRuntimeConfig updates the runtime configuration based on the given request.
  rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

  // Status returns the status of the runtime.
  rpc Status(StatusRequest) returns (StatusResponse) {}
}

// ImageService defines the public APIs for managing images.
service ImageService {
  // ListImages lists existing images.
  rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
  // ImageStatus returns the status of the image. If the image is not
  // present, returns a response with ImageStatusResponse.Image set to
  // nil.
  rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
  // PullImage pulls an image with authentication config.
  rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
  // RemoveImage removes the image.
  // This call is idempotent, and must not return an error if the image has
  // already been removed.
  rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
  // ImageFSInfo returns information of the filesystem that is used to store images.
  rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}

The flow of the kubelet calling the CRI interface to create a Pod containing two application containers, A and B, is as follows (a minimal client-side sketch of these calls is given after the list):

① Create the sandbox for the Pod

② Create container A

③ Start container A

④ Create container B

⑤ Start container B
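
As a rough illustration of what the kubelet does at the gRPC level, the hedged sketch below drives these CRI calls by hand against containerd's CRI endpoint. It is not the kubelet's actual code: the socket path, metadata values, and image name are assumptions, error handling is minimal, and the image is assumed to have been pulled already (otherwise an ImageService.PullImage call would come first).

package main

import (
  "context"
  "log"

  "google.golang.org/grpc"
  "google.golang.org/grpc/credentials/insecure"
  runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
  // containerd's CRI plugin serves the CRI gRPC API on the containerd socket.
  conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
    grpc.WithTransportCredentials(insecure.NewCredentials()))
  if err != nil {
    log.Fatal(err)
  }
  defer conn.Close()
  rs := runtimeapi.NewRuntimeServiceClient(conn)
  ctx := context.Background()

  // ① RunPodSandbox: create and start the pod-level sandbox.
  sandboxConfig := &runtimeapi.PodSandboxConfig{
    Metadata: &runtimeapi.PodSandboxMetadata{Name: "demo", Namespace: "default", Uid: "uid-1"},
  }
  sb, err := rs.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{Config: sandboxConfig})
  if err != nil {
    log.Fatal(err)
  }

  // ②③ CreateContainer + StartContainer for container A (repeat for container B).
  created, err := rs.CreateContainer(ctx, &runtimeapi.CreateContainerRequest{
    PodSandboxId: sb.PodSandboxId,
    Config: &runtimeapi.ContainerConfig{
      Metadata: &runtimeapi.ContainerMetadata{Name: "a"},
      Image:    &runtimeapi.ImageSpec{Image: "docker.io/library/busybox:latest"},
      Command:  []string{"sleep", "3600"},
    },
    SandboxConfig: sandboxConfig,
  })
  if err != nil {
    log.Fatal(err)
  }
  if _, err := rs.StartContainer(ctx, &runtimeapi.StartContainerRequest{ContainerId: created.ContainerId}); err != nil {
    log.Fatal(err)
  }
}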

“Containerd CRI Implementation”

RunPodSandbox

The RunPodSandbox flow is as follows:

① Pull the sandbox image (the sandbox image is configured in containerd)

② Determine the runtime to use for the pod; it can be specified in the pod YAML, and if not, containerd's default runtime is used

(runtimes are configured in containerd)

③ If the pod is not hostNetwork, create a new network namespace and set up networking with the CNI plugin (criService loads the CNI plugin configuration specified in containerd's cri section during initialization)

④ Call the containerd client to create a container

⑤ Create a directory named after the pod ID under rootDir/io.containerd.grpc.v1.cri/sandboxes for the current pod

(pkg/cri/cri.go)

⑥ Create a task for the sandbox container using the selected runtime

⑦ Start the sandbox container's task and add the sandbox to the database

The code is in containerd/pkg/cri/server/sandbox_run.go:

// RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure
// the sandbox is in ready state.
func (c *criService) RunPodSandbox(ctx context.Context, r *runtime.RunPodSandboxRequest) (_ *runtime.RunPodSandboxResponse, retErr error) {
  config := r.GetConfig()
  log.G(ctx).Debugf("Sandbox config %+v", config)

  // Generate unique id and name for the sandbox and reserve the name.
  id := util.GenerateID()
  metadata := config.GetMetadata()
  if metadata == nil {
    return nil, errors.New("sandbox config must include metadata")
  }
  name := makeSandboxName(metadata)
  log.G(ctx).WithField("podsandboxid", id).Debugf("generated id for sandbox name %q", name)

  // cleanupErr records the last error returned by the critical cleanup operations in deferred functions,
  // like CNI teardown and stopping the running sandbox task.
  // If cleanup is not completed for some reason, the CRI-plugin will leave the sandbox
  // in a not-ready state, which can later be cleaned up by the next execution of the kubelet's syncPod workflow.
  var cleanupErr error

  // Reserve the sandbox name to avoid concurrent `RunPodSandbox` request starting the
  // same sandbox.
  if err := c.sandboxNameIndex.Reserve(name, id); err != nil {
    return nil, fmt.Errorf("failed to reserve sandbox name %q: %w", name, err)
  }
  defer func() {
    // Release the name if the function returns with an error.
    // When cleanupErr != nil, the name will be cleaned in sandbox_remove.
    if retErr != nil && cleanupErr == nil {
      c.sandboxNameIndex.ReleaseByName(name)
    }
  }()

  var (
    err         error
    sandboxInfo = sb.Sandbox{ID: id}
  )

  ociRuntime, err := c.getSandboxRuntime(config, r.GetRuntimeHandler())
  if err != nil {
    return nil, fmt.Errorf("unable to get OCI runtime for sandbox %q: %w", id, err)
  }

  sandboxInfo.Runtime.Name = ociRuntime.Type

  // Retrieve runtime options
  runtimeOpts, err := generateRuntimeOptions(ociRuntime, c.config)
  if err != nil {
    return nil, fmt.Errorf("failed to generate sandbox runtime options: %w", err)
  }

  ...

  // Create initial internal sandbox object.
  sandbox := sandboxstore.NewSandbox(
    ...
  )

  if _, err := c.client.SandboxStore().Create(ctx, sandboxInfo); err != nil {
    return nil, fmt.Errorf("failed to save sandbox metadata: %w", err)
  }
 
  ...

  // Setup the network namespace if host networking wasn't requested.
  if !hostNetwork(config) {
    netStart := time.Now()
    // If it is not in host network namespace then create a namespace and set the sandbox
    // handle. NetNSPath in sandbox metadata and NetNS is non empty only for non host network
    // namespaces. If the pod is in host network namespace then both are empty and should not
    // be used.
    var netnsMountDir = "/var/run/netns"
    if c.config.NetNSMountsUnderStateDir {
      netnsMountDir = filepath.Join(c.config.StateDir, "netns")
    }
    sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
    if err != nil {
      return nil, fmt.Errorf("failed to create network namespace for sandbox %q: %w", id, err)
    }
    // Update network namespace in the store, which is used to generate the container's spec
    sandbox.NetNSPath = sandbox.NetNS.GetPath()
    defer func() {
      // Remove the network namespace only if all the resource cleanup is done
      if retErr != nil && cleanupErr == nil {
        if cleanupErr = sandbox.NetNS.Remove(); cleanupErr != nil {
          log.G(ctx).WithError(cleanupErr).Errorf("Failed to remove network namespace %s for sandbox %q", sandbox.NetNSPath, id)
          return
        }
        sandbox.NetNSPath = ""
      }
    }()

    if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
      return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
    }
    // Save sandbox metadata to store
    if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
      return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
    }

    // Define this defer to teardownPodNetwork prior to the setupPodNetwork function call.
    // This is because in setupPodNetwork the resource is allocated even if it returns error, unlike other resource
    // creation functions.
    defer func() {
      // Remove the network namespace only if all the resource cleanup is done.
      if retErr != nil && cleanupErr == nil {
        deferCtx, deferCancel := ctrdutil.DeferContext()
        defer deferCancel()
        // Teardown network if an error is returned.
        if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
          log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
        }

      }
    }()

    // Setup network for sandbox.
    // Certain VM based solutions like clear containers (Issue containerd/cri-containerd#524)
    // rely on the assumption that CRI shim will not be querying the network namespace to check the
    // network states such as IP.
    // In future runtime implementation should avoid relying on CRI shim implementation details.
    // In this case however caching the IP will add a subtle performance enhancement by avoiding
    // calls to network namespace of the pod to query the IP of the veth interface on every
    // SandboxStatus request.
    if err := c.setupPodNetwork(ctx, &sandbox); err != nil {
      return nil, fmt.Errorf("failed to setup network for sandbox %q: %w", id, err)
    }
    sandboxCreateNetworkTimer.UpdateSince(netStart)
  }

  if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
    return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
  }

  controller, err := c.getSandboxController(config, r.GetRuntimeHandler())
  if err != nil {
    return nil, fmt.Errorf("failed to get sandbox controller: %w", err)
  }

  // Save sandbox metadata to store
  if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
    return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
  }

  runtimeStart := time.Now()

  if err := controller.Create(ctx, id); err != nil {
    return nil, fmt.Errorf("failed to create sandbox %q: %w", id, err)
  }

  resp, err := controller.Start(ctx, id)
  if err != nil {
    sandbox.Container, _ = c.client.LoadContainer(ctx, id)
    if resp != nil && resp.SandboxID == "" && resp.Pid == 0 && resp.CreatedAt == nil && len(resp.Labels) == 0 {
      // if resp is a non-nil zero-value, an error occurred during cleanup
      cleanupErr = fmt.Errorf("failed to cleanup sandbox")
    }
    return nil, fmt.Errorf("failed to start sandbox %q: %w", id, err)
  }

  // TODO: get rid of this. sandbox object should no longer have Container field.
  if ociRuntime.SandboxMode == string(criconfig.ModePodSandbox) {
    container, err := c.client.LoadContainer(ctx, id)
    if err != nil {
      return nil, fmt.Errorf("failed to load container %q for sandbox: %w", id, err)
    }
    sandbox.Container = container
  }

  labels := resp.GetLabels()
  if labels == nil {
    labels = map[string]string{}
  }

  sandbox.ProcessLabel = labels["selinux_label"]

  if err := sandbox.Status.Update(func(status sandboxstore.Status) (sandboxstore.Status, error) {
    // Set the pod sandbox as ready after successfully start sandbox container.
    status.Pid = resp.Pid
    status.State = sandboxstore.StateReady
    status.CreatedAt = protobuf.FromTimestamp(resp.CreatedAt)
    return status, nil
  }); err != nil {
    return nil, fmt.Errorf("failed to update sandbox status: %w", err)
  }

  // Add sandbox into sandbox store in INIT state.
  if err := c.sandboxStore.Add(sandbox); err != nil {
    return nil, fmt.Errorf("failed to add sandbox %+v into store: %w", sandbox, err)
  }

  // Send CONTAINER_CREATED event with both ContainerId and SandboxId equal to SandboxId.
  // Note that this has to be done after sandboxStore.Add() because we need to get
  // SandboxStatus from the store and include it in the event.
  c.generateAndSendContainerEvent(ctx, id, id, runtime.ContainerEventType_CONTAINER_CREATED_EVENT)

  // start the monitor after adding sandbox into the store, this ensures
  // that sandbox is in the store, when event monitor receives the TaskExit event.
  //
  // TaskOOM from containerd may come before sandbox is added to store,
  // but we don't care about sandbox TaskOOM right now, so it is fine.
  go func() {
    resp, err := controller.Wait(ctrdutil.NamespacedContext(), id)
    if err != nil {
      log.G(ctx).WithError(err).Error("failed to wait for sandbox controller, skipping exit event")
      return
    }

    e := &eventtypes.TaskExit{
      ContainerID: id,
      ID:          id,
      // Pid is not used
      Pid:        0,
      ExitStatus: resp.ExitStatus,
      ExitedAt:   resp.ExitedAt,
    }
    c.eventMonitor.backOff.enBackOff(id, e)
  }()

  // Send CONTAINER_STARTED event with ContainerId equal to SandboxId.
  c.generateAndSendContainerEvent(ctx, id, id, runtime.ContainerEventType_CONTAINER_STARTED_EVENT)

  sandboxRuntimeCreateTimer.WithValues(labels["oci_runtime_type"]).UpdateSince(runtimeStart)

  return &runtime.RunPodSandboxResponse{PodSandboxId: id}, nil
}

CreateContainer

CreateContainer creates the metadata for a new container in the specified PodSandbox. The flow is:

① Fetch the sandbox information for the container

② Initialize the image handler for the image the container will use

③ Create a directory named after the container ID under rootDir/io.containerd.grpc.v1.cri for the container

④ Obtain the runtime to use from the sandbox

⑤ Generate the container spec for the container

⑥ Create the container with the containerd client

⑦ Persist the container's information

The code is in containerd/pkg/cri/server/container_create.go; an abridged version follows.

func (c *criService) CreateContainer(ctx context.Context, r *runtime.CreateContainerRequest) (_ *runtime.CreateContainerResponse, retErr error) {
    sandbox, err := c.sandboxStore.Get(r.GetPodSandboxId())
    s, err := sandbox.Container.Task(ctx, nil)
    sandboxPid := s.Pid()
    image, err := c.localResolve(config.GetImage().GetImage())
    if err != nil {
        return nil, errors.Wrapf(err, "failed to resolve image %q", config.GetImage().GetImage())
    }
    containerdImage, err := c.toContainerdImage(ctx, image)
    // Run container using the same runtime with sandbox.
    sandboxInfo, err := sandbox.Container.Info(ctx)
    if err != nil {
        return nil, errors.Wrapf(err, "failed to get sandbox %q info", sandboxID)
    }

    // Create container root directory.
    containerRootDir := c.getContainerRootDir(id)
    if err = c.os.MkdirAll(containerRootDir, 0755); err != nil {
        return nil, errors.Wrapf(err, "failed to create container root directory %q", containerRootDir)
    }
    ociRuntime, err := c.getSandboxRuntime(sandboxConfig, sandbox.Metadata.RuntimeHandler)
    if err != nil {
        return nil, errors.Wrap(err, "failed to get sandbox runtime")
    }

    spec, err := c.containerSpec(id, sandboxID, sandboxPid, sandbox.NetNSPath, containerName, containerdImage.Name(), config, sandboxConfig,
        &image.ImageSpec.Config, append(mounts, volumeMounts...), ociRuntime)
    if err != nil {
        return nil, errors.Wrapf(err, "failed to generate container %q spec", id)
    }
    opts = append(opts,
        containerd.WithSpec(spec, specOpts...),
        containerd.WithRuntime(sandboxInfo.Runtime.Name, runtimeOptions),
        containerd.WithContainerLabels(containerLabels),
        containerd.WithContainerExtension(containerMetadataExtension, &meta))
    var cntr containerd.Container
    if cntr, err = c.client.NewContainer(ctx, id, opts...); err != nil {
        return nil, errors.Wrap(err, "failed to create containerd container")
    }
    // Add container into container store.
    if err := c.containerStore.Add(container); err != nil {
        return nil, errors.Wrapf(err, "failed to add container %q into store", id)
    }
}

StartContainer

StartContainer starts a container. The flow is:

① Read the saved container metadata

② Read the associated sandbox information

③ Create a task for the container

④ Start the task

The code is in containerd/pkg/cri/server/container_start.go; an abridged version follows:

func (c *criService) StartContainer(ctx context.Context, r *runtime.StartContainerRequest) (retRes *runtime.StartContainerResponse, retErr error) {
    cntr, err := c.containerStore.Get(r.GetContainerId())
    // Get sandbox config from sandbox store.
    sandbox, err := c.sandboxStore.Get(meta.SandboxID)
    ctrInfo, err := container.Info(ctx)
    if err != nil {
        return nil, errors.Wrap(err, "failed to get container info")
    }

    taskOpts := c.taskOpts(ctrInfo.Runtime.Name)
    task, err := container.NewTask(ctx, ioCreation, taskOpts...)
    if err != nil {
        return nil, errors.Wrap(err, "failed to create containerd task")
    }
    // wait is a long running background request, no timeout needed.
    exitCh, err := task.Wait(ctrdutil.NamespacedContext())
    // Start containerd task.
    if err := task.Start(ctx); err != nil {
        return nil, errors.Wrapf(err, "failed to start containerd task %q", id)
    }
}

The task-creation code below uses the containerd client's TasksClient to send a create-task request to the containerd daemon:

func (c *container) NewTask(ctx context.Context, ioCreate cio.Creator, opts ...NewTaskOpts) (_ Task, err error) {
    ......
    request := &tasks.CreateTaskRequest{
        ContainerID: c.id,
        Terminal:    cfg.Terminal,
        Stdin:       cfg.Stdin,
        Stdout:      cfg.Stdout,
        Stderr:      cfg.Stderr,
    }
    ......
    response, err := c.client.TaskService().Create(ctx, request)
    ......
}

The task-start code below likewise uses the containerd client's TasksClient to send a start-task request to the daemon:

func (t *task) Start(ctx context.Context) error {
  r, err := t.client.TaskService().Start(ctx, &tasks.StartRequest{
    ContainerID: t.id,
  })
  if err != nil {
    if t.io != nil {
      t.io.Cancel()
      t.io.Close()
    }
    return errdefs.FromGRPC(err)
  }
  t.pid = r.Pid
  return nil
}

“Task Service”

Task Service: task creation flow

The code below shows how tasks-service handles a create-task request and creates the container through the configured runtime.

func (l *local) Create(ctx context.Context, r *api.CreateTaskRequest, _ ...grpc.CallOption) (*api.CreateTaskResponse, error) {
    container, err := l.getContainer(ctx, r.ContainerID)
    ......
    rtime, err := l.getRuntime(container.Runtime.Name)
    if err != nil {
        return nil, err
    }
    _, err = rtime.Get(ctx, r.ContainerID)
    if err != nil && err != runtime.ErrTaskNotExists {
        return nil, errdefs.ToGRPC(err)
    }
    if err == nil {
        return nil, errdefs.ToGRPC(fmt.Errorf("task %s already exists", r.ContainerID))
    }
    c, err := rtime.Create(ctx, r.ContainerID, opts)
    ......
    return &api.CreateTaskResponse{
        ContainerID: r.ContainerID,
        Pid:         c.PID(),
    }, nil
}

The runtime-level creation code below starts the shim and then sends it the create request.

func (m *TaskManager) Create(ctx context.Context, id string, opts runtime.CreateOpts) (_ runtime.Task, retErr error) {
    ......
    shim, err := m.startShim(ctx, bundle, id, opts)

    t, err := shim.Create(ctx, opts)
    .....
}

startShim invokes the shim executable to launch a service:

func (m *TaskManager) startShim(ctx context.Context, bundle *Bundle, id string, opts runtime.CreateOpts) (*shim, error) {
    ......
    b := shimBinary(ctx, bundle, opts.Runtime, m.containerdAddress, m.containerdTTRPCAddress, m.events, m.tasks)
    shim, err := b.Start(ctx, topts, func() {
        log.G(ctx).WithField("id", id).Info("shim disconnected")
        cleanupAfterDeadShim(context.Background(), id, ns, m.tasks, m.events, b)
        m.tasks.Delete(ctx, id)
    })
    ......
}

The executable used to run the shim command is containerd-shim-<runtime>-<version>. For example, with the commonly used runtime type io.containerd.runc.v2, the executable is containerd-shim-runc-v2, and the full command looks like:

containerd-shim-runc-v2 -namespace xxxx -address xxxx -publish-binary xxxx -id xxxx start
func (b *binary) Start(ctx context.Context, opts *types.Any, onClose func()) (_ *shim, err error) {
    args := []string{"-id", b.bundle.ID}

    args = append(args, "start")

    cmd, err := client.Command(ctx,
        b.runtime,
        b.containerdAddress,
        b.containerdTTRPCAddress,
        b.bundle.Path,
        opts,
        args...,
    )
    ......
    out, err := cmd.CombinedOutput()
    if err != nil {
        return nil, errors.Wrapf(err, "%s", out)
    }
    address := strings.TrimSpace(string(out))
    conn, err := client.Connect(address, client.AnonDialer)
    if err != nil {
        return nil, err
    }
    onCloseWithShimLog := func() {
        onClose()
        cancelShimLog()
        f.Close()
    }
    client := ttrpc.NewClient(conn, ttrpc.WithOnClose(onCloseWithShimLog))
    return &shim{
        bundle: b.bundle,
        client: client,
    }, nil
}

Task Service: task start flow

Below is the flow of tasks-service starting a task:

func (l *local) Start(ctx context.Context, r *api.StartRequest, _ ...grpc.CallOption) (*api.StartResponse, error) {
  t, err := l.getTask(ctx, r.ContainerID)
  if err != nil {
    return nil, err
  }
  p := runtime.Process(t)
  if r.ExecID != "" {
    if p, err = t.Process(ctx, r.ExecID); err != nil {
      return nil, errdefs.ToGRPC(err)
    }
  }
  if err := p.Start(ctx); err != nil {
    return nil, errdefs.ToGRPC(err)
  }
  state, err := p.State(ctx)
  if err != nil {
    return nil, errdefs.ToGRPC(err)
  }
  return &api.StartResponse{
    Pid: state.Pid,
  }, nil
}

The container's process is started by sending a request to the shim server.

func (s *shim) Start(ctx context.Context) error {
    response, err := s.task.Start(ctx, &task.StartRequest{
        ID: s.ID(),
    })
    if err != nil {
        return errdefs.FromGRPC(err)
    }
    s.taskPid = int(response.Pid)
    return nil
}

“Containerd-shim Startup Flow”

In containerd/runtime/v2/shim/shim.go, RunManager(ctx context.Context, manager Manager, opts ...BinaryOpts) is the code entry point for containerd-shim-runc-v2 start:

case "start":
    opts := StartOpts{
      Address:      addressFlag,
      TTRPCAddress: ttrpcAddress,
      Debug:        debugFlag,
    }

    address, err := manager.Start(ctx, id, opts)
    if err != nil {
      return err
    }
    if _, err := os.Stdout.WriteString(address); err != nil {
      return err
    }
    return nil
  }

The containerd-shim-runc-v2 start process then spawns another process, containerd-shim-runc-v2 -namespace xxxx -id xxxx -address xxxx, which runs the shim server.

func (manager) Start(ctx context.Context, id string, opts shim.StartOpts) (_ string, retErr error) {
  cmd, err := newCommand(ctx, id, opts.Address, opts.TTRPCAddress, opts.Debug)
  ...
  // make sure that reexec shim-v2 binary use the value if need
  if err := shim.WriteAddress("address", address); err != nil {
    return "", err
  }
  ...
  if err := cmd.Start(); err != nil {
    f.Close()
    return "", err
  }
  ...
  // make sure to wait after start
  go cmd.Wait()
  ...
  
  server, err := newServer(ttrpc.WithUnaryServerInterceptor(unaryInterceptor))
  if err != nil {
    return fmt.Errorf("failed creating server: %w", err)
  }

  for _, srv := range ttrpcServices {
    if err := srv.RegisterTTRPC(server); err != nil {
      return fmt.Errorf("failed to register service: %w", err)
    }
  }

  if err := serve(ctx, server, signals, sd.Shutdown); err != nil {
    if err != shutdown.ErrShutdown {
      return err
    }
  }
 }

The shim server is a ttrpc service that exposes the following interface:

type TaskService interface {
  State(context.Context, *StateRequest) (*StateResponse, error)
  Create(context.Context, *CreateTaskRequest) (*CreateTaskResponse, error)
  Start(context.Context, *StartRequest) (*StartResponse, error)
  Delete(context.Context, *DeleteRequest) (*DeleteResponse, error)
  Pids(context.Context, *PidsRequest) (*PidsResponse, error)
  Pause(context.Context, *PauseRequest) (*emptypb.Empty, error)
  Resume(context.Context, *ResumeRequest) (*emptypb.Empty, error)
  Checkpoint(context.Context, *CheckpointTaskRequest) (*emptypb.Empty, error)
  Kill(context.Context, *KillRequest) (*emptypb.Empty, error)
  Exec(context.Context, *ExecProcessRequest) (*emptypb.Empty, error)
  ResizePty(context.Context, *ResizePtyRequest) (*emptypb.Empty, error)
  CloseIO(context.Context, *CloseIORequest) (*emptypb.Empty, error)
  Update(context.Context, *UpdateTaskRequest) (*emptypb.Empty, error)
  Wait(context.Context, *WaitRequest) (*WaitResponse, error)
  Stats(context.Context, *StatsRequest) (*StatsResponse, error)
  Connect(context.Context, *ConnectRequest) (*ConnectResponse, error)
  Shutdown(context.Context, *ShutdownRequest) (*emptypb.Empty, error)
}

Creating a task executes the runc create --bundle xxxx xxxx command; see the reference code:

func (r *Runc) Create(context context.Context, id, bundle string, opts *CreateOpts) error {
    args := []string{"create", "--bundle", bundle}
    ......
    cmd := r.command(context, append(args, id)...)
    ......
    ec, err := Monitor.Start(cmd)
    ......
}

Starting a task executes the runc start xxxx command; see the reference code:

func (r *Runc) Start(context context.Context, id string) error { 
    return r.runOrError(r.command(context, "start", id))
}
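
Putting the two calls together: the shim drives runc through the go-runc wrapper. As a hedged usage sketch (the container ID and bundle path below are hypothetical, and the bundle must already exist with a config.json and rootfs as described in the OCI section above), the Go equivalent of those two commands is:

package main

import (
  "context"
  "log"

  runc "github.com/containerd/go-runc"
)

func main() {
  ctx := context.Background()
  r := &runc.Runc{} // uses the "runc" binary found in PATH by default

  // Equivalent to: runc create --bundle /run/demo/bundle demo-id
  if err := r.Create(ctx, "demo-id", "/run/demo/bundle", &runc.CreateOpts{}); err != nil {
    log.Fatal(err)
  }
  // Equivalent to: runc start demo-id
  if err := r.Start(ctx, "demo-id"); err != nil {
    log.Fatal(err)
  }
}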

Summary

The flow of the kubelet creating a sandbox can be summarized as follows:

① The cri plugin in containerd creates and persists the sandbox metadata

② The cri plugin creates and persists the sandbox container

③ The cri plugin calls tasks-service over gRPC to create a task

④ tasks-service spawns the containerd-shim-xxxx-xxxx start process

⑤ The containerd-shim-xxxx-xxxx start process spawns the containerd-shim-xxxx-xxxx process and exits

⑥ The containerd-shim-xxxx-xxxx process starts the shim server, which provides a ttrpc service

⑦ tasks-service calls the shim server's Create API to create the task; the shim server executes runc create

⑧ The cri plugin calls tasks-service over gRPC to start the task

⑨ tasks-service calls the shim server's Start API to start the task; the shim server executes runc start

Originally published: 2022-12-28, shared from the DCOS WeChat official account.
