Previous article: https://cloud.tencent.com/developer/article/2531005
RMA and atomic operations can both read and write memory that is owned by a peer process, and neither require the involvement of the target processor. Because the memory can be modified over the network, an application must opt into exposing its memory to peers. This is handled by the memory registration process. Registered memory regions associate memory buffers with permissions granted for access by fabric resources. A memory buffer must be registered before it can be used as the target of a remote RMA or atomic data transfer. Additionally, a fabric provider may require that data buffers be registered before being used in local transfers. The latter is necessary to ensure that the virtual to physical page mappings do not change.
Although there are a few different attributes that apply to memory registration, OFI groups those attributes into one of two different modes (for application simplicity).
Basic memory registration mode is defined around supporting the InfiniBand, RoCE, and iWarp architectures, which maps well to a wide variety of RMA capable hardware. In basic mode, registration occurs on allocated memory buffers, and the MR attributes are selected by the provider. The application must only register allocated memory, and the protection keys that are used to access the memory are assigned by the provider. The impact of using basic registration is that the application must inform any peer that wishes to access the region the local virtual address of the memory buffer, along with the key to use when accessing it. Peers must provide both the key and the target's virtual address as part of the RMA operation.
Although not part of the basic memory registration mode definition, hardware that supports this mode frequently requires that all data buffers used for network communication also be registered. This includes buffers posted to send or receive messages, source RMA and atomic buffers, and tagged message buffers. This restriction is indicated using the FI_LOCAL_MR mode bit. This restriction is needed to ensure that the virtual to physical address mappings do not change between a request being submitted and the hardware processing it.
Scalable memory registration targets highly parallel, high-performance applications. Such applications often have an additional level of security that allows the peers to operate in a more trusted environment where memory registration is employed. In scalable mode, registration occurs on memory address ranges, and the MR attributes are selected by the user. There are two notable differences with scalable mode.
First is that the address ranges do not need to map to allocated memory buffers at the time the registration call is made. (Virtual memory must back the ranges before they are accessed as part of any data transfer operation.) This allows, for example, for an application to expose all or a significant portion of its address space to peers. When combined with a symmetric memory allocator, this feature can eliminate a process from needing to store the target addresses of its peers. Second, the application selects the protection key for the region. Target addresses and keys can be hard-coded or determined algorithmically, reducing the memory footprint and avoiding network traffic associated with registration.
The following APIs highlight how to allocate and access a registered memory region. Note that this is not a complete list of memory region (MR) calls, and for full details on each API, readers should refer directly to the man pages.
int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len,
uint64_t access, uint64_t offset, uint64_t requested_key, uint64_t flags,
struct fid_mr **mr, void *context);
void * fi_mr_desc(struct fid_mr *mr);
uint64_t fi_mr_key(struct fid_mr *mr);

By default, memory regions are associated with a domain. A MR is accessible by any endpoint that is opened on that domain. A region starts at the address specified by 'buf' and is 'len' bytes long. The 'access' parameter is a set of permission flags that are OR'ed together. The permissions indicate which types of operations may be invoked against the region (e.g. FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE). The 'buf' parameter must point to allocated virtual memory when using basic registration mode.
If scalable registration is used, the application can specify the desired MR key through the 'requested_key' parameter. The 'offset' and 'flags' parameters are not used and are reserved for future use.
A MR is associated with local and remote protection keys. The local key is referred to as a memory descriptor and may be retrieved by calling fi_mr_desc(). This call is only needed if the FI_LOCAL_MR mode bit has been set. The memory descriptor is passed directly into data transfer operations, for example:
/* fi_mr_desc() example using fi_send() */
fi_send(ep, buf, len, fi_mr_desc(mr), 0, NULL);

The remote key, or simply MR key, is used by the peer when targeting the MR with an RMA or atomic operation. If scalable registration is used, the MR key will be the same as the 'requested_key'. Otherwise, it is a provider-selected value. The key must be known to the peer. If basic registration is used, this means that the key will need to be sent in a separate message to the initiating peer. (Some applications exchange the key as part of connection setup.)
The API is designed to handle MR keys that are at most 64-bits long. The size of the actual key is reported as a domain attribute. Typical sizes are either 32 or 64 bits, depending on the underlying fabric. Support for keys larger than 64-bits is possible but requires using extended calls not discussed here.
https://ofiwg.github.io/libfabric/v1.14.1/man/fi_endpoint.3.html
Endpoints are transport level communication portals. Opening an endpoint is trivial after calling fi_getinfo(); however, there are different open calls depending on the type of endpoint to allocate. There are separate calls to open active, passive, and scalable endpoints.
Active endpoints may be connection-oriented or connection-less. The data transfer interfaces – messages (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics (fi_atomic) – are associated with active endpoints. In basic configurations, an active endpoint has transmit and receive queues. In general, operations that generate traffic on the fabric are posted to the transmit queue. This includes all RMA and atomic operations, along with sent messages and sent tagged messages. Operations that post buffers for receiving incoming data are submitted to the receive queue.
Active endpoints are created in the disabled state. They must transition into an enabled state before accepting data transfer operations, including posting of receive buffers. The fi_enable call is used to transition an active endpoint into an enabled state. The fi_connect and fi_accept calls will also transition an endpoint into the enabled state, if it is not already enabled. An endpoint may immediately be allocated after opening a domain, using the same fi_info structure that was returned from fi_getinfo().
int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
struct fid_ep **ep, void *context);

In order to transition an endpoint into an enabled state, it must be bound to one or more fabric resources. An endpoint that will generate asynchronous completions, either through data transfer operations or communication establishment events, must be bound to appropriate completion queues or event queues, respectively, before being enabled. Unconnected endpoints must be bound to an address vector.
/* Example to enable an unconnected endpoint */
/* Allocate an address vector and associate it with the endpoint */
fi_av_open(domain, &av_attr, &av, NULL);
fi_ep_bind(ep, &av->fid, 0);
/* Allocate and associate completion queues with the endpoint */
fi_cq_open(domain, &tx_cq_attr, &tx_cq, NULL);
fi_ep_bind(ep, &tx_cq->fid, FI_TRANSMIT);
fi_cq_open(domain, &rx_cq_attr, &rx_cq, NULL);
fi_ep_bind(ep, &rx_cq->fid, FI_RECV);
fi_enable(ep);

In the above example, we allocate an address vector and transmit and receive completion queues. The attributes for the address vector and completion queues are omitted (additional discussion below). These are then associated with the endpoint through the fi_ep_bind() call. After all necessary resources have been assigned to the endpoint, we enable it. Enabling the endpoint indicates to the provider that it should allocate any hardware and software resources and complete the initialization for the endpoint.
The fi_enable() call is always called for unconnected endpoints. Connected endpoints may be able to skip calling fi_enable(), since fi_connect() and fi_accept() will enable the endpoint automatically. However, applications may still call fi_enable() prior to calling fi_connect() or fi_accept(). Doing so allows the application to post receive buffers to the endpoint, which ensures that they are available to receive data in the case where the peer endpoint sends messages immediately after it establishes the connection.
Passive endpoints are used to listen for incoming connection requests. Passive endpoints are of type FI_EP_MSG, and may not perform any data transfers. An application wishing to create a passive endpoint typically calls fi_getinfo() using the FI_SOURCE flag, often only specifying a 'service' address. The service address corresponds to a TCP port number.
Passive endpoints are associated with event queues. Event queues report connection requests from peers. Unlike active endpoints, passive endpoints are not associated with a domain. This allows an application to listen for connection requests across multiple domains.
/* Example passive endpoint listen */
fi_passive_ep(fabric, info, &pep, NULL);
fi_eq_open(fabric, &eq_attr, &eq, NULL);
fi_pep_bind(pep, &eq->fid, 0);
fi_listen(pep);

A passive endpoint must be bound to an event queue before calling listen. This ensures that connection requests can be reported to the application. To accept new connections, the application waits for a request, allocates a new active endpoint for it, and accepts the request.
/* Example accepting a new connection */
/* Wait for a CONNREQ event */
fi_eq_sread(eq, &event, &cm_entry, sizeof cm_entry, -1, 0);
assert(event == FI_CONNREQ);
/* Allocate a new endpoint for the connection */
if (!cm_entry.info->domain_attr->domain)
fi_domain(fabric, cm_entry.info, &domain, NULL);
fi_endpoint(domain, cm_entry.info, &ep, NULL);
/* See the resource binding section below for details on associated fabric objects */
fi_ep_bind(ep, &eq->fid, 0);
fi_cq_open(domain, &tx_cq_attr, &tx_cq, NULL);
fi_ep_bind(ep, &tx_cq->fid, FI_TRANSMIT);
fi_cq_open(domain, &rx_cq_attr, &rx_cq, NULL);
fi_ep_bind(ep, &rx_cq->fid, FI_RECV);
fi_enable(ep);
fi_recv(ep, rx_buf, len, NULL, 0, NULL);
fi_accept(ep, NULL, 0);
fi_eq_sread(eq, &event, &cm_entry, sizeof cm_entry, -1, 0);
assert(event == FI_CONNECTED);

The connection request event (FI_CONNREQ) includes information about the type of endpoint to allocate, including default attributes to use. If a domain has not already been opened for the endpoint, one must be opened. Then the endpoint and related resources can be allocated. Unlike the unconnected endpoint example above, a connected endpoint does not have an AV, but does need to be bound to an event queue. In this case, we use the same EQ as the listening endpoint. Once the other EP resources (e.g. CQs) have been allocated and bound, the EP can be enabled.
To accept the connection, the application calls fi_accept(). Note that because of thread synchronization issues, it is possible for the active endpoint to receive data even before fi_accept() can return. The posting of receive buffers prior to calling fi_accept() handles this condition, which avoids network flow control issues occurring immediately after connecting.
The fi_eq_sread() calls are blocking (synchronous) read calls to the event queue. These calls wait until an event occurs, which in this case are connection request and establishment events.
For most applications, an endpoint consists of a transmit and receive context associated with a single address. The transmit and receive contexts often map to hardware command queues. For multi-threaded applications, access to these hardware queues requires serialization, which can lead to them becoming bottlenecks. Scalable endpoints were created to address this.
A scalable endpoint is an endpoint that has multiple transmit and/or receive contexts associated with it. As an example, consider an application that allocates a total of four processing threads. By assigning each thread its own transmit context, the application can avoid serializing (i.e. locking) access to hardware queues.
The advantage of using a scalable endpoint over allocating multiple traditional endpoints is reduced addressing footprint. A scalable endpoint has a single address, regardless of how many transmit or receive contexts it may have.
Support for scalable endpoints is provider specific, with support indicated by the domain attributes:
struct fi_domain_attr {
...
size_t max_ep_tx_ctx;
size_t max_ep_rx_ctx;
	...
};

The above fields indicate the maximum number of transmit and receive contexts, respectively, that may be associated with a single endpoint. One or both of these values will be greater than one if scalable endpoints are supported. Applications can configure and allocate a scalable endpoint using the fi_scalable_ep call:
/* Set the required number of transmit and receive contexts.
* These must be <= the domain maximums listed above.
* This will usually be set prior to calling fi_getinfo
*/
struct fi_info *hints, *info;
struct fid_domain *domain;
struct fid_ep *scalable_ep, *tx_ctx[4], *rx_ctx[2];
hints = fi_allocinfo();
...
/* A scalable endpoint requires > 1 Tx or Rx queue */
hints->ep_attr->tx_ctx_cnt = 4;
hints->ep_attr->rx_ctx_cnt = 2;
/* Call fi_getinfo and open fabric, domain, etc. */
...
fi_scalable_ep(domain, info, &scalable_ep, NULL);

The above example opens an endpoint with four transmit and two receive contexts. However, a scalable endpoint only needs to be scalable in one dimension -- transmit or receive. For example, it could use multiple transmit contexts, but only require a single receive context. It could even use a shared context, if desired.
Submitting data transfer operations to a scalable endpoint is more involved. First, if the endpoint only has a single transmit context, then all transmit operations are posted directly to the scalable endpoint, the same as if a traditional endpoint were used. Likewise, if the endpoint only has a single receive context, then all receive operations are posted directly to the scalable endpoint. An additional step is needed before posting operations to one of many contexts, that is, the 'scalable' portion of the endpoint. The desired context must first be retrieved:
/* Retrieve the first (index 0) transmit and receive contexts */
fi_tx_context(scalable_ep, 0, info->tx_attr, &tx_ctx[0], NULL);
fi_rx_context(scalable_ep, 0, info->rx_attr, &rx_ctx[0], NULL);

Data transfer operations are then posted to the tx_ctx or rx_ctx. It should be noted that although the scalable endpoint, transmit context, and receive context are all of type fid_ep, attempting to submit a data transfer operation against the wrong object will result in an error.
By default all transmit and receive contexts belonging to a scalable endpoint are similar with respect to other transmit and receive contexts. However, applications can request that a context have fewer capabilities than what was requested for the scalable endpoint. This allows the provider to configure its hardware resources for optimal performance. For example, suppose a scalable endpoint has been configured for tagged message and RMA support. An application can open a transmit context with only tagged message support, and another context with only RMA support.
Before an endpoint can be used for data transfers, it must be associated with other resources, such as completion queues, counters, address vectors, or event queues. Resource bindings must be done prior to enabling an endpoint. All active endpoints must be bound to completion queues. Unconnected endpoints must be associated with an address vector. Passive and connection-oriented endpoints must be bound to an event queue. The resource binding requirements are cumulative: for example, an RDM endpoint must be bound to completion queues and address vectors.
As shown in previous examples, resources are associated with endpoints using a bind operation:
int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);
int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);
int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

The bind functions are similar to each other (and map to the same fi_bind call internally). Flags are used to indicate how the resources should be associated. The passive endpoint section above shows an example of binding passive and active endpoints to event and completion queues.
The properties of an endpoint are specified using endpoint attributes. These may be set as hints passed into the fi_getinfo call. Unset values will be filled out by the provider.
struct fi_ep_attr {
enum fi_ep_type type;
uint32_t protocol;
uint32_t protocol_version;
size_t max_msg_size;
size_t msg_prefix_size;
size_t max_order_raw_size;
size_t max_order_war_size;
size_t max_order_waw_size;
uint64_t mem_tag_format;
size_t tx_ctx_cnt;
size_t rx_ctx_cnt;
};

A full description of each field is available in the libfabric man pages, with selected details listed below.
This indicates the type of endpoint: reliable datagram (FI_EP_RDM), reliable-connected (FI_EP_MSG), or unreliable datagram (FI_EP_DGRAM). (Mercury (HG), for example, uses the reliable datagram endpoint type by default.) Nearly all applications will need to specify the endpoint type as a hint passed into fi_getinfo, as most applications will only be coded to support a single endpoint type.
This size is the maximum size for any data transfer operation that goes over the endpoint. For unreliable datagram endpoints, this is often the MTU of the underlying network. For reliable endpoints, this value is often a restriction of the underlying transport protocol. Applications that require transfers larger than the maximum reported size are required to break up a single, large transfer into multiple operations.
Providers expose their hardware or network limits to the applications, rather than segmenting large transfers internally, in order to minimize completion overhead. For example, for a provider to support large message segmentation internally, it would need to emulate all completion mechanisms (queues and counters) in software, even if transfers that are larger than the transport supported maximum were never used.
This field specifies data ordering. It defines the delivery order of transport data into target memory for RMA and atomic operations. Data ordering requires message ordering.
For example, suppose that an application issues two RMA write operations to the same target memory location. (The application may be writing a time stamp value every time a local condition is met, for instance). Message ordering indicates that the first write as initiated by the sender is the first write processed by the receiver. Data ordering indicates whether the data from the first write updates memory before the second write updates memory.
The max_order_xxx_size fields indicate how large a message may be while still achieving data ordering. If a field is 0, then no data ordering is guaranteed. If a field is the same as the max_msg_size, then data order is guaranteed for all messages.
It is common for providers to support data ordering up to max_msg_size for back to back operations that are the same. For example, an RMA write followed by an RMA write may have data ordering regardless of the size of the data transfer (max_order_waw_size = max_msg_size). Mixed operations, such as a read followed by a write, are often more restricted. This is because RMA read operations may require acknowledgments from the initiator, which impacts the re-transmission protocol.
For example, consider an RMA read followed by a write. The target will process the read request, retrieve the data, and send a reply. While that is occurring, a write is received that wants to update the same memory location accessed by the read. If the target processes the write, it will overwrite the memory used by the read. If the read response is lost, and the read is retried, the target will be unable to re-send the data. To handle this, the target either needs to: defer handling the write until it receives an acknowledgment for the read response, buffer the read response so it can be re-transmitted, or indicate that data ordering is not guaranteed.
Because the read or write operation may be gigabytes in size, deferring the write may add significant latency, and buffering the read response may be impractical. The max_order_xxx_size fields indicate how large back to back operations may be with ordering still maintained. In many cases, read after write and write and read ordering may be significantly limited, but still usable for implementing specific algorithms, such as a global locking mechanism.
The endpoint attributes define the overall abilities for the endpoint; however, attributes that apply specifically to receive or transmit contexts are defined by struct fi_rx_attr and fi_tx_attr, respectively:
struct fi_rx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t total_buffered_recv;
size_t size;
size_t iov_limit;
};
struct fi_tx_attr {
uint64_t caps;
uint64_t mode;
uint64_t op_flags;
uint64_t msg_order;
uint64_t comp_order;
size_t inject_size;
size_t size;
size_t iov_limit;
size_t rma_iov_limit;
};

Context capabilities must be a subset of the endpoint capabilities. For many applications, the default attributes returned by the provider will be sufficient, with the application only needing to specify endpoint attributes.
Both context attributes include an op_flags field. This field is used by applications to specify the default operation flags to use with any call. For example, by setting the transmit context’s op_flags to FI_INJECT, the application has indicated to the provider that all transmit operations should assume ‘inject’ behavior is desired. (I.e. the buffer provided to the call must be returned to the application upon return from the function). The op_flags applies to all operations that do not provide flags as part of the call (e.g. fi_sendmsg). A common use of op_flags is to specify the default completion semantic desired (discussed next) by the application.
It should be noted that some attributes depend on the peer endpoint having supporting attributes in order to achieve correct application behavior. For example, message order must be compatible between the initiator's transmit attributes and the target's receive attributes. Any mismatch may result in incorrect behavior that could be difficult to debug.
Data transfer operations complete asynchronously. Libfabric defines two mechanisms by which an application can be notified that an operation has completed: completion queues and counters.
Regardless of which mechanism is used to notify the application that an operation is done, developers must be aware of what a completion indicates.
In all cases, a completion indicates that it is safe to reuse the buffer(s) associated with the data transfer. This completion mode is referred to as inject complete and corresponds to the operational flags FI_INJECT_COMPLETE. However, a completion may also guarantee stronger semantics.
Although libfabric does not define an implementation, a provider can meet the requirement for inject complete by copying the application’s buffer into a network buffer before generating the completion. Even if the transmit operation is lost and must be retried, the provider can resend the original data from the copied location. For large transfers, a provider may not mark a request as inject complete until the data has been acknowledged by the target. Applications, however, should only infer that it is safe to reuse their data buffer for an inject complete operation.
Transmit complete is a completion mode that provides slightly stronger guarantees to the application. The meaning of transmit complete depends on whether the endpoint is reliable or unreliable. For an unreliable endpoint (FI_EP_DGRAM), a transmit completion indicates that the request has been delivered to the network. That is, the message has left the local NIC. For reliable endpoints, a transmit complete occurs when the request has reached the target endpoint. Typically, this indicates that the target has acked the request. Transmit complete maps to the operation flag FI_TRANSMIT_COMPLETE.
A third completion mode is defined to provide guarantees beyond transmit complete. With transmit complete, an application knows that the message is no longer dependent on the local NIC or network (e.g. switches). However, the data may be buffered at the remote NIC and has not necessarily been written to the target memory. As a result, data sent in the request may not be visible to all processes. The third completion mode is delivery complete.
Delivery complete indicates that the results of the operation are available to all processes on the fabric. The distinction between transmit and delivery complete is subtle, but important. It often deals with when the target endpoint generates an acknowledgment to a message. For providers that offload transport protocol to the NIC, support for transmit complete is common. Delivery complete guarantees are more easily met by providers that implement portions of their protocol on the host processor. Delivery complete corresponds to the FI_DELIVERY_COMPLETE operation flag.
Applications can request a default completion mode when opening an endpoint by setting one of the above mentioned complete flags as an op_flags for the context's attributes. However, it is usually recommended that applications use the provider's default flags for best performance, and amend their protocols to achieve the desired completion semantics. For example, many applications will perform a 'finalize' or 'commit' procedure as part of their operation, which synchronizes the processing of all peers and guarantees that all previously sent data has been received.
Completion queues often map directly to provider hardware mechanisms, and libfabric is designed around minimizing the software impact of accessing those mechanisms. Unlike other objects discussed so far (fabrics, domains, endpoints), completion queues are not part of the fi_info structure or involved with the fi_getinfo() call.
All active endpoints must be bound with one or more completion queues. This is true even if completions will be suppressed by the application (e.g. using the FI_SELECTIVE_COMPLETION flag). Completion queues are needed to report operations that complete in error.
Transmit and receive contexts are each associated with their own completion queue. An endpoint may direct transmit and receive completions to separate CQs or to the same CQ. Using a single CQ reduces system resource utilization, while separating completions into different CQs can simplify code maintenance or improve multi-threaded execution. A CQ may be shared among multiple endpoints.
CQs are allocated separately from endpoints and are associated with endpoints through the fi_ep_bind() function.
The properties of a completion queue are specified using the fi_cq_attr structure:
struct fi_cq_attr {
	size_t               size;
	uint64_t             flags;
	enum fi_cq_format    format;
	enum fi_wait_obj     wait_obj;
	int                  signaling_vector;
	enum fi_cq_wait_cond wait_cond;
	struct fid_wait      *wait_set;
};
Select details are described below.
The CQ size is the number of entries that the CQ can store before being overrun. If resource management is disabled, then the application is responsible for ensuring that it does not submit more operations than the CQ can store. When selecting an appropriate size for a CQ, developers should consider the size of all transmit and receive contexts that insert completions into the CQ.
Because CQs often map to hardware constructs, their size may be limited to a pre-set maximum. Applications should be prepared to allocate multiple CQs if they make use of a lot of endpoints, such as a connection-oriented server application. Applications should size the CQ correctly to avoid wasting system resources, while still protecting against queue overruns.
In order to minimize the amount of data that a provider must report, the type of completion data written back to the application is selectable. This limits the number of bytes the provider writes to memory, and allows the necessary completion data to fit into a compact structure. Each CQ format maps to a specific completion structure. Developers should analyze each structure, select the smallest one that contains all of the data the application requires, and specify the corresponding enum value as the CQ format.
For example, if an application only needs to know which request completed, along with the size of a received message, it can select the following:
cq_attr->format = FI_CQ_FORMAT_MSG;

struct fi_cq_msg_entry {
	void     *op_context;
	uint64_t flags;
	size_t   len;
};
Once the format has been selected, the underlying provider will assume that read operations against the CQ will pass in an array of the corresponding structure. The CQ data formats are designed such that a structure that reports more information can be cast to one that reports less.
Wait objects are a way for an application to suspend execution until it has been notified that a completion is ready to be retrieved from the CQ. The use of wait objects is recommended over busy waiting (polling) techniques for most applications. CQs include calls, such as fi_cq_sread() (synchronous read), that will block until a completion occurs. Applications that will only use the libfabric blocking calls should select FI_WAIT_UNSPEC as their wait object. This allows the provider to select an object that is optimal for its implementation.
Applications that need to wait on other resources, such as open file handles or sockets, can request that a specific wait object be used. The most common alternative to FI_WAIT_UNSPEC is FI_WAIT_FD. This associates a file descriptor with the CQ. The file descriptor may be retrieved from the CQ using an fi_control() operation, and can be passed to standard operating system calls, such as select() or poll().
Completions may be read from a CQ by using one of the non-blocking calls, fi_cq_read / fi_cq_readfrom, or one of the blocking calls, fi_cq_sread / fi_cq_sreadfrom. Regardless of which call is used, applications pass in an array of completion structures based on the selected CQ format. The CQ interfaces are optimized for batch completion processing, allowing the application to retrieve multiple completions from a single read call. The difference between the read and readfrom calls is that readfrom returns source addressing data, if available. The readfrom derivative of the calls is only useful for unconnected endpoints, and only if the corresponding endpoint has been configured with the FI_SOURCE capability.
FI_SOURCE requires that the provider use the source address available in the raw completion data, such as the packet's source address, to retrieve a matching entry in the endpoint’s address vector. Applications that carry some sort of source identifier as part of their data packets can avoid the overhead associated with using FI_SOURCE.
Because the selected completion structure is insufficient to report all data necessary to debug or handle an operation that completes in error, failed operations are reported using a separate fi_cq_readerr() function. This call takes as input a CQ error entry structure, which allows the provider to report more information regarding the reason for the failure.
/* read error prototype */
ssize_t fi_cq_readerr(struct fid_cq *cq, struct fi_cq_err_entry *buf, uint64_t flags);

/* error data structure */
struct fi_cq_err_entry {
	void     *op_context;
	uint64_t flags;
	size_t   len;
	void     *buf;
	uint64_t data;
	uint64_t tag;
	size_t   olen;
	int      err;
	int      prov_errno;
	void     *err_data;
};

/* Sample error handling */
struct fi_cq_msg_entry entry;
struct fi_cq_err_entry err_entry;
int ret;

ret = fi_cq_read(cq, &entry, 1);
if (ret == -FI_EAVAIL)
	ret = fi_cq_readerr(cq, &err_entry, 0);
As illustrated, if an error entry has been inserted into the completion queue, then attempting to read the CQ will result in the read call returning -FI_EAVAIL (error available). This indicates that the application must use the fi_cq_readerr() call to remove the failed operation's completion information before other completions can be reaped from the CQ.
A fabric error code regarding the failure is reported as the err field in the fi_cq_err_entry structure. A provider specific error code is also available through the prov_errno field. This field can be decoded into a displayable string using the fi_cq_strerror() routine. The err_data field is provider specific data that assists the provider in decoding the reason for the failure.
Completion counters are conceptually very simple completion mechanisms that return the number of completions that have occurred on an endpoint. No other details about the completions are available. Counters work well for connection-oriented applications that make use of strict completion ordering (rx/tx attribute comp_order = FI_ORDER_STRICT), or applications that need to collect a specific number of responses from peers.
An endpoint has more flexibility with how many counters it can use relative to completion queues. Different types of operations can update separate counters. For instance, sent messages can update one counter, while RMA writes can update another. This allows for simple, yet powerful usage models, as control message completions can be tracked independently from large data transfers. Counters are associated with active endpoints using the fi_ep_bind() call:
/* Example binding a counter to an endpoint.
 * The counter will update on completion of any transmit operation.
 */
fi_ep_bind(ep, &cntr->fid, FI_SEND | FI_WRITE | FI_READ);
Counters are defined such that they can be implemented either in hardware or in software, by layering over a hardware completion queue. Even when implemented in software, counter use can improve performance by reducing the amount of completion data that is reported. Additionally, providers may be able to optimize how a counter is updated, relative to an application counting the same type of events. For example, a provider may be able to compare the head and tail pointers of a queue to determine the total number of completions that are available, allowing a single write to update a counter, rather than repeatedly incrementing a counter variable once for each completion.
Most counter attributes are a subset of the CQ attributes:
struct fi_cntr_attr {
	enum fi_cntr_events events;
	enum fi_wait_obj    wait_obj;
	struct fid_wait     *wait_set;
	uint64_t            flags;
};
The sole exception is the events field, which must be set to FI_CNTR_EVENTS_COMP, indicating that completion events are being counted. (This field is defined for future extensibility.) A completion counter is updated according to the completion model that was selected by the endpoint. For example, if an endpoint is configured for transmit complete, the counter will not be updated until the transfer has been received by the target endpoint.
A completion counter is actually comprised of two different values. One represents the number of operations that complete successfully. The other indicates the number of operations which complete in error. Counters do not provide any additional information about the type of error, nor indicate which operation failed. Details of errors must be retrieved from a completion queue.
Reading a counter’s values is straightforward:
uint64_t fi_cntr_read(struct fid_cntr *cntr);
uint64_t fi_cntr_readerr(struct fid_cntr *cntr);