CCI has a few key concepts outlined below.
CCI is designed to work over a number of networks, either layered on an existing low-level API or implemented directly in the network hardware or its firmware. A transport is a loadable module that implements CCI for a given network. A single transport may manage one or more devices. We are developing transports for OpenFabrics Verbs for InfiniBand and RDMA over Converged Ethernet (RoCE), Cray General Network Interface (GNI) for the Gemini and Aries interconnects, and Linux Ethernet Direct for IP-bypass and reduced copies on Linux hosts using generic Ethernet NICs. At least one vendor is working on a hardware implementation of CCI as well.
A device represents a network interface card (NIC) or host channel adapter (HCA) that connects the host and network. A device may also be a bonding device that aggregates multiple physical devices from the same transport. A device may provide multiple endpoints (typically one per core or one per socket).
In CCI, an endpoint is the process's virtual instance of a device. The endpoint is the container for all the communication resources the process needs, including queues and shared send and receive buffers. A single endpoint may communicate with any number of peers and provides a single completion queue regardless of the number of peers. CCI achieves better scalability on very large HPC and data center deployments because the endpoint manages buffers independently of how many peers it is communicating with.
CCI uses an event model in which an application may either poll or wait for the next event. Events include communication (e.g. send completions) as well as connection handling (e.g. incoming client connection requests). The application may return an event to CCI out of order when it no longer needs the event. CCI achieves better scalability in time versus Sockets since all events are managed by a single completion queue.
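The event model described above can be sketched with a toy, in-memory model. This is not the CCI API: `ToyEndpoint`, `Event`, `post`, `get_event`, and `return_event` are hypothetical names chosen for illustration. The sketch only shows a single queue serving every peer, non-blocking polling, and events being handed back in any order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Event:
    kind: str   # e.g. "SEND", "RECV", "CONNECT_REQUEST" (illustrative labels)
    peer: str   # peer the event relates to
    returned: bool = False


class ToyEndpoint:
    """Hypothetical model: one completion queue shared by all peers."""

    def __init__(self) -> None:
        self._queue: Deque[Event] = deque()
        self.outstanding = 0  # events handed to the application, not yet returned

    def post(self, kind: str, peer: str) -> None:
        """Simulate the library generating an event."""
        self._queue.append(Event(kind, peer))

    def get_event(self) -> Optional[Event]:
        """Non-blocking poll: the next event, or None if the queue is empty."""
        if not self._queue:
            return None
        self.outstanding += 1
        return self._queue.popleft()

    def return_event(self, event: Event) -> None:
        """Hand an event back; any order is accepted."""
        if event.returned:
            raise ValueError("event was already returned")
        event.returned = True
        self.outstanding -= 1
```

In this model an application could take a connection request and a receive event from the same queue, finish with the receive first, and return it before the connection request. The endpoint tracks only how many events are outstanding, not their order.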
Unlike a socket, a new endpoint is bound to a specific device and is ready to receive incoming connection requests.
CCI uses connections to allow an application to choose the level of service that best fits its needs and to provide fault isolation should a peer stop responding. CCI does not, however, allocate buffers per peer, because per-peer buffers would reduce scalability.
CCI offers choices of reliability and ordering. Some applications such as distributed filesystems need to know that data has arrived at the peer. A health monitoring application, on the other hand, may want to send data that has a very short lifetime and does not want to expend any efforts on retransmission. CCI can accommodate both.
CCI provides Unreliable-Unordered (UU), Reliable-Unordered (RU), and Reliable-Ordered (RO) connections as well as UU with send multicast and UU with receive multicast. Most RPC-style applications do not require ordering and can take advantage of RU connections. When ordering is not required, CCI can take better advantage of network features such as multiple paths.
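The connection types listed above can be summarized as a small table of properties. The sketch below is illustrative only: `ConnType`, its member names, and the `choose` helper are hypothetical and are not CCI identifiers. It encodes the five types from the text and a simple rule for picking the least strict unicast type that meets an application's needs.

```python
from enum import Enum


class ConnType(Enum):
    # value = (reliable, ordered, multicast role or None)
    UU = (False, False, None)
    RU = (True, False, None)
    RO = (True, True, None)
    UU_MC_TX = (False, False, "send")
    UU_MC_RX = (False, False, "receive")

    @property
    def reliable(self) -> bool:
        return self.value[0]

    @property
    def ordered(self) -> bool:
        return self.value[1]


def choose(needs_delivery: bool, needs_ordering: bool) -> ConnType:
    """Pick the least strict unicast type that satisfies the request."""
    if needs_ordering:
        # The only ordered type described in the text is also reliable.
        return ConnType.RO
    return ConnType.RU if needs_delivery else ConnType.UU
```

Under this rule, an RPC-style application that needs delivery but not ordering gets `ConnType.RU`, while a health monitor sending short-lived data gets `ConnType.UU`.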
Also, with reliable connections, the send completion (whether for a small message or an RMA) indicates that the data has been delivered to the peer’s memory space. This differs from Sockets, where a send completes when the data has been buffered locally. The CCI send completion does not mean that the peer has processed the data, and it does not replace an application-level acknowledgement if one is needed.
CCI has two modes of communication: messages (MSG) and remote memory access (RMA).
CCI’s MSGs are small (typically MTU sized) messages. These are unexpected messages and the receiver does not need to post receives. Instead, the receiver’s CCI library will generate a receive event that includes a pointer to the data within the CCI library’s buffers. The application may inspect the data, copy it out if needed longer term, or even forward the data by passing it to the send function.
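The point about copying data out "if needed longer term" can be illustrated with a toy receiver. This is a sketch under assumptions, not CCI code: `ToyReceiver`, `deliver`, and `return_event` are hypothetical names, and the assumption that a buffer is recycled once its event is returned is part of the model rather than something stated in the text. A receive event here is a view into a library-owned buffer.

```python
from typing import List, Optional, Tuple


class ToyReceiver:
    """Hypothetical model of library-owned receive buffers."""

    def __init__(self, slots: int = 2, mtu: int = 64) -> None:
        self._mtu = mtu
        self._pool: List[bytearray] = [bytearray(mtu) for _ in range(slots)]
        self._free: List[int] = list(range(slots))

    def deliver(self, payload: bytes) -> Optional[Tuple[int, memoryview]]:
        """Simulate an arriving MSG; the application posted no receive."""
        if len(payload) > self._mtu:
            raise ValueError("payload exceeds the MSG size limit")
        if not self._free:
            return None  # no buffer available in this toy model
        slot = self._free.pop()
        self._pool[slot][:len(payload)] = payload
        # The event exposes a view into the library's buffer, not a copy.
        return slot, memoryview(self._pool[slot])[:len(payload)]

    def return_event(self, slot: int) -> None:
        """Give the buffer back; the library may now reuse it."""
        self._free.append(slot)
```

In this model, an application that wants to keep a message calls `bytes(view)` before returning the event. A view held past that point may later show the contents of a different message.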
When an application needs to move bulk data, CCI provides RMA. To use RMA, the application explicitly registers memory with CCI and receives a handle. The application can pass the handle to a peer using a MSG, and the peer can then perform an RMA Read or Write. The RMA may also include a Fence. RMA requires a reliable connection (ordered or unordered). An RMA may optionally include a remote completion message that will be delivered to the peer after the RMA completes. The completion message may be as large as a full MSG.
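The RMA flow above (register memory, obtain a handle, write into the peer's registered region, optionally deliver a completion message) can be sketched as a toy in-process model. All names here (`ToyRmaEndpoint`, `register`, `rma_write`, and the `reliable` flag) are hypothetical and do not reflect CCI's actual functions or signatures. The handle exchange that would happen over a MSG is simulated by passing the handle directly.

```python
import itertools
from typing import Dict, List, Optional


class ToyRmaEndpoint:
    """Hypothetical model of an endpoint with registered memory regions."""

    def __init__(self) -> None:
        self.regions: Dict[int, bytearray] = {}  # handle -> registered memory
        self._next_handle = itertools.count(1)
        self.events: List[bytes] = []  # completion messages delivered here

    def register(self, buf: bytearray) -> int:
        """Register memory explicitly and receive a handle for it."""
        handle = next(self._next_handle)
        self.regions[handle] = buf
        return handle


def rma_write(local: ToyRmaEndpoint, local_handle: int,
              remote: ToyRmaEndpoint, remote_handle: int,
              offset: int, length: int, reliable: bool = True,
              completion_msg: Optional[bytes] = None) -> None:
    """Copy `length` bytes from the local region into the remote region."""
    if not reliable:
        raise ValueError("RMA requires a reliable connection")
    src = local.regions[local_handle]
    dst = remote.regions[remote_handle]
    if length > len(src) or offset + length > len(dst):
        raise ValueError("transfer exceeds a registered region")
    dst[offset:offset + length] = src[:length]
    # The optional completion message is delivered only after the data lands.
    if completion_msg is not None:
        remote.events.append(completion_msg)
```

A write that carries a completion message lets the target learn that the bulk data has landed without polling the memory itself. In this model it simply appears in the target's `events` list after the copy.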