In our last article [1], we discussed the concept of a perpetual server handling multiple concurrent clients and explored several design and implementation choices. These included creating one child process (or thread) per client connection, pre-creating a pool of child processes (or threads) and assigning each new TCP connection to a free one, and a single threaded server using the socket API call select() [2][3]. We briefly discussed the latter approach, in which the process maintains a list (array) of open sockets and, on each iteration, acts on whichever sockets are active, for example when data is available to read from a socket, or when a socket has enough buffer space for data to be written into it (which the network stack will transmit subsequently).
The approach of using select() performs better than the approaches that create multiple threads or processes. As discussed in [1], one of the limitations of select() is that it is generally limited to handling 1024 concurrent connections. This number turns out to be quite insufficient today, as modern servers need to deal with many more than 1024 concurrent connections [4]. There are other inefficiencies when using select(), especially on account of frequent switching between user space and kernel space execution of a process. In this article we focus on these limitations of select() and on the alternatives available (such as poll() [5] and epoll() [6]) to improve upon its performance.
Network (socket) programming was first implemented on BSD Unix in C, which was the primary system development language at the time. Thus, the socket API implementation and man pages follow C function specifications. To understand the performance issues associated with select(), let us first understand how it works. The man page for select() specifies its usage as:
int select(int nfds,
           fd_set *readfds,
           fd_set *writefds,
           fd_set *exceptfds,
           struct timeval *timeout);
The first argument nfds is one more than the highest-numbered file descriptor (fd) across the three fd sets that follow. In Unix, every I/O is performed via an fd (whether terminal I/O, disk I/O, network I/O, etc.). The next three parameters are sets of file descriptors represented as bit values in a bit vector. This bit vector is 1024 bits in size (by default), and changing it may require recompiling the Linux kernel with a higher value. Each fd is represented by a bit value of 1 at the bit position equal to the value of the fd. For example, if the fd value is 100, then bit number 100 corresponds to this fd. Thus, a process needs to set the respective bits for all those sockets over which it is connected via TCP to other processes. For example, consider a server process that has established connections with 10 clients (C1 to C10), all of which are communicating. Initially, the fds corresponding to C1, C2, ..., C10 would in general be 3, 4, ..., 12, since fd numbers 0, 1 and 2 are used for standard input, standard output and standard error. Now consider that clients C2, C6, C7 and C9 have closed their connections, so the remaining 6 connected clients are C1, C3, C4, C5, C8 and C10. The active fd numbers for this server process would then be 3 (C1), 5 (C3), 6 (C4), 7 (C5), 10 (C8) and 12 (C10). For these open sockets, the fdset value 0x14E8 (0b1010011101000) would be used in the select() call, since the bits having value 1 correspond to bit positions 3, 5, 6, 7, 10 and 12. The invoking process passes the address (reference) of this fdset to the select() call, informing the kernel about the fds in which the process is interested, i.e. on which it would like to process data. Now suppose that only clients C5 and C10 have sent data and the other clients C1, C3, C4 and C8 are idle. Data from C5 and C10 needs to be read and acted upon by the server process, so the kernel needs to convey only these sockets of interest when select() returns. The kernel implements this by retaining the corresponding bits in the fdset (i.e. bit numbers 7 and 12) and resetting all other bits. Thus, upon return the fdset vector (which is passed by reference to select()) is modified, and select() returns the fdset value 0x1080 (representing fds 7 and 12 for clients C5 and C10). The calling process then acts upon these sockets, reads (or writes) data on these two fds, and responds to clients C5 and C10.
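To make the mapping from fd values to bit positions concrete, the following small standalone sketch (illustrative only, not taken from the article's code repository [7]) builds the fdset of the example above with FD_ZERO/FD_SET and then lists the set bits using FD_ISSET; running it prints bit positions 3, 5, 6, 7, 10 and 12, i.e. the value 0x14E8 discussed above.

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    int open_fds[] = {3, 5, 6, 7, 10, 12};   /* fds of C1, C3, C4, C5, C8, C10 */
    int count = sizeof(open_fds) / sizeof(open_fds[0]);
    fd_set readfds;

    FD_ZERO(&readfds);                       /* clear all FD_SETSIZE (1024) bits */
    for (int i = 0; i < count; i++)
        FD_SET(open_fds[i], &readfds);       /* set the bit at position fd */

    for (int fd = 0; fd < FD_SETSIZE; fd++)  /* list the bit positions that are set */
        if (FD_ISSET(fd, &readfds))
            printf("bit %d is set\n", fd);
    return 0;
}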
With this basic understanding of select(), let us analyse it from the performance perspective. For illustration purposes, we will consider the 2nd parameter readfds, which corresponds to the list of fds on which the process can receive data; the same analysis applies to writefds and exceptfds. Additionally, the fdset readfds also contains one special fd corresponding to the listening socket, which is used to accept a new connection from a client. When the server process invokes the select() call, it sets the respective bit in readfds for each open socket fd. While the process code executes in user space, the select() call is executed in kernel space: the fdset buffer is copied into a kernel buffer and scanned during kernel space execution so that the underlying network implementation can identify the fds that need to be checked for any activity, i.e. whether data has been received from a client process. Then, for each of those fds on which some event has occurred (e.g. data has been received from the client), the kernel sets the corresponding bit and resets the remaining bits in readfds. The kernel achieves this by monitoring the state of each fd. Thus, until the data on an fd has been fully read by the server process, the kernel will always set the bit for that active fd (i.e. one which has data in the kernel buffer that the process needs to read) on return from select(). This can be better understood with an example. Consider a server connected to N clients (C1, C2, ..., CN). Client C1 sends N1 bytes at time T1 on the socket with fd value x, and client C2 sends N2 bytes at time T2 (>T1) on the socket with fd value y. The server process invokes select() at time T3 (>T2) with the bits set for all socket fds corresponding to these N clients, including bit numbers x and y. Upon execution, select() will return an fdset with bit value 1 for both fds x and y, and all other N-2 bits as zero. Now suppose the server process performs a read operation on socket x and reads all N1 bytes, performs another read on fd y but reads only M2 (<N2) bytes, and then invokes select() again. The kernel knows that there are still (N2-M2) bytes yet to be read on socket fd y, and thus it will return the yth bit as 1 in readfds and all other bits as 0.
When the returned readfds contains the bit corresponding to the listening socket, it implies a new connection; the process should then execute accept() to establish a new TCP connection, which results in a new socket fd that should be added to the list of open fds. Similarly, when an existing connection is closed, the corresponding bit in the fdset should be cleared (set to 0).
The 3rd parameter of select() corresponds to writable fds, i.e. it contains the fds to which the program would like to send (write) data. The program should write data only when the socket is writable, i.e. when memory buffer space is available in the kernel networking stack for the specified socket. The buffer space may become full when the client at the other end of the socket is not reading any data (manifested as the TCP Zero Window phenomenon). In such a situation, if the program writes data to the socket, the program will be blocked by the kernel until some buffer space becomes available. This blocking of the process would have repercussions on communication with other client applications, thereby impacting performance.
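As an illustration, the following minimal sketch (the helper name is ours, and it assumes sockfd is an already connected TCP socket) consults writefds with a zero timeout so that send() is attempted only when the kernel reports available buffer space, avoiding the blocking described above.

#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns bytes sent, 0 if the socket is not writable right now, -1 on error. */
ssize_t send_if_writable(int sockfd, const char *buf, size_t len)
{
    fd_set wset;
    struct timeval tv = {0, 0};              /* zero timeout: just poll, never wait */
    int ready;

    FD_ZERO(&wset);
    FD_SET(sockfd, &wset);
    ready = select(sockfd + 1, NULL, &wset, NULL, &tv);
    if (ready < 0)
        return -1;                           /* select() itself failed */
    if (ready == 0 || !FD_ISSET(sockfd, &wset))
        return 0;                            /* no buffer space now, try again later */
    return send(sockfd, buf, len, 0);        /* safe: kernel has room for some data */
}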
Similarly, the 4th parameter of select() corresponds to exceptional fds, which generally consists of all open fds being used by the program. When the network stack encounters an error on an fd, for example when the other end suddenly closes the connection and the socket is closed by the network stack, this is indicated by setting the corresponding bit in the exceptional fdset exceptfds. This fdset provides a mechanism for the underlying network stack to notify the invoking application about exception/error events, other than read and write events, that occur on a socket.
Table 1 : Snippets of socket program using select()
01:  fd_set rset;     /* master read fdset */
02:  fd_set wset;     /* master write fdset */
03:  fd_set wk_rset;  /* working read set to be passed to select() */
04:  fd_set wk_wset;  /* working write set to be passed to select() */
05:  int listenfd;
06:  int connfd;
07:  int max_fd;
08:
09:  listenfd = socket(AF_INET, SOCK_STREAM, 0);
10:  bind(listenfd, (struct sockaddr *) &servaddr, sizeof(servaddr));
11:  listen(listenfd, LISTENQ);
12:  FD_ZERO(&rset); FD_ZERO(&wset); FD_SET(listenfd, &rset);
13:  max_fd = listenfd + 1;
14:  cnt = select(max_fd, &wk_rset, &wk_wset, NULL, &timeout);
15:  if (FD_ISSET(listenfd, &wk_rset)) {
16:      connfd = accept(listenfd, (struct sockaddr *) &cliaddr, &clilen);
17:      FD_SET(connfd, &rset); FD_SET(connfd, &wset);
18:      max_fd = (max_fd <= connfd) ? connfd + 1 : max_fd;
19:  }
20:  for (int ifd = listenfd+1; ifd < max_fd; ifd++) {
21:      if (FD_ISSET(ifd, &wk_rset)) {      /* data is available for read */
22:          recv(ifd, buf, sizeof(buf), 0); /* read the data received */
23:          /* process the data and send the response */
24:      }
25:  }
A typical sample snippet of a socket program in C is shown in Table 1, and its full version is available as tcpserver_select.c [7]. Lines 01-02 declare the master fdsets rset (read) and wset (write), and lines 03-04 the working copies wk_rset and wk_wset, which are passed to select() and updated upon its return. Each time select() is invoked, rset/wset is copied into wk_rset/wk_wset (not shown in the table; a sketch of this loop is given below). Line 05 declares the fd for listening and accepting new connections from clients. Each time a new connection is accepted, it gets its own fd, which is temporarily stored in connfd (line 06). Lines 09-11 are typical of any server side socket program and let the underlying operating system know that the server is ready to accept connections from clients. Line 14 shows the usage of select(). Line 15 checks for the arrival of a new connection; if one has arrived, it is accepted at line 16 and the new file descriptor connfd is added to the master fdsets at line 17. Line 18 updates the maximum fd value, which determines how far the fds need to be scanned. Lines 20-25 iterate over each fd value, check whether the corresponding bit is set upon return from select() (line 21), and if so read and process the data (lines 22-23). Lines 14-25 are run forever in a loop (not shown here).
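The copy from the master sets to the working sets, which Table 1 omits, would look roughly as follows. This is only a sketch of the event loop around line 14, using the variables declared in Table 1; the 10 second timeout is an arbitrary example.

for ( ; ; ) {
    wk_rset = rset;                      /* struct copy of the master read set  */
    wk_wset = wset;                      /* struct copy of the master write set */
    struct timeval timeout = {10, 0};    /* re-initialise: Linux select() may modify it */
    cnt = select(max_fd, &wk_rset, &wk_wset, NULL, &timeout);
    if (cnt <= 0)
        continue;                        /* timeout or interruption: just retry */
    /* ... lines 15-25 of Table 1 process wk_rset and wk_wset here ... */
}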
With the above description of the basic operation of the select() call, we analyze its performance related aspects below.
Number of concurrent connections. Today's internet servers need to serve a large number of concurrently connected clients, and the number of such connections easily exceeds a thousand. A typical client, e.g. a mobile phone or even a browser on a laptop, keeps its TCP connection to a website alive even though it transmits data only intermittently. For example, after a user has accessed a URL, the user may take a while to read the content returned, and during this reading time no data transmission takes place (neither by the client nor by the server). In a typical scenario, out of a large number of live TCP connections, only a few are transmitting data at any point of time and the other connections remain idle. However, the fdset bit vector has only 1024 bits, which means that a server can connect to at most 1021 (= 1024 - 3, for stdin, stdout and stderr) clients. This puts a severe constraint on a server process that has to serve a large number of clients, and thus an alternative to select() is needed.
fdset scanning overhead. Each time select() is invoked, the kernel scans the fdset bit vector to check for active sockets associated with the invoking process. An active socket is one which has seen some activity and requires processing by the invoking program. For each such active socket, the kernel sets the respective bit and resets all other bits (corresponding to inactive sockets), and returns the bit vector. Upon return, the invoking process must scan each bit of the fdset to identify the active socket fds and process them. Thus, there is always an overhead of scanning this fdset bit vector twice (once by the kernel and once by the invoking process). Since this is a continuous operation carried out forever, it degrades performance.
To comprehend this scanning overhead, consider a scenario where a server is connected to a large number of concurrent clients, for example 1000. Assume that client number 1000 is an active client with a long duration interaction, i.e. it continues to interact periodically with the server process for a long time. Thus, bit number 1000 of readfds will be set whenever select() is invoked. Further, assume that most of the other clients have completed their interaction with the server, only a few clients continue to communicate, and when new clients arrive they get socket fds with small values, e.g. 10. Thus, when the server process invokes select(), it sets bit 10 and bit 1000 every time in the fdset, and the value of nfds (the first parameter) is 1001. The kernel therefore needs to scan the entire bit vector from 1 to 1000 even though only a few fds are of interest to the server. Assume that socket fd 1000 (corresponding to the long duration interactive client) always has data, even though it may be just a few bytes. In a typical implementation, the server process then scans the fdset again from the beginning (bit 1) to the end (bit 1000) to find that there is data on socket 1000. Thus, with such a skewed distribution of socket fds, the scanning does a lot of redundant work, thereby impacting performance.
Further, since the fdset is modified by select() upon return, the invoking process must maintain a master copy of the fdsets of all open fds, e.g. rset and wset (lines 01-02 in Table 1). Using this master copy, the invoking process must create a working (or temporary) copy, e.g. wk_rset and wk_wset, which is passed to select(). This is an extra overhead for the invoking process: it has to maintain a master copy of all the sets and perform one more buffer copy operation on each invocation of select().
State based approach. The kernel maintains the state of each open socket associated with a process, i.e. whether it has an event of interest (e.g. data to be read, or buffer space available for writing), irrespective of previous invocations of select(). Whenever select() is invoked by a process, the kernel returns the fdset value based on the state of each socket. For example, whenever data is received on a socket fd, the kernel keeps track of whether the process has read the complete data, and it maintains this state of the socket at all times. Thus, even if a socket has been notified earlier (via setting the bit corresponding to the respective socket fd) but the invoking process has not fully read the data, the kernel uses this state information to notify the process again. Maintaining this state for each socket is a significant overhead for the kernel.
Context switching between user space and kernel space [8][9]. The last parameter of select() specifies the timeout after which the call must return when no socket is ready with an event; if any socket is active, select() returns immediately. When select() is invoked, its execution happens in kernel space (until the timeout occurs): the kernel iterates over the fdset and, for each socket fd, registers callbacks for event notification [12]. Whenever an event occurs, such as data being received on a socket, it iterates over the fdset again to deregister the callbacks. This results in significant work for the kernel, making it a heavyweight operation. Typically, the moment an event occurs, the select() call returns and execution re-enters user space. The process then has to iterate over the fdset all over again to identify the active socket fds. After processing the active sockets, the process invokes select() again, and this cycle repeats forever. Thus, for all practical purposes, each event on a network socket causes a switch between user space and kernel space, a costly operation.
The program samples described here (Table 1, with full working code given in [7]) are written in C (socket programming was first implemented in C). Our previous article [1] on socket programming used Python as the programming language for simplicity of understanding [10]. Python supports the select() API with the same semantics as C, but makes it a bit easier from a programming perspective: the Python select() method returns three lists, one each for the read, write and exception fds that are of interest to the invoking program.
A sample Python snippet is given in Table 2, with full working code available at [7]. At first glance this implementation looks efficient, since the invoking program does not scan the sockets to check which are ready to be processed. However, this is just a convenience provided by Python, which implements select() as a wrapper over the underlying Unix select(). Internally it scans over all the fds and returns a list of objects for each fdset corresponding to read, write and exception. Thus, though the scanning of fds is hidden from the invoking program, it is performed internally by the Python select module, and all the limitations of the original select() apply to Python as well.
Table 2: Using select with Python
01:  lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
02:  lsock.bind((ip_addr, port))
03:  lsock.listen(1)
04:
05:  csocks = []   # list of active client sockets
06:  while True:
07:      (rl, wl, el) = select.select([lsock] + csocks, [], csocks, timeout)
08:      for rsock in rl:
09:          if rsock is lsock:   # a new connection has arrived
10:              newsock, client = lsock.accept()
11:              csocks.append(newsock)
12:          else:                # data is available to read on an existing connection
13:              message = rsock.recv(buffer)
14:              rsock.send(message.decode().upper().encode())
15:
16:      for errsock in el:
17:          errsock.close()
To address the limitation on the number of socket fds and to avoid maintaining a separate master copy, the poll() socket API was introduced. The usage of poll() is defined as:
struct pollfd {
    int   fd;
    short events;
    short revents;
};

int poll(struct pollfd filedes[], unsigned int nfds, int timeout);
The first parameter to poll() is an array of open fds, and poll() does not impose any limit on the size of this array. Thus, a process can have any number of socket fds concurrently open and acted upon. The first benefit of poll() over select() is therefore the removal of the limit on the number of open fds, which becomes useful when a server program needs to communicate concurrently with a large number of clients.
In the pollfd structure, the 2nd and 3rd fields are events and revents. The field events specifies all the events the invoking program is interested in, such as read, write, exception, and even other events that are not supported by select(). Examples of other events are high priority data to read (corresponding to TCP urgent data in the TCP header [11]), invalid file descriptors, etc. The field revents is the value returned by the kernel specifying the events that have actually occurred. Thus, upon execution, poll() does not modify (or destroy) the original event list or socket fds, and the invoking program does not have to maintain a separate master copy of open fds. Hence, every time poll() is invoked, no buffer copy from a master copy to a temporary copy is required. However, the onus of checking the occurrence of events for each fd still lies with the invoking program: it must check each fd in the array to identify which events have occurred, and process all the fds where events (e.g. data available on the socket to be read) have occurred.
Table 3: Snippets of socket program using poll()
01:  struct pollfd client[MAX_CLIENTS];
02:  int listenfd, connfd, maxfd;
03:  struct sockaddr_in cliaddr;
04:  listenfd = socket(AF_INET, SOCK_STREAM, 0);
05:  bind(listenfd, (struct sockaddr *) &servaddr, sizeof(servaddr));
06:  listen(listenfd, 5);
07:
08:  client[0].fd = listenfd;
09:  client[0].events = POLLIN;
10:  maxfd = 0;
11:
12:  for ( ; ; ) {
13:      cnt = poll(client, maxfd + 1, POLL_TIMEOUT);
14:      if (client[0].revents & POLLIN) {   /* new connection request */
15:          maxfd += 1;
16:          connfd = accept(listenfd, (struct sockaddr *) &cliaddr, &clilen);
17:          client[maxfd].fd = connfd;
18:          client[maxfd].events = POLLIN;
19:      }
20:
21:      for (ii = 1; ii <= maxfd; ii++) {
22:          if (client[ii].revents & POLLIN) {
23:              recv(client[ii].fd, buf, sizeof(buf), 0);
24:              /* process client data */
25:          }
26:      }
27:  }
A sample program snippet using poll() is shown in Table 3, with full code available at [7]. Line 01 declares the array of client fds to be passed to poll(). After creating a listening socket in lines 04-06, it is registered with the event POLLIN (lines 08-09) so that the program is notified whenever a new connection arrives. The server program runs forever between lines 12-27. The number of valid entries in the array is determined by maxfd, and the first entry of the pollfd array points to the listening socket; all other array elements contain entries corresponding to TCP connections with clients. The program invokes poll() at line 13 and first checks whether a new connection has arrived (line 14). If so, it is accepted (line 16) and the fd of the new socket is added to the array (lines 15-18). In lines 21-26, a check is made for each open socket whether there is data to be read, and if so it is read and processed at lines 23-24. The program runs forever. In this sample code we have not shown the use of the POLLOUT event (to know whether socket buffer space is available for writing) or the POLLERR event (to determine error conditions); in actual code these events need to be processed as required.
Though the poll() interface mainly addresses the issue of the limited number of open sockets and avoids the double copying, the other issues of select() still persist. The problem of scanning the fd array remains: the invoking program must iterate over all the fds in the array and check each one, which is a significant overhead, especially if the array is large (i.e. a large number of open socket fds) but events occur on only a few of them. The scanning overhead is reduced a bit, however. In select(), the fdset size is fixed (1024 bits) and scanning happens up to the highest numbered socket fd, whereas in poll() the size of the pollfd array corresponds to the number of open fds, so only those fds which are open are scanned. Essentially, though, poll() has the same scaling problem as select(), since scanning the fds requires work proportional to the number of open fds rather than the number of active fds (the fds of interest).
Though one does not need to maintain a master copy of the fds, and copying from a master copy to a temporary copy is thus avoided, the copy from the user buffer to the kernel buffer still happens. The amount of data copied may actually exceed that for select(): in poll(), each fd requires 64 bits (the size of struct pollfd), compared to only 3 bits (1 bit each for read, write and exception) in select(). Thus, if the number of open sockets is more than 48 (= 3 x 1024 / 64), poll() involves more copying than select(). Another common mistake made by a typical developer concerns closed fds: the developer sets the events field to 0 (implying no further interest in any events since the fd is closed). However, the kernel still examines such an entry and flags the invalid descriptor (POLLNVAL in revents), which requires unnecessary processing of this fd. The best way to handle a closed fd is to remove it from the array itself; the simplest way is to swap this array entry with the last entry of the array and decrease the array size by 1, as sketched below.
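A minimal sketch of this clean-up (the helper name and parameters are ours, not part of tcpserver_poll.c):

#include <poll.h>
#include <unistd.h>

/* Remove the entry at index idx from the pollfd array by overwriting it
 * with the last entry and shrinking the array by one. */
void remove_pollfd(struct pollfd *fds, int *nfds, int idx)
{
    close(fds[idx].fd);                  /* release the descriptor            */
    fds[idx] = fds[*nfds - 1];           /* move the last entry into the hole */
    (*nfds)--;                           /* one fewer entry for poll() to scan */
}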
Similar to select(), the implementation of poll() is also state based, i.e. the kernel must maintain the state of each socket with respect to event handling [8]. Thus, if an event for a socket was notified in a previous invocation of poll() and has not been fully acted upon by the process, the kernel will notify the invoking process about the event again on the next invocation, since it maintains the socket state. Similarly, the switching of program execution between user space and kernel space continues to impact system performance.
Therefore, from an efficiency and scaling perspective, it is desirable that the invoking program directly gets the list of fds of interest, i.e. those fds which are ready for processing. Both select() and poll() follow a state based approach [8], which requires the kernel to maintain the state of each fd and to provide the response after checking the state of each fd. The kernel I/O subsystem, however, is inherently designed around events, such as the arrival of data packets, the closing of a socket at the other end, or a new connection request. In the implementation of poll() and select(), the kernel therefore needs to transform information from the event based view to the state based view and vice versa.
Thus, it is desirable to have a socket API that uses an event based approach instead of a state based approach. Linux supports such an API, epoll [6], which implements the event based approach.
In this approach, once an event occurs and the kernel notifies the process about it, the kernel's responsibility is over. After a process is notified about the occurrence of an event, it can process the event immediately or make a note of it to be processed later. Further, the occurrence of events is monotonic in the sense that once an event occurs, the amount of information to be processed (e.g. the amount of data to be read, or the amount of buffer space available for writing) never decreases. The main advantage of this approach is that the work done by the kernel between two successive invocations of epoll is proportional to the number of events that have occurred during that interval, and independent of the number of open socket fds the process is using to serve its clients.
The event based notification approach poses its own challenges. Consider that between times T1 and T2, N events have occurred on a single socket fd (e.g. the other end has sent data N times during this interval). There are two choices for the kernel to convey these events. One choice is to inform the process about the occurrence of each event (i.e. each time data is received). The other choice is to inform the process only once, about the first event, and treat the remaining N-1 events as coalesced into the first; the information about the subsequent N-1 events is not needed until the process starts acting upon the first event. The former approach would cause too much work for the process involved, so in general the latter is preferred. Another aspect to consider is whether these event notifications should be asynchronous or synchronous. Asynchronous notification means that the kernel informs the process whenever an event occurs, interrupting the process from whatever task it is doing; its implementation would involve a signal based mechanism. Synchronous notification means that the kernel informs the process only when it asks, so the event is stored by the kernel until the process asks for it. From an implementation perspective, programming with the latter approach is simpler. Further, in general, until a program processes the first event, it does not really care about the subsequent events that occur after the first event but before the program begins to process it.
The new socket API that supports such an implementation is the epoll mechanism, introduced in Linux 2.6 [6]. The benefit is that there is no linear scan of the fds by either the user program or the kernel, and there is no limit on the number of socket fds that can be used, which enables a process to serve a large number (>1024) of concurrent clients. The usage of the epoll mechanism is given in terms of 3 APIs [13][14][15] as below.
typedef union epoll_data {
    void     *ptr;   /* for user data */
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;  /* epoll events */
    epoll_data_t data;    /* user data variable */
};

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
The first API, epoll_create(), creates an epoll instance and returns an fd (e.g. epfd) which is to be used in all subsequent invocations of the epoll interface. The 2nd API, epoll_ctl(), tells the epoll mechanism which fds are of interest to the process; it enables a process to add new fds for monitoring, remove a closed fd, and modify an fd for new events of interest (specified via struct epoll_event). The information about events of interest is provided as a bit mask, e.g. EPOLLIN for read, EPOLLOUT for write, EPOLLERR to detect error conditions, EPOLLET for edge triggered behaviour (described below), etc. The 3rd API, epoll_wait(), returns the count of ready descriptors and fills the events array with only those fds on which events have occurred. Thus, the invoking process does not scan every fd, but instead gets a list of only those fds which need to be processed.
The epoll interface introduces the concepts of level triggered (the default) and edge triggered (EPOLLET) events. The former is kept for compatibility and is essentially a faster version of poll(); level triggered mode is used whenever a program would like to retain the semantics of poll(). The edge triggered interface (EPOLLET) implements the event driven mechanism, i.e. the kernel notifies the program of an event only once. To understand this, consider the example of a simple client/server communication on a socket fd. Suppose the client sends 1000 bytes of data, which generates one read event for the server. When the server invokes epoll_wait(), it returns this fd for reading. Now suppose the server reads only 500 bytes and decides to postpone reading the remaining data until later. When the server invokes epoll_wait() again, there is no event for this fd: the call blocks, since the kernel waits for the next event to occur (new data to be received) even though data is still available on the socket for the program to read. This is because the kernel is not using the state based approach. To avoid the server program getting blocked in this way, the server must make the fd non-blocking. With a non-blocking socket, a read on a socket which has no data returns the error EAGAIN. Thus, when using edge triggered epoll_wait(), the process should keep reading from a ready socket until the read returns EAGAIN before waiting again, as sketched below. The epoll implementation also supports more complex behaviour, such as a single notification for multiple events (when a client has sent multiple chunks of data resulting in multiple events); this one-shot notification can be specified with the event type EPOLLONESHOT, which we will not discuss here, and we suggest the reader explore it after gaining familiarity with the basics of the epoll interface.
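The following minimal sketch (helper names are ours and error handling is omitted) shows the two steps described above for edge triggered operation: make the accepted socket non-blocking and register it with EPOLLET, and on each notification keep reading until recv() fails with EAGAIN/EWOULDBLOCK.

#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Register an accepted connection for edge triggered read notifications. */
void add_edge_triggered(int epfd, int connfd)
{
    struct epoll_event ev;
    fcntl(connfd, F_SETFL, fcntl(connfd, F_GETFL, 0) | O_NONBLOCK);
    ev.data.fd = connfd;
    ev.events = EPOLLIN | EPOLLET;           /* notify only when new data arrives */
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}

/* On an EPOLLIN notification, consume everything currently available. */
void drain_socket(int connfd)
{
    char buf[4096];
    for ( ; ; ) {
        ssize_t n = recv(connfd, buf, sizeof(buf), 0);
        if (n > 0)
            continue;                        /* process buf[0..n-1] here */
        if (n == 0)
            break;                           /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                           /* all available data has been read */
        break;                               /* some other error occurred */
    }
}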
Table 4: Snippets of socket program using new epoll()
01:  #define MAX_EVENTS 100                  /* max number of active events */
02:  int efd;                                /* epoll instance fd */
03:  struct epoll_event event;               /* to register events for a new fd */
04:  struct epoll_event revents[MAX_EVENTS]; /* events returned by epoll_wait */
05:  listenfd = socket(AF_INET, SOCK_STREAM, 0);
06:  bind(listenfd, (struct sockaddr *) &servaddr, sizeof(servaddr));
07:  listen(listenfd, 5);
08:
09:  efd = epoll_create(MAX_EVENTS);
10:  event.data.fd = listenfd;
11:  event.events = EPOLLIN;                 /* by default, level triggered */
12:  status = epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &event);
13:
14:  for ( ; ; ) {
15:      cnt = epoll_wait(efd, revents, MAX_EVENTS, -1);
16:      for (ii = 0; ii < cnt; ii++) {
17:          if (revents[ii].data.fd == listenfd) {
18:              connfd = accept(listenfd, (struct sockaddr *) &cliaddr, &clilen);
19:              event.data.fd = connfd;
20:              event.events = EPOLLIN;
21:              status = epoll_ctl(efd, EPOLL_CTL_ADD, connfd, &event);
22:          } else {      /* data from a client on an existing connection */
23:              recv(revents[ii].data.fd, buf, sizeof(buf), 0);
24:          }             /* end if-else */
25:      }                 /* end for ii */
26:  }                     /* end for(;;) */
A sample program snippet using the epoll interface APIs is shown in Table 4, with the full code tcpserver_epoll.c available at [7]. For brevity, only those variables specific to the use of epoll have been declared. The epoll instance is created at line 09. The listening socket used to accept new connections is added to the list of monitored fds at lines 10-12. The epoll_ctl() call for an fd needs to be done only once, covering all the events of interest; it has to be repeated only when the events of interest change, for example when using non-blocking sockets and the process defers reading the complete data. In this code snippet only the EPOLLIN event (for read) is added; in a real program the write event (EPOLLOUT) and edge triggered mode (EPOLLET) also need to be used, as sketched below. The program runs forever between lines 14-26. The fds of interest (on which some events have occurred) are returned by epoll_wait() at line 15. A new TCP connection is accepted and added to the set of monitored sockets at lines 17-21. For existing TCP connections, whenever a socket fd receives an event (such as when the respective client sends data), the data is read at lines 22-24 and processed as required.
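As an illustration of handling writes, the following minimal sketch (the helper name is ours, not part of tcpserver_epoll.c) uses EPOLL_CTL_MOD to watch for EPOLLOUT only while the program has pending data that an earlier send() could not flush, and switches back to EPOLLIN once that data has been written.

#include <sys/epoll.h>

/* Enable or disable write notifications for an already registered fd. */
void watch_for_write(int epfd, int connfd, int want_write)
{
    struct epoll_event ev;
    ev.data.fd = connfd;
    ev.events = want_write ? (EPOLLIN | EPOLLOUT) : EPOLLIN;
    epoll_ctl(epfd, EPOLL_CTL_MOD, connfd, &ev);  /* modify the existing registration */
}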
We have discussed the evolution of socket programming, starting from a simple client/server program [1], evolving to a server model handling multiple concurrent clients using multiple threads/processes, and finally to a single threaded server process using select(), poll() and epoll(). Using select() is good enough for a server program when the number of concurrent clients is limited to a few hundred. If the server program has to serve a large number of concurrent clients [4], then poll() can be used in place of select() with the same program semantics, making it an easy switch. However, when efficiency and service response time play an important role, epoll is the best choice. This requires the developer to be familiar with non-blocking sockets, which makes programming a bit more complex. Using the epoll interface without non-blocking sockets would most likely result in unsatisfactory performance, since communication on one socket fd may cause the process to block, resulting in no response or a delayed response on the other sockets (clients).
Thus far we have discussed the transport layer covering TCP and UDP and provided a number of programming scenarios to enable the experiential learning of transport layer. In our subsequent articles, we will explore network layer, primarily Internet Protocol (IP), its addressing scheme, subnetting and routing and continue the same approach of learning with simple real practical examples.
To gain an experiential understanding of select(), poll() and epoll(), working versions of server programs using each of these APIs are available at [7] as tcpserver_select.c, tcpserver_poll.c and tcpserver_epoll.c respectively. These programs are written for Ubuntu Linux and can run on Ubuntu 14.04, 16.04 and 18.04 LTS (Long Term Support) versions. The client can be the simple netcat (nc) utility or a Python or C program. A working C client, tcpclient.c, is also available at [7]; it supports many configuration parameters, such as the number of data packets to send, the size of each data packet and the interval between successive packets. Using combinations of these parameters, this program can be tuned to behave as different types of clients and helps us study and understand the behaviour of the select(), poll() and epoll() APIs. Below, we describe some simple experiential exercises to develop a better understanding of these socket APIs.
Table 5: Compiling C programs
$ gcc -o select tcpserver_select.c      # build the select() based server
$ gcc -o poll tcpserver_poll.c          # build the poll() based server
$ gcc -o epoll tcpserver_epoll.c        # build the epoll() based server
$ gcc -o tcpclient tcpclient.c          # build the client
$ ./select -s 0.0.0.0 -p 9999 -t 10     # run the select() server on port 9999
Note: All the C programs need to be compiled before they can be executed. The default IP address and port number for the server programs are 0.0.0.0 and 9999, but different values can be specified when running the programs. A simple example of compiling and running these programs is shown in Table 5.
Topic: TCP Server using select() to serve a few clients
Invoke the server program select (tcpserver_select.c)
$ ./select -s 10.211.55.11 -p 9999 -t 10
Invoke 5 instances of the client program as below, in 5 separate terminal windows.
./tcpclient -s 10.211.55.11 -p 9999 -c 50 -i 5000 -d 1
./tcpclient -s 10.211.55.11 -p 9999 -c 10 -i 1000 -d 2
./tcpclient -s 10.211.55.11 -p 9999 -c 10 -i 1000 -d 3
./tcpclient -s 10.211.55.11 -p 9999 -c 60 -i 3000 -d 4
./tcpclient -s 10.211.55.11 -p 9999 -c 10 -i 1000 -d 5
The 1st and 4th clients run for 250 s and 180 s respectively, whereas the other 3 clients (2nd, 3rd and 5th) run for 10 seconds. The server simply displays each new connection as it is accepted.
Learning: Working of socket select() to serve multiple clients.
Topic: TCP Server using select() to demonstrate the overhead of scanning the full fdset bit vector
Invoke the server program select (tcpserver_select.c)
$ ./select -s 10.211.55.11 -p 9999 -t 10
Invoke about 500 client programs such that client #1, client #11 and client #511 run for a longer duration (e.g. 10 minutes) and the other clients run for a shorter duration (well under a minute). An example invocation creating clients with this behaviour is given below:
# client #1: gets server side socket fd 4 (the listening socket is fd 3)
./tcpclient -s 10.211.55.11 -p 9999 -c 300 -d 1 -i 2000 &
# clients #2 to #10: get server side socket fds 5 to 13
i=1;
while [ $i -lt 10 ]; do
    (./tcpclient -s 10.211.55.11 -p 9999 -c 20 -d 2 -i 1000 &);
    i=$(($i+1));
done
# client #11: gets server side socket fd 14
./tcpclient -s 10.211.55.11 -p 9999 -c 300 -d 3 -i 2000 &
# clients #12 to #510: get server side socket fds 15 to 513
i=1;
while [ $i -lt 500 ]; do
    (./tcpclient -s 10.211.55.11 -p 9999 -c 20 -d 2 -i 1000 &);
    i=$(($i+1));
done
# client #511: gets server side socket fd 514
./tcpclient -s 10.211.55.11 -p 9999 -c 300 -d 4 -i 2000 &
Clients #1, #11 and #511 send 300 messages at an interval of 2 s, whereas the other 500+ clients send 20 messages at an interval of 1 s.
After about 24 seconds, only 3 clients will remain live, with their corresponding fds on the server side being 4, 14 and 514.
The code if (FD_ISSET(ifd, &wk_rset)) at line 150 in tcpserver_select.c is executed about 500 times before the fd of client #511 (fd 514) is found to be active, indicating the wasteful scanning of the fdset. The code can be modified to count how many times this condition is executed before an active fd is found.
Learning: Understanding overhead of scanning fdset in select() when some fd number has a high value
Topic: TCP Server using select() to reach the limit of concurrent clients
Invoke the server program ./select (tcpserver_select.c)
Invoke the client program a large number of times (e.g. 1024, as below). This should exceed the size of the server's fdset, and thus accept() should fail for the last few clients.
i=1;
while [ $i -lt 1025 ]; do
    (./tcpclient -s 10.211.55.11 -p 9999 -c 20 -d 2 -i 1000 &);
    i=$(($i+1));
done
The server should give an error during accept() since the accepted fd would exceed 1024 for the last few clients.
Learning: Understanding the limit on number of clients that can be connected with a TCP Server using select().
Topic: TCP Server using poll() to serve concurrent clients
By default, a Linux system has a limit of 1024 on the number of open file descriptors. For poll() to work beyond this limit, the soft limit should be increased. Use the following to temporarily increase the soft limit on the number of files a process can open from 1024 to 2000:
$ ulimit -Sn 2000
Invoke the server program ./poll (tcpserver_poll.c)
Invoke a large number of concurrent clients (1025) as below.
# clients #1 to #1025: get server side socket fds from 4 to 1028
i=0;
while [ $i -lt 1025 ]; do
    (./tcpclient -s 10.211.55.11 -p 9999 -c 20 -d 2 -i 1000 &);
    i=$(($i+1));
done
The TCP server accepts all 1025 clients, thus crossing the limit of 1024 imposed by fd_set.
Learning: Understanding the use of poll() to deal with concurrent clients and exceed the limit of 1024.
Topic: TCP Server using epoll() to serve concurrent clients
Repeat the first two steps of the previous exercise, i.e. ensure that the number of open file descriptors can exceed 1024, and run the epoll server program instead of the poll server program.
$ ulimit -Sn 2000
Invoke the server program ./epoll (tcpserver_epoll.c)
Invoke the client with 1024+ (e.g. 1025) clients as below.
# clients #1 to #1025: get server side socket fds from 4 to 1028
i=0;
while [ $i -lt 1025 ]; do
    (./tcpclient -s 10.211.55.11 -p 9999 -c 20 -d 2 -i 1000 &);
    i=$(($i+1));
done
The TCP server accepts all 1025 clients, thus crossing the limit of 1024 that constrains select().
The code for (ii = 0; ii < cnt; ii++) at line 120 of tcpserver_epoll.c ensures that only the active fds returned by epoll_wait() are examined. There is no full scan of all open fds as in the case of select() and poll().
Learning: Understand the use of epoll() and how it avoids the overhead of checking all open fds.
[1] Ram Rustagi, Viraj Kumar, "Evolution of Socket Programming – Part I", ACCS Journal of Computing and Communications, Vol 3, Issue 4, December 2019, https://journal.accsindia.org/experiential-learning-of-networking-technologies-evolution-of-socket-programming-part-i/, last accessed Mar 2020.
[2] Ubuntu man page for the listen socket call, http://manpages.ubuntu.com/manpages/bionic/man2/listen.2freebsd.html.
[3] Man page of the select() API, http://manpages.ubuntu.com/manpages/trusty/man2/select.2.html, last accessed Dec 2019.
[4] Leitner, "Scalable Network Programming – Quest for a Good Web Server", http://bulk.fefe.de/scalable-networking.pdf, last accessed Mar 2020.
[5] Man page of poll() – wait for some event on a file descriptor, http://man7.org/linux/man-pages/man2/poll.2.html, last accessed Mar 2020.
[6] Man page of epoll – I/O event notification facility, http://man7.org/linux/man-pages/man7/epoll.7.html, last accessed Mar 2020.
[7] Ram P Rustagi, source code for example programs using select, poll and epoll, along with a client program to connect to the server, https://github.com/rprustagi/EL-Evolution-of-Server-Socket-Programming, last accessed Mar 2020.
[8] Gaurav Banga, "A Scalable and Explicit Event Delivery Mechanism for Unix", https://www.usenix.org/legacy/events/usenix99/full_papers/banga/banga_html/final2.html.
[9] Hruby, Crivat et al., "Minimizing Context Switches for the Socket API", https://www.usenix.org/system/files/conference/trios14/trios14-paper-hruby.pdf.
[10] Kurose, Ross, "Computer Networking: A Top-Down Approach", 6th edition, Pearson.
[11] RFC 793, "Transmission Control Protocol", https://tools.ietf.org/html/rfc793, last accessed Dec 2019.
[12] Marek, "Select is fundamentally broken", https://idea.popcount.org/2017-01-06-select-is-fundamentally-broken/.
[13] epoll_create() – Linux man page, open an epoll file descriptor, http://man7.org/linux/man-pages/man2/epoll_create.2.html.
[14] epoll_ctl() – Linux man page, control interface for an epoll file descriptor, http://man7.org/linux/man-pages/man2/epoll_ctl.2.html, last accessed Mar 2020.
[15] epoll_wait() – Linux man page, wait for an I/O event on an epoll file descriptor, http://man7.org/linux/man-pages/man2/epoll_wait.2.html.