# Linux Epoll

## Overview

Epoll is not just a user-space API; it is a kernel subsystem that maintains a persistent, event-driven mechanism for monitoring the I/O readiness of multiple file descriptors.

Internally, epoll revolves around a kernel object called `eventpoll`.

### Key Components of the `eventpoll` Object

#### 1. The Wait Queue (`wq`)
This queue holds the list of processes (threads) that are currently blocked in `epoll_wait()`, waiting for an event to occur. When an event happens, the kernel wakes up the processes in this queue.

#### 2. The Ready List (`rdlist`)
This is a **doubly linked list** that stores the file descriptors (FDs) that are currently "ready" (i.e., have data to read or space to write).
- When an event occurs on a monitored FD, it is added to this list.
- `epoll_wait()` simply checks this list. If it is not empty, it returns the events to the user.

#### 3. The Red-Black Tree (`rbr`)
This is a **red-black tree** that stores all the file descriptors currently being monitored by this epoll instance.
- It allows for efficient **insertion, deletion, and search** of file descriptors (O(log n)).
- When you call `epoll_ctl()` to add or remove an FD, the kernel modifies this tree.

## How It Works: The Lifecycle

### **Creating an Epoll Instance**: `epoll_create1()`

```c
#include <sys/epoll.h>

int epoll_create1(int flags);
```

Here `flags` can be either `0` or `EPOLL_CLOEXEC`.

If `flags` is zero, this function behaves like the older `epoll_create()` API, but without its obsolete size argument (the size is now managed dynamically by the kernel).

If `flags` is `EPOLL_CLOEXEC`, the kernel sets the close-on-exec (`FD_CLOEXEC`) flag on the new epoll file descriptor. If your process later calls `exec()` (e.g., `execvp()` after a `fork()`), the epoll FD is automatically closed across the `exec()`. This prevents leaking file descriptors into executed programs.
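
As a quick illustration, here is a minimal user-space sketch of creating an epoll instance; the error-handling style is our own:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    /* Create an epoll instance whose FD is closed automatically on exec(). */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }

    /* ... register FDs with epoll_ctl() and wait with epoll_wait() ... */

    close(epfd);
    return 0;
}
```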

When you call `epoll_create1(0)`, the kernel allocates a new `eventpoll` object in kernel memory. This object is the heart of the epoll instance and contains the following key members:

```c
struct eventpoll {
    /* Wait queue for sys_epoll_wait() */
    wait_queue_head_t wq;

    /* List of ready file descriptors */
    struct list_head rdlist;

    /* Red-black tree root used to store monitored file descriptors */
    struct rb_root_cached rbr;

    /* Lock for protecting the structure */
    spinlock_t lock;

    /* ... other fields ... */
};
```

Finally, a file descriptor referencing this `eventpoll` object is returned to user space. Applications can use this descriptor with `epoll_ctl()` and `epoll_wait()` to manage monitored file descriptors and retrieve events.

### **Registering a File Descriptor with Epoll**: `epoll_ctl()`

```c
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
```

**Arguments:**

- `epfd`: The file descriptor of the epoll instance.
- `op`: The operation to perform (`EPOLL_CTL_ADD`, `EPOLL_CTL_MOD`, or `EPOLL_CTL_DEL`).
- `fd`: The target file descriptor to monitor.
- `event`: A pointer to an `epoll_event` structure that specifies the events to monitor (e.g., `EPOLLIN`, `EPOLLOUT`) and any user data associated with the file descriptor.
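
For reference, the `epoll_event` structure passed here (and filled in by `epoll_wait()`) looks like this in the user-space API:

```c
typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events; /* Bit mask of epoll events (EPOLLIN, EPOLLOUT, ...) */
    epoll_data_t data;   /* Opaque user data, returned unchanged by epoll_wait() */
};
```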


When you add a file descriptor using `epoll_ctl(..., EPOLL_CTL_ADD, ...)`:

1. The kernel searches the **red-black tree** to see if the FD is already registered.
2. If not found, it creates a new entry (an `epitem` structure) and inserts it into the tree.

```c
struct epitem {
    struct rb_node rbn;        // node for the red-black tree
    struct list_head rdllink;  // link for the ready list
    struct epoll_filefd ffd;   // target fd + file pointer
    struct eventpoll *ep;      // back-pointer to the owning eventpoll
    struct epoll_event event;  // user's event mask and data
    struct list_head pwqlist;  // poll wait queue links
};
```

3. Crucially, this `epitem` is registered with the target file's `poll` table via the kernel function `ep_ptable_queue_proc()`. This function is what bridges the gap between the hardware/driver and epoll.

::: tip NOTE
Every file descriptor in Linux (sockets, pipes, character devices, etc.) exposes a `poll` method through its `file_operations` (`f_op->poll`).
:::

4. On successful completion, `epoll_ctl()` returns `0`.
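
Here is a short sketch of this registration step from user space, assuming `epfd` comes from `epoll_create1()` and `sockfd` is an existing non-blocking socket (the helper name is our own):

```c
#include <stdio.h>
#include <sys/epoll.h>

/* Register an existing socket for read-readiness notifications.
 * Returns 0 on success, -1 on failure. */
static int watch_socket(int epfd, int sockfd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN;  /* notify when the socket becomes readable */
    ev.data.fd = sockfd;  /* user data: remember which FD this event is for */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) == -1) {
        perror("epoll_ctl(EPOLL_CTL_ADD)");
        return -1;
    }
    return 0;
}
```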

### **The Kernel Callback**: `ep_poll_callback()`

This is the magic glue.
- Once a file is registered with epoll, any time the file's state changes, e.g., new data arrives (`POLLIN`) or buffer space becomes available (`POLLOUT`), the kernel invokes epoll's callback: `ep_poll_callback()`.
- This callback (see the simplified sketch below):
  1. Enqueues the `epitem` into the **ready list** (`rdlist`) of the `eventpoll`.
  2. Wakes up any thread currently sleeping in `epoll_wait()` on this epoll instance.
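
A heavily simplified, illustrative sketch of that logic follows; this is not the actual kernel code, which in `fs/eventpoll.c` also takes locks and handles the overflow list, `EPOLLEXCLUSIVE`, and event-mask filtering:

```c
/* Illustrative pseudocode only -- not the real ep_poll_callback(). */
static int ep_poll_callback_sketch(struct epitem *epi)
{
    struct eventpoll *ep = epi->ep;

    /* 1. Queue the item on the ready list if it is not already there. */
    if (list_empty(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdlist);

    /* 2. Wake up any thread sleeping in epoll_wait() on this instance. */
    if (waitqueue_active(&ep->wq))
        wake_up_interruptible(&ep->wq);

    return 1;
}
```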

### **Waiting for Events**: `epoll_wait()`

This system call blocks the process until any of the monitored descriptors becomes ready for I/O. The process is woken up when the ready list becomes non-empty (via the kernel callback `ep_poll_callback()`).

```c
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```

**Arguments:**

- `epfd`: The file descriptor of the epoll instance.
- `events`: A pointer to an array of `epoll_event` structures (this array needs to be allocated in user space). The kernel stores the events that occurred in this array.
- `maxevents`: The maximum number of events to return. This is the size of the `events` array.
- `timeout`: The maximum time (in milliseconds) to block. If `timeout` is `0`, the call returns immediately. If `timeout` is `-1`, the call blocks indefinitely.

When your process calls `epoll_wait(epfd, events, maxevents, timeout)`, the kernel executes:

```c
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout)
```

#### Internally:

1. **Kernel acquires lock and checks ready list**
   - If the `rdlist` (ready list) is non-empty, events are returned immediately.
   - Otherwise, the process goes to sleep on the wait queue (`ep->wq`).

2. **Sleep and wake mechanism**
   - If no events are ready, the process enters a sleep state via the kernel's scheduler timeout mechanisms.
   - When any monitored FD triggers its `poll` callback, `ep_poll_callback()` runs.

When `epoll_wait()` wakes up, it iterates over the `rdlist`, copies the corresponding `epoll_event` structures to user space, and then clears or re-queues the `rdlist` entries depending on the trigger mode (LT or ET).
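
Putting the pieces together, a minimal event-loop sketch; `handle_io()` is a hypothetical application handler, and `epfd` is assumed to be set up as above:

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Hypothetical application handler, assumed to be defined elsewhere. */
void handle_io(int fd, uint32_t events);

void run_event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        /* Block indefinitely until at least one monitored FD is ready. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n == -1) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            perror("epoll_wait");
            exit(EXIT_FAILURE);
        }

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;     /* user data set via epoll_ctl() */
            uint32_t ev = events[i].events; /* EPOLLIN, EPOLLOUT, EPOLLERR, ... */
            handle_io(fd, ev);
        }
    }
}
```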

## Level-Triggered Mode

In level-triggered (LT) mode, epoll keeps reporting a file descriptor for as long as it remains ready. If a socket has unread data in its receive buffer, every call to `epoll_wait()` will continue to return it until the data is consumed.

This mode is easier to use and ensures events are not missed, but it may generate repeated notifications if the application does not fully drain the file descriptor.
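
A tiny self-contained demo of this behavior using a pipe (error handling omitted for brevity):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

/* Demo: under LT, a partially drained FD is reported again. */
int main(void)
{
    int pipefd[2];
    char buf[1];
    struct epoll_event ev = { .events = EPOLLIN }, out;

    pipe(pipefd);
    int epfd = epoll_create1(0);
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    write(pipefd[1], "ab", 2);      /* two bytes now pending */

    epoll_wait(epfd, &out, 1, 0);   /* reports EPOLLIN */
    read(pipefd[0], buf, 1);        /* consume only one byte */

    /* One byte is still buffered, so LT reports the pipe again. */
    int n = epoll_wait(epfd, &out, 1, 0);
    printf("still ready: %d\n", n); /* prints: still ready: 1 */

    return 0;
}
```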

## Edge-Triggered Mode

In edge-triggered (ET) mode, epoll reports events only when the readiness state changes (for example, when new data arrives on a socket that was previously empty). Once the event is delivered, epoll will not notify again until another state change occurs.

Because **ET does not repeat events**, the application must read or write until the operation returns `EAGAIN`; otherwise, data may remain unread with no further notifications.

ET reduces unnecessary wakeups and is useful for high-performance servers, but it requires more careful programming. This mode is enabled by passing the `EPOLLET` flag when registering the file descriptor with `epoll_ctl()`.

::: tip NOTE
`EAGAIN` is a common error code returned by non-blocking I/O operations (e.g., `read`, `write`, `recv`, `send`) when the operation cannot complete immediately without blocking the calling process. In the context of epoll with non-blocking sockets, especially in edge-triggered mode, receiving `EAGAIN` indicates that there is no more data to read or that the write buffer is full, and you should stop attempting the operation until a new event is reported by `epoll_wait()`.
:::
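
The canonical ET read pattern, sketched under the assumption that `fd` was registered with `EPOLLIN | EPOLLET` and set non-blocking (the helper name is our own):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Drain a non-blocking FD completely after an edge-triggered EPOLLIN event. */
static void drain_fd(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            /* ... process n bytes of data ... */
            continue;
        }
        if (n == 0) {
            close(fd);         /* peer closed the connection */
            return;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return;            /* fully drained: wait for the next edge */
        perror("read");        /* genuine error */
        close(fd);
        return;
    }
}
```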

## The Ready List Lifecycle

Each `epitem` (monitored FD) transitions through three stages:

| Stage | Description |
| :---- | :---- |
| **Registered** | In the red-black tree, not ready yet |
| **Ready** | Added to the ready list after the kernel callback |
| **Delivered** | Returned by `epoll_wait()`, then removed or re-queued |

## Epoll's Red-Black Tree (Interest List)

The red-black tree (`ep->rbr`) is used to maintain **unique FDs** with O(log n) lookups.
It ensures:

- No duplicate entries (`EPOLL_CTL_ADD` fails with `EEXIST` if the FD is already present; see the sketch below).
- Fast modification (`EPOLL_CTL_MOD`).
- Quick removal (`EPOLL_CTL_DEL`).

The tree nodes (`rb_node`) are part of each `epitem`.
When you delete a file descriptor, the corresponding node is removed and freed.
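
A small sketch of the uniqueness guarantee: adding the same FD twice fails, and `EPOLL_CTL_MOD` is the way to change an existing registration (the add-or-modify helper is hypothetical):

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Add fd to the interest list, or update its event mask if already present. */
static int add_or_modify(int epfd, int fd, uint32_t mask)
{
    struct epoll_event ev = {0};
    ev.events = mask;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;                               /* newly inserted into the tree */

    if (errno == EEXIST)                        /* duplicate: node already in tree */
        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);

    return -1;                                  /* some other failure */
}
```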

## The Wait Queue (Process Sleep List)

Each `eventpoll` object has a `wait_queue_head_t wq`.
This queue holds all processes currently sleeping in `epoll_wait()`.

When `ep_poll_callback()` runs and pushes an event onto `rdlist`, it calls:

```c
wake_up_interruptible(&ep->wq);
```

This triggers a scheduler wakeup for any sleeping threads, causing them to resume execution and return the ready events.

## Recursive Epoll (Nested Instances)

Linux allows **epoll of epoll** (monitoring another epoll FD).
The kernel prevents deadlocks and cycles by running loop checks when one epoll FD is added to another and by limiting the nesting depth; a configuration that would create a cycle is rejected with `ELOOP`.

This is handled carefully in `fs/eventpoll.c`. For example, `epoll_ctl()` verifies that `epfd` really is an epoll file and that you are not adding the epoll FD to itself, roughly:

```c
error = -EINVAL;
if (f.file == tf.file || !is_file_epoll(f.file))
    goto error_tgt_fput;
```

and, when the target is itself an epoll file, `ep_loop_check()` rejects cycles between instances.
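
A minimal user-space sketch of nesting (error handling omitted): the inner epoll FD is watched by the outer one, so readiness of anything registered with the inner instance makes the inner FD itself readable:

```c
#include <sys/epoll.h>

int main(void)
{
    int outer = epoll_create1(0);
    int inner = epoll_create1(0);

    /* Watch the inner epoll instance from the outer one. When any FD
     * registered with `inner` becomes ready, `inner` reports EPOLLIN. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner };
    epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

    /* Adding an epoll FD to itself is rejected with EINVAL,
     * and a cycle between instances is rejected with ELOOP. */
    epoll_ctl(outer, EPOLL_CTL_ADD, outer, &ev);   /* fails: EINVAL */

    return 0;
}
```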

## Performance and Locking

Epoll uses **fine-grained spinlocks** (`ep->lock`) around its lists and trees.
Operations are O(1) per event delivery and O(log n) for FD management.

This design ensures:

- Constant-time event delivery.
- Scalability to tens of thousands of FDs.
- Lock contention only on wakeups (not on every syscall).

## Important Kernel Functions (for reference)

| Function | Purpose |
| :---- | :---- |
| `do_epoll_create()` | Allocates and initializes the `eventpoll`. |
| `ep_insert()` | Inserts a new `epitem` into the red-black tree. |
| `ep_remove()` | Removes an `epitem` on `EPOLL_CTL_DEL`. |
| `ep_poll_callback()` | Called by the kernel when an FD becomes ready. |
| `ep_send_events()` | Copies ready events to user space. |
| `ep_eventpoll_release()` | Cleans up on `close(epfd)` or process exit. |

All are defined in `fs/eventpoll.c` (Linux source).

## Understanding Readiness Propagation

When a socket receives new data:

1. The network stack updates its `sk_buff` queue.
2. The socket's `poll()` function now reports `POLLIN`.
3. Epoll's registered callback (`ep_poll_callback()`) runs.
4. The corresponding `epitem` moves to the `rdlist`.
5. If a process is sleeping in `epoll_wait()`, it is woken up.
6. Events are copied to user space, and control returns to the application.

This cycle is **non-blocking** and **fully asynchronous**: the kernel only wakes up the process when there is meaningful work to do.

## Closing the Loop

When the application closes a monitored FD (strictly, when the last reference to the underlying open file is dropped):

- The kernel calls `ep_remove()` to unlink its `epitem` from all lists.
- The callback hooks are detached.
- Memory is freed safely, even if wakeups are pending.

## Summary Table

| Component | Data Structure | Purpose |
| :---- | :---- | :---- |
| Interest list | Red-black tree (`rbr`) | Stores registered FDs |
| Ready list | Linked list (`rdlist`) | Stores active events |
| Wait queue | `wait_queue_head_t` | Holds sleeping processes |
| FD node | `epitem` | Connects a file to the `eventpoll` |
| Callback | `ep_poll_callback()` | Moves items to the ready list and wakes waiters |

### Summary of the Chain of Events

When an application calls `epoll_wait()`, it essentially hands control to the kernel, asking it to monitor all file descriptors that were previously registered through `epoll_ctl()`. Inside the kernel, each epoll instance is represented by an `eventpoll` object, which contains three key components: a red-black tree (holding all registered file descriptors), a ready list (containing file descriptors that currently have pending I/O events), and a wait queue (where user processes sleep when there are no ready events). When `epoll_wait()` is invoked and the ready list is empty, the calling process is put to sleep on the wait queue.

Meanwhile, every monitored file (socket, pipe, etc.) maintains its own wait queue, on which epoll has registered callback entries recording which epoll instances are interested in its state changes. When data arrives or an I/O state changes on any of those files, the kernel triggers the registered callback `ep_poll_callback()`. This callback runs in interrupt or softirq context, adds the corresponding `epitem` (representing that FD) to the eventpoll's ready list, and then wakes up any processes sleeping on the epoll's wait queue. Once the sleeping process wakes, `epoll_wait()` copies the ready events from the kernel's ready list into user-space memory and returns control to the application with the list of file descriptors that are ready for I/O.

Thus, the sequence forms a complete chain: the application waits → the kernel monitors the interest list → a file event triggers the callback → the ready list is updated → the process is woken up → and finally, ready events are delivered back to user space.