
Commit 8c1fd3a

Merge pull request #4 from Shibin-Ez/master
added new epoll documentation
2 parents af32a33 + 92e61d6 commit 8c1fd3a

File tree

2 files changed: +270 additions, -0 deletions


docs/.vitepress/config.mjs

Lines changed: 4 additions & 0 deletions
@@ -53,6 +53,10 @@ export default defineConfig({
       text: 'Linux epoll tutorial',
       link: '/guides/resources/linux-epoll-tutorial',
     },
+    {
+      text: 'Linux epoll',
+      link: '/guides/resources/linux-epoll',
+    },
     {
       text: 'Blocking & Non-Blocking Sockets',
       link: '/guides/resources/blocking-and-non-blocking-sockets',
Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
# Linux Epoll

## Overview

Epoll is not just a user-space API; it is a kernel subsystem that maintains a persistent, event-driven mechanism for monitoring the I/O readiness of multiple file descriptors.

Internally, epoll revolves around a kernel object called `eventpoll`.

### Key Components of the `eventpoll` Object

#### 1. The Wait Queue (`wq`)
This queue holds the processes (threads) that are currently blocked in `epoll_wait()`, waiting for an event to occur. When an event happens, the kernel wakes up the processes in this queue.

#### 2. The Ready List (`rdlist`)
This is a **doubly linked list** that stores the file descriptors (FDs) that are currently "ready" (i.e., have data to read or space to write).
- When an event occurs on a monitored FD, that FD is added to this list.
- `epoll_wait()` simply checks this list. If it is not empty, it returns the events to the user.

#### 3. The Red-Black Tree (`rbr`)
This is a **red-black tree** that stores all the file descriptors currently being monitored by this epoll instance.
- It allows efficient **insertion, deletion, and search** of file descriptors (O(log n)).
- When you call `epoll_ctl()` to add or remove an FD, the kernel modifies this tree.

## How it works: The Lifecycle

### **Creating an Epoll Instance**: `epoll_create1()`

```c
int epoll_create1(int flags);
```

Here `flags` can be either `0` or `EPOLL_CLOEXEC`.

If `flags` is zero, this function behaves like the older `epoll_create()` API, but without its obsolete `size` argument (the size is now managed dynamically by the kernel).

If `flags` is `EPOLL_CLOEXEC`, the kernel sets the close-on-exec (`FD_CLOEXEC`) flag on the new epoll file descriptor. If your process later calls `exec()` (e.g., `execvp()` after a `fork()`), the epoll FD is automatically closed in the executed program. This prevents leaking file descriptors into executed programs.

When you call `epoll_create1(0)`, the kernel allocates a new `eventpoll` object in kernel memory. This object is the heart of the epoll instance and contains the following key members:

```c
struct eventpoll {
    /* Wait queue for sys_epoll_wait() */
    wait_queue_head_t wq;

    /* List of ready file descriptors */
    struct list_head rdlist;

    /* Red-black tree root used to store monitored file descriptors */
    struct rb_root_cached rbr;

    /* Lock for protecting the structure */
    spinlock_t lock;

    /* ... other fields ... */
};
```

Finally, a file descriptor referencing this `eventpoll` object is returned to user space. Applications use this descriptor with `epoll_ctl()` and `epoll_wait()` to manage monitored file descriptors and retrieve events.

### **Registering a File Descriptor with Epoll**: `epoll_ctl()`

```c
#include <sys/epoll.h>

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
```

**Arguments:**

- `epfd`: The file descriptor of the epoll instance.
- `op`: The operation to perform (`EPOLL_CTL_ADD`, `EPOLL_CTL_MOD`, or `EPOLL_CTL_DEL`).
- `fd`: The target file descriptor to monitor.
- `event`: A pointer to an `epoll_event` structure that specifies the events to monitor (e.g., `EPOLLIN`, `EPOLLOUT`) and any user data associated with the file descriptor.

When you add a file descriptor using `epoll_ctl(..., EPOLL_CTL_ADD, ...)`:

1. The kernel searches the **red-black tree** to see if the FD is already registered.
2. If it is not found, the kernel creates a new entry (an `epitem` structure) and inserts it into the tree.

```c
struct epitem {
    struct rb_node rbn;        // node in the red-black tree
    struct list_head rdllink;  // link in the ready list
    struct epoll_filefd ffd;   // target fd + file pointer
    struct eventpoll *ep;      // back-pointer to the owning eventpoll
    struct epoll_event event;  // the user's event mask
    struct list_head pwqlist;  // poll wait queue links
};
```

3. Crucially, this `epitem` is hooked into the target file's poll table via the function `ep_ptable_queue_proc()`. This function is what bridges the gap between the hardware/driver and epoll.

::: tip NOTE
Every file descriptor in Linux (sockets, pipes, character devices, etc.) exposes a `poll` method through its `file_operations` (`f_op->poll`).
:::

4. On successful completion, `epoll_ctl()` returns `0`.

### **The Kernel Callback**: `ep_poll_callback()`

This is the magic glue.
- Once a file is registered with epoll, any time the file's state changes, e.g., new data arrives (`POLLIN`) or buffer space becomes available (`POLLOUT`), the kernel invokes epoll's callback: `ep_poll_callback()`.
- This callback:
  1. Enqueues the `epitem` into the **ready list** (`rdlist`) of the `eventpoll`.
  2. Wakes up any thread currently sleeping in `epoll_wait()` on this epoll instance.

### Waiting for Events: `epoll_wait()`

This system call blocks the process until one of the monitored descriptors becomes ready for I/O. The process is woken up when the ready list becomes non-empty (via the kernel callback `ep_poll_callback()`).

```c
#include <sys/epoll.h>

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```

**Arguments:**

- `epfd`: The file descriptor of the epoll instance.
- `events`: A pointer to an array of `epoll_event` structures (this array must be allocated in user space). The kernel stores the events that occurred in this array.
- `maxevents`: The maximum number of events to return; this is the size of the `events` array.
- `timeout`: The maximum time (in milliseconds) to block. If `timeout` is `0`, the call returns immediately. If `timeout` is `-1`, the call blocks indefinitely.

When your process calls `epoll_wait(epfd, events, maxevents, timeout)`, the kernel executes:

```c
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout)
```

#### Internally:

1. **The kernel acquires the lock and checks the ready list.**
   - If the `rdlist` (ready list) is non-empty, events are returned immediately.
   - Otherwise, the process goes to sleep on the wait queue (`ep->wq`).

2. **Sleep and wake mechanism.**
   - If no events are ready, the process enters a sleep state via the kernel's scheduler timeout mechanisms.
   - When any of the monitored FDs triggers its `poll` callback, `ep_poll_callback()` runs.

When `epoll_wait()` wakes up, it iterates over the `rdlist`, copies the corresponding `epoll_event` structures to user space, and then clears or re-queues the `rdlist` entries depending on the trigger mode (LT or ET).

## Level-triggered mode

In level-triggered (LT) mode, epoll reports an event as long as the file descriptor remains ready. If a socket has unread data in its receive buffer, every call to `epoll_wait()` will continue to return it until the data is consumed.

This mode is easier to use and ensures events are not missed, but it may generate repeated notifications if the application does not fully drain the file descriptor.

## Edge-triggered mode

In edge-triggered (ET) mode, epoll reports events only when the readiness state changes (for example, when new data arrives on a socket that was previously empty). Once the event is delivered, epoll will not notify again until another state change occurs.

Because **ET does not repeat events**, the application must read or write until the operation returns `EAGAIN`; otherwise, data may remain unread with no further notifications.

ET reduces unnecessary wakeups and is useful for high-performance servers, but it requires more careful programming. This mode is enabled by passing the `EPOLLET` flag when registering the file descriptor with `epoll_ctl()`.

::: tip NOTE
`EAGAIN` is a common error code returned by non-blocking I/O operations (e.g., `read`, `write`, `recv`, `send`) when the operation cannot complete immediately without blocking the calling process. In the context of `epoll` with non-blocking sockets, especially in edge-triggered mode, receiving `EAGAIN` indicates that there is no more data to read (or that the write buffer is full), and you should stop attempting the operation until a new event is reported by `epoll_wait()`.
:::

## The Ready List Lifecycle

Each `epitem` (monitored FD) transitions through three stages:

| Stage | Description |
| :---- | :---- |
| **Registered** | In the red-black tree, not ready yet |
| **Ready** | Added to the ready list after the kernel callback |
| **Delivered** | Returned by `epoll_wait()`, then removed or re-queued |

## Epoll's Red-Black Tree (Interest List)

The red-black tree (`ep->rbr`) is used to maintain **unique FDs** with O(log n) lookups.
It ensures:

- No duplicate entries (`EPOLL_CTL_ADD` fails with `EEXIST` if the FD is already present).
- Fast modification (`EPOLL_CTL_MOD`).
- Quick removal (`EPOLL_CTL_DEL`).

The tree nodes (`rb_node`) are embedded in each `epitem`.
When you delete a file descriptor, the corresponding node is removed and freed.

## The Wait Queue (Process Sleep List)

Each `eventpoll` object has a `wait_queue_head_t wq`.
This queue holds all processes currently sleeping in `epoll_wait()`.

When `ep_poll_callback()` runs and pushes an event into the `rdlist`, it calls:

```c
wake_up_interruptible(&ep->wq);
```

This triggers a scheduler wakeup for any sleeping threads, causing them to resume execution and return the ready events.

## Recursive Epoll (Nested Instances)

Linux allows **epoll of epoll** (monitoring another epoll FD).
The kernel prevents deadlocks and cycles by running a loop check whenever one epoll FD is added to another, and by limiting the nesting depth.

This is handled carefully in `fs/eventpoll.c` using checks like:

```c
if (is_file_epoll(file))
    error = -EINVAL;
```

in contexts where nesting is not permitted.

## Performance and Locking

Epoll uses **fine-grained spinlocks** (`ep->lock`) around its lists and tree.
Operations are O(1) per delivered event and O(log n) for FD management.

This design ensures:

- Constant-time event delivery.
- Scalability to tens of thousands of FDs.
- Lock contention only on wakeups, not on every syscall.

## Important Kernel Functions (for reference)

| Function | Purpose |
| :---- | :---- |
| `do_epoll_create()` | Allocates and initializes the `eventpoll`. |
| `ep_insert()` | Inserts a new `epitem` into the red-black tree. |
| `ep_remove()` | Removes an `epitem` on `EPOLL_CTL_DEL`. |
| `ep_poll_callback()` | Called by the kernel when an FD becomes ready. |
| `ep_send_events()` | Copies ready events to user space. |
| `ep_eventpoll_release()` | Cleans up on `close(epfd)` or process exit. |

All are defined in `fs/eventpoll.c` in the Linux source.
## Understanding Readiness Propagation

When a socket receives new data:

1. The network stack updates the socket's `sk_buff` queue.
2. The socket's `poll()` function now reports `POLLIN`.
3. Epoll's registered callback (`ep_poll_callback()`) runs.
4. The corresponding `epitem` moves to the `rdlist`.
5. If a process is sleeping in `epoll_wait()`, it is woken up.
6. Events are copied to user space, and control returns to the application.

This cycle is fully **event-driven**: the kernel wakes the process only when there is meaningful work to do.

## Closing the Loop

When the application closes a monitored FD:

- The kernel calls `ep_remove()` to unlink its `epitem` from all lists.
- The callback hooks are detached.
- Memory is freed safely, even if wakeups are still pending.
## Summary Table

| Component | Data Structure | Purpose |
| :---- | :---- | :---- |
| Interest list | Red-black tree (`rbr`) | Stores registered FDs |
| Ready list | Linked list (`rdlist`) | Stores active events |
| Wait queue | `wait_queue_head_t` | Holds sleeping processes |
| FD node | `epitem` | Connects a file to the eventpoll |
| Callback | `ep_poll_callback()` | Moves items to the ready list and wakes waiters |

### Summary of the chain of events

When an application calls `epoll_wait()`, it essentially hands control to the kernel, asking it to monitor all file descriptors that were previously registered through `epoll_ctl()`. Inside the kernel, each epoll instance is represented by an `eventpoll` object, which contains three key components: a red-black tree (holding all registered file descriptors), a ready list (containing file descriptors that currently have pending I/O events), and a wait queue (where user processes sleep when there are no ready events).

When `epoll_wait()` is invoked and the ready list is empty, the calling process is put to sleep on the wait queue. Meanwhile, every monitored file descriptor (socket, pipe, etc.) maintains its own internal poll table, a structure that records which epoll instances are interested in its state changes. When data arrives or an I/O state changes on any of those file descriptors, the kernel triggers the registered callback, `ep_poll_callback()`. This callback runs in interrupt or softirq context, adds the corresponding `epitem` (representing that FD) to the eventpoll's ready list, and then wakes up any processes sleeping on the epoll's wait queue. Once the sleeping process wakes, `epoll_wait()` copies the ready events from the kernel's ready list into user-space memory and returns control to the application with the list of file descriptors that are ready for I/O.

Thus, the sequence forms a complete chain: the application waits → the kernel monitors the interest list → a file event triggers the callback → the ready list is updated → the process is woken up → ready events are delivered back to user space.
