The JDK Empty-Polling Bug: Cause and Workaround

What Is the JDK Empty-Polling Bug

The JDK NIO empty-polling bug is, more precisely, an epoll empty-polling problem in JDK NIO on Linux.

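epoll is an I/O multiplexing mechanism on Linux that is more efficient than select and poll. It is efficient because the set of monitored file descriptors (fds) is kept inside the kernel, which implements it with a red-black tree plus a linked list. The red-black tree holds the registered fds, and the linked list holds the fds on which events have occurred. A call to epoll_wait returns that ready list to the application, which then handles the events on those fds.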

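For I/O multiplexing, epoll is the usual default on Linux, and Java NIO also uses epoll by default on Linux. The JDK's epoll-based implementation has flaws, however. One of them is the epoll empty-polling bug: even when no event of interest is ready (the select call returns 0), NIO keeps waking up from Selector.select() or Selector.select(timeout), calls that are supposed to block, and CPU usage climbs to 100%.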
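For contrast with the bug, the sketch below shows what a healthy selector is expected to do: with nothing registered, select(timeout) should block for roughly the whole timeout and then return 0. The class and method names (BlockingSelectDemo, timedSelect) are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;

final class BlockingSelectDemo {
    /** Calls select(timeoutMillis) on an empty selector; returns {readyCount, elapsedMillis}. */
    static long[] timedSelect(long timeoutMillis) {
        try (Selector selector = Selector.open()) {
            long start = System.nanoTime();
            // Nothing is registered, so no event can arrive: this call should block.
            int ready = selector.select(timeoutMillis);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            return new long[] {ready, elapsedMillis};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = timedSelect(200);
        System.out.println("ready=" + result[0] + ", elapsedMillis=" + result[1]);
    }
}
```

When the empty-polling bug is triggered, the same call returns 0 almost immediately instead, over and over, which is what drives the CPU to 100%.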

Reproduction steps from the official bug report:

A DESCRIPTION OF THE PROBLEM :
The NIO selector wakes up infinitely in this situation..
  0. server waits for connection
  1. client connects and write message
  2. server accepts and register OP_READ
  3. server reads message and remove OP_READ from interest op set
  4. client close the connection
  5. server write message (without any reading.. surely OP_READ is not set)
  6. server’s select wakes up infinitely with return value 0
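The steps in the report can be sketched as a single-process Java program over loopback. This is an illustration under assumptions, not the test case attached to the bug report: the names EmptyPollRepro and countPrematureReturns are made up here, and whether any premature wake-ups show up depends on the JDK version and the kernel. On a JDK where the problem has been fixed, the count may well be 0.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

final class EmptyPollRepro {
    /** Runs steps 0-6 and returns how many select() calls came back with 0 well before the timeout. */
    static int countPrematureReturns(int maxSelects, long timeoutMillis) {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            // 0. server waits for connection (ephemeral port on loopback)
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                // 1. client connects (above) and writes a message
                client.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));
                // 2. server accepts (above) and registers OP_READ
                accepted.configureBlocking(false);
                SelectionKey key = accepted.register(selector, SelectionKey.OP_READ);
                // 3. server reads the message and removes OP_READ from the interest set
                selector.select(timeoutMillis);
                selector.selectedKeys().clear();
                accepted.read(ByteBuffer.allocate(64));
                key.interestOps(0);
                // 4. client closes the connection
                client.close();
                // 5. server writes without reading again
                try {
                    accepted.write(ByteBuffer.wrap("bye".getBytes(StandardCharsets.UTF_8)));
                } catch (IOException ignored) {
                    // The peer is gone; a failed write does not change what is measured.
                }
                // 6. count select() calls that return 0 long before the timeout
                int premature = 0;
                for (int i = 0; i < maxSelects; i++) {
                    long start = System.nanoTime();
                    int ready = selector.select(timeoutMillis);
                    long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
                    if (ready == 0 && elapsedMillis < timeoutMillis / 2) {
                        premature++;
                    }
                }
                return premature;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("premature returns: " + countPrematureReturns(10, 100L));
    }
}
```

The loop is bounded on purpose: on an affected JDK an unbounded loop would spin forever, which is exactly the symptom described in step 6.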

Cause of the bug:

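When a connection is dropped abruptly, poll and epoll set EPOLLHUP or EPOLLERR in the event set returned for that socket. Because the event set has changed, the Selector is woken up. If that is the only reason for the wake-up and no event of interest has actually occurred, the result is an empty poll.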

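epoll event flags

Symbol        Description
EPOLLIN       The file descriptor is readable (this includes an orderly close by the peer socket).
EPOLLOUT      The file descriptor is writable.
EPOLLPRI      The file descriptor has urgent data to read (presumably meaning out-of-band data has arrived).
EPOLLERR      An error occurred on the file descriptor.
EPOLLHUP      The file descriptor was hung up.
EPOLLET       Puts the file descriptor into edge-triggered mode; the default is level-triggered.
EPOLLONESHOT  Reports an event only once. To keep monitoring the socket after that event, it must be re-armed with epoll_ctl (EPOLL_CTL_MOD).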
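To make the cause concrete, here is a simplified model of how a selector might translate epoll flags into ready operations. It is a sketch for illustration, not the JDK's actual code: the class EpollEventModel and its methods are hypothetical. The flag values are the ones defined in the Linux header sys/epoll.h, and the operation bits are the real SelectionKey constants.

```java
import java.nio.channels.SelectionKey;

final class EpollEventModel {
    // Flag values as defined in the Linux header <sys/epoll.h>.
    static final int EPOLLIN = 0x001;
    static final int EPOLLPRI = 0x002;
    static final int EPOLLOUT = 0x004;
    static final int EPOLLERR = 0x008;
    static final int EPOLLHUP = 0x010;

    /** Simplified translation from epoll flags to the ready set the application sees. */
    static int readyOps(int epollEvents, int interestOps) {
        if ((epollEvents & (EPOLLERR | EPOLLHUP)) != 0) {
            // Assumption of this model: an error or hang-up is reported as
            // "everything you asked for is ready", so the handler finds out on its next I/O call.
            return interestOps;
        }
        int ready = 0;
        if ((epollEvents & EPOLLIN) != 0) {
            ready |= SelectionKey.OP_READ;
        }
        if ((epollEvents & EPOLLOUT) != 0) {
            ready |= SelectionKey.OP_WRITE;
        }
        return ready & interestOps;
    }

    /** The kernel reported something, but nothing of interest results: an empty wake-up. */
    static boolean isEmptyWakeup(int epollEvents, int interestOps) {
        return epollEvents != 0 && readyOps(epollEvents, interestOps) == 0;
    }

    public static void main(String[] args) {
        // Peer hung up while the interest set is empty.
        System.out.println(isEmptyWakeup(EPOLLHUP, 0));
        // Normal readable event with OP_READ registered.
        System.out.println(isEmptyWakeup(EPOLLIN, SelectionKey.OP_READ));
    }
}
```

In this model a hang-up on a key whose interest set is empty wakes the selector without producing any selected key. With level-triggered epoll the condition persists, so the next select call wakes up again immediately.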

How to Work Around the Bug

[JDK-6403933] mentions several workarounds; this article covers only the one Netty uses.

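Netty counts consecutive select calls that return early with nothing selected in order to decide whether empty polling is happening (the default threshold is 512). Once the count reaches the threshold, Netty rebuilds the Selector.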
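Stripped of Netty's task and wake-up handling, the idea can be sketched with plain java.nio as follows. This is a minimal sketch under assumptions, not Netty's implementation: SelectorRebuilder, selectOrRebuild, rebuild and demo are names invented here, and the loop is single-threaded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.concurrent.TimeUnit;

final class SelectorRebuilder {
    /** Blocks until an event, a real timeout, or `threshold` premature returns; returns the selector to use next. */
    static Selector selectOrRebuild(Selector selector, long timeoutMillis, int threshold) {
        try {
            int prematureReturns = 0;
            for (;;) {
                long start = System.nanoTime();
                int ready = selector.select(timeoutMillis);
                long elapsed = System.nanoTime() - start;
                if (ready != 0 || elapsed >= TimeUnit.MILLISECONDS.toNanos(timeoutMillis)) {
                    return selector; // something was selected, or the full timeout elapsed
                }
                if (++prematureReturns >= threshold) {
                    return rebuild(selector); // too many early returns with nothing selected
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens a fresh selector, moves every valid registration onto it, and closes the old one. */
    static Selector rebuild(Selector oldSelector) {
        try {
            Selector newSelector = Selector.open();
            for (SelectionKey key : oldSelector.keys()) {
                if (!key.isValid()) {
                    continue;
                }
                int interestOps = key.interestOps();
                Object attachment = key.attachment();
                key.cancel();
                key.channel().register(newSelector, interestOps, attachment);
            }
            oldSelector.close();
            return newSelector;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Registers one pipe end, rebuilds, and summarizes the state afterwards. */
    static String demo() {
        try {
            Pipe pipe = Pipe.open();
            Selector old = Selector.open();
            pipe.source().configureBlocking(false);
            pipe.source().register(old, SelectionKey.OP_READ, "attachment");
            Selector fresh = rebuild(old);
            SelectionKey moved = pipe.source().keyFor(fresh);
            String summary = "oldOpen=" + old.isOpen() + " keys=" + fresh.keys().size()
                    + " interestOps=" + moved.interestOps();
            fresh.close();
            pipe.source().close();
            pipe.sink().close();
            return summary;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The caller has to keep using the selector that selectOrRebuild returns, because the old one is closed after a rebuild. Rebuilding helps because the new selector starts with a fresh epoll instance, so whatever state caused the spurious wake-ups is left behind with the old one.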

The select method from Netty's NioEventLoop (comments translated):

private void select(boolean oldWakenUp) throws IOException {
    Selector selector = this.selector;
    try {
        // selectCnt counts select calls. Once the number of empty polls reaches
        // SELECTOR_AUTO_REBUILD_THRESHOLD (512 by default), the selector is rebuilt.
        int selectCnt = 0;
        // Record the current time.
        long currentTimeNanos = System.nanoTime();
        // selectDeadLineNanos = current time + delay until the earliest scheduled task is due.
        // This is the point by which select must have woken up; if it stayed blocked
        // any longer, the scheduled task could not run on time.
        long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);

        long normalizedDeadlineNanos = selectDeadLineNanos - initialNanoTime();
        if (nextWakeupTime != normalizedDeadlineNanos) {
            nextWakeupTime = normalizedDeadlineNanos;
        }

        for (;;) {
            // Longest time this select call may block, rounded to the nearest millisecond.
            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
            // Deadline reached: a scheduled task is due.
            if (timeoutMillis <= 0) {
                if (selectCnt == 0) {
                    // Non-blocking; returns 0 if nothing is ready.
                    selector.selectNow();
                    selectCnt = 1;
                }
                break;
            }

            // If a task was submitted when wakenUp value was true, the task didn't get a chance to call
            // Selector#wakeup. So we need to check task queue again before executing select operation.
            // If we don't, the task might be pended until select operation was timed out.
            // It might be pended until idle timeout if IdleStateHandler existed in pipeline.
            // Make sure there really is no pending task before blocking.
            if (hasTasks() && wakenUp.compareAndSet(false, true)) {
                selector.selectNow();
                selectCnt = 1;
                break;
            }

            // Blocking select; another thread can still wake it up while it blocks.
            int selectedKeys = selector.select(timeoutMillis);
            selectCnt ++;

            if (selectedKeys != 0 || oldWakenUp || wakenUp.get() || hasTasks() || hasScheduledTasks()) {
                // - Selected something,
                // - waken up by user, or
                // - the task queue has a pending task.
                // - a scheduled task is ready for processing
                break;
            }

            // If select returned because of a real event rather than a timeout or an empty poll,
            // the check above has already left the loop. Reaching this point leaves two cases:
            // 1. select timed out
            // 2. an empty poll occurred

            if (Thread.interrupted()) {
                // Thread was interrupted so reset selected keys and break so we not run into a busy loop.
                // As this is most likely a bug in the handler of the user or it's client library we will
                // also log it.
                //
                // See https://github.com/netty/netty/issues/2426
                if (logger.isDebugEnabled()) {
                    logger.debug("Selector.select() returned prematurely because " +
                            "Thread.currentThread().interrupt() was called. Use " +
                            "NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
                }
                selectCnt = 1;
                break;
            }

            long time = System.nanoTime();
            // select timed out: the elapsed time is at least the maximum blocking time.
            if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
                // timeoutMillis elapsed without anything selected.
                selectCnt = 1;
            } else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 &&
                    selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
                // The number of empty polls reached SELECTOR_AUTO_REBUILD_THRESHOLD (512 by default).
                // The code exists in an extra method to ensure the method is not too big to inline as this
                // branch is not very likely to get hit very frequently.

                // Rebuild the selector.
                selector = selectRebuildSelector(selectCnt);
                selectCnt = 1;
                break;
            }

            currentTimeNanos = time;
        }

        if (selectCnt > MIN_PREMATURE_SELECTOR_RETURNS) {
            if (logger.isDebugEnabled()) {
                logger.debug("Selector.select() returned prematurely {} times in a row for Selector {}.",
                        selectCnt - 1, selector);
            }
        }
    } catch (CancelledKeyException e) {
        if (logger.isDebugEnabled()) {
            logger.debug(CancelledKeyException.class.getSimpleName() + " raised by a Selector {} - JDK bug?",
                    selector, e);
        }
        // Harmless exception - log anyway
    }
}
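Two pieces of arithmetic in the method above are easy to misread: the rounding in the timeoutMillis calculation and the test that tells a genuine timeout from a premature return. The sketch below isolates both so they can be checked with concrete numbers. SelectTimeoutMath and its method names are made up for this illustration; the formulas are copied from the listing.

```java
import java.util.concurrent.TimeUnit;

final class SelectTimeoutMath {
    /** Nanoseconds until the deadline, rounded to the nearest millisecond (the +500000L term). */
    static long timeoutMillis(long deadlineNanos, long nowNanos) {
        return (deadlineNanos - nowNanos + 500000L) / 1000000L;
    }

    /** True if at least timeoutMillis elapsed between startNanos and nowNanos, i.e. a genuine timeout. */
    static boolean timedOut(long startNanos, long nowNanos, long timeoutMillis) {
        return nowNanos - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= startNanos;
    }

    public static void main(String[] args) {
        System.out.println(timeoutMillis(1_499_999L, 0L)); // 1: just under 1.5 ms rounds down
        System.out.println(timeoutMillis(1_500_000L, 0L)); // 2: exactly 1.5 ms rounds up
        System.out.println(timeoutMillis(400_000L, 0L));   // 0: under 0.5 ms, so selectNow() is used
        System.out.println(timedOut(0L, 2_000_000L, 2L));  // true: the full 2 ms elapsed
        System.out.println(timedOut(0L, 1_999_999L, 2L));  // false: returned early
    }
}
```

Only the second case, an early return with nothing selected, leaves selectCnt growing toward the rebuild threshold. A genuine timeout resets it to 1.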