Java YoungGC的一次排查优化
问题发现
线上报警发生频繁young gc
初步排查
频繁ygc很快就结束了,持续几十秒。没有dump日志或者统计信息可看。
初步排查,是最近刚上线一个功能,10点整有外部定时任务触发的接口调用。
构造数据复现
少量数据调用无法有效复现ygc问题。
构造足够量数据,触发接口。
gc情况查看
- gc查看
进程id
# jps -l
gc情况查看
# /usr/local/java8/bin/jstat -gc -h20 753 100
每100ms打印一次gc,可以看出,每经历ygc一次后EU变化很大,直接变很小。并且Survivor和老年代几乎没变化,这证明很多是不可达对象。
[user@hostname ~]$ /usr/local/java8/bin/jstat -gc -h20 753 100
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
0.0 4096.0 0.0 4096.0 2637824.0 2293760.0 1552384.0 167428.9 250880.0 242331.1 25344.0 23708.4 310 6.166 3 2.489 8.656
0.0 8192.0 0.0 8192.0 2633728.0 20480.0 1552384.0 165546.6 250880.0 242331.1 25344.0 23708.4 311 6.175 3 2.489 8.664
...
0.0 4096.0 0.0 4096.0 2637824.0 73728.0 1552384.0 167421.8 250880.0 242331.1 25344.0 23708.4 312 6.183 3 2.489 8.672
...
每1s打印一次,ygc列变化很猛,几乎1s一次ygc
[user@hostname ~]$ /usr/local/java8/bin/jstat -gc -h20 753 1000
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
0.0 20480.0 0.0 20480.0 2621440.0 1216512.0 1552384.0 273092.5 250880.0 244531.7 25344.0 23985.1 184 5.009 0 0.000 5.009
0.0 20480.0 0.0 20480.0 2621440.0 1409024.0 1552384.0 273092.5 250880.0 244533.9 25344.0 23985.1 185 5.022 0 0.000 5.022
0.0 20480.0 0.0 20480.0 2621440.0 1593344.0 1552384.0 273092.5 250880.0 244533.9 25344.0 23985.1 186 5.036 0 0.000 5.036
0.0 16384.0 0.0 16384.0 2625536.0 1806336.0 1552384.0 275140.5 250880.0 244533.9 25344.0 23985.1 187 5.050 0 0.000 5.050
0.0 20480.0 0.0 20480.0 2621440.0 2023424.0 1552384.0 273092.5 250880.0 244533.9 25344.0 23985.1 188 5.063 0 0.000 5.063
0.0 16384.0 0.0 16384.0 2625536.0 2236416.0 1552384.0 275140.5 250880.0 244533.9 25344.0 23985.1 189 5.076 0 0.000 5.076
...
- gc日志查看Eden变化,以及堆内存配置信息
[Times: user=0.02 sys=0.00, real=0.01 secs] 2022-05-30T21:39:56.393+0800: 3481.247: Total time for which application threads were stopped: 0.0097174 seconds, Stopping threads took: 0.0001711 seconds {Heap before GC invocations=92 (full 0): garbage-first heap total 4194304K, used 2785744K [0x00000006c0000000, 0x00000006c0402000, 0x00000007c0000000) region size 4096K, 614 young (2514944K), 1 survivors (4096K) Metaspace used 238654K, capacity 244949K, committed 245248K, reserved 1269760K
class space used 23768K, capacity 25007K, committed 25088K, reserved 1048576K
2022-05-30T21:39:57.351+0800: 3482.205: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 161480704 bytes, new threshold 15 (max 15)
- age 1: 303376 bytes, 303376 total
- age 2: 178408 bytes, 481784 total
- age 3: 33136 bytes, 514920 total
- age 4: 2018600 bytes, 2533520 total
- age 5: 2992 bytes, 2536512 total
- age 6: 3448 bytes, 2539960 total
- age 7: 9288 bytes, 2549248 total
- age 8: 7840 bytes, 2557088 total
- age 9: 8480 bytes, 2565568 total
- age 10: 5992 bytes, 2571560 total
- age 11: 3408 bytes, 2574968 total
- age 12: 3464 bytes, 2578432 total
- age 13: 6704 bytes, 2585136 total
- age 14: 5904 bytes, 2591040 total
- age 15: 4704 bytes, 2595744 total
3482.205: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 1376, predicted base time: 5.80 ms, remaining time: 94.20 ms, target pause time: 100.00 ms]
3482.205: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 613 regions, survivors: 1 regions, predicted young region time: 2.52 ms]
3482.205: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 613 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 8.32 ms, target pause time: 100.00 ms]
, 0.0105895 secs]
[Parallel Time: 7.4 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 3482205.3, Avg: 3482205.3, Max: 3482205.3, Diff: 0.0]
[Ext Root Scanning (ms): Min: 3.7, Avg: 3.9, Max: 4.2, Diff: 0.6, Sum: 15.6]
[Update RS (ms): Min: 0.7, Avg: 0.7, Max: 0.7, Diff: 0.0, Sum: 2.7]
[Processed Buffers: Min: 21, Avg: 26.5, Max: 31, Diff: 10, Sum: 106]
[Scan RS (ms): Min: 0.1, Avg: 0.2, Max: 0.2, Diff: 0.0, Sum: 0.6]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Object Copy (ms): Min: 2.2, Avg: 2.4, Max: 2.7, Diff: 0.5, Sum: 9.8]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
[GC Worker Total (ms): Min: 7.2, Avg: 7.3, Max: 7.3, Diff: 0.1, Sum: 29.1]
[GC Worker End (ms): Min: 3482212.6, Avg: 3482212.6, Max: 3482212.6, Diff: 0.1]
[Code Root Fixup: 0.2 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.3 ms]
[Other: 2.7 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.8 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.9 ms]
[Eden: 2452.0M(2452.0M)->0.0B(2448.0M) Survivors: 4096.0K->8192.0K Heap: 2720.5M(4096.0M)->270.0M(4096.0M)]
Heap after GC invocations=93 (full 0):
garbage-first heap total 4194304K, used 276440K [0x00000006c0000000, 0x00000006c0402000, 0x00000007c0000000)
region size 4096K, 2 young (8192K), 2 survivors (8192K)
Metaspace used 238654K, capacity 244949K, committed 245248K, reserved 1269760K
class space used 23768K, capacity 25007K, committed 25088K, reserved 1048576K
}
# /usr/local/java8/bin/jmap -heap 28135
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 4294967296 (4096.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 2575302656 (2456.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 536870912 (512.0MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 536870912 (512.0MB)
G1HeapRegionSize = 4194304 (4.0MB)
Heap Usage:
G1 Heap:
regions = 1024
capacity = 4294967296 (4096.0MB)
used = 1558183928 (1485.9999923706055MB)
free = 2736783368 (2610.0000076293945MB)
36.279296688735485% used
G1 Young Generation:
Eden Space:
regions = 332
capacity = 2646605824 (2524.0MB)
used = 1392508928 (1328.0MB)
free = 1254096896 (1196.0MB)
52.614896988906494% used
Survivor Space:
regions = 14
capacity = 58720256 (56.0MB)
used = 58720256 (56.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 26
capacity = 1589641216 (1516.0MB)
used = 106954744 (101.99999237060547MB)
free = 1482686472 (1414.0000076293945MB)
6.7282316867154% used
73234 interned Strings occupying 8925720 bytes.
- 查看类和对象信息
# /usr/local/java8/bin/jmap -histo 753 |more
java.util.ArrayList$Itr 迭代器占用很大内存
[user@hostname ~]$ /usr/local/java8/bin/jmap -histo 753|more
num #instances #bytes class name
----------------------------------------------
1: 52531168 1680997376 java.util.ArrayList$Itr
2: 615656 75433192 [C
3: 79558 57498864 [I
4: 610640 14655360 java.lang.String
5: 140349 12350712 java.lang.reflect.Method
6: 41974 12177704 [B
7: 348980 11167360 java.util.concurrent.ConcurrentHashMap$Node
8: 275265 11010600 com.xx.ConfigValue$StatisticsItem
9: 96029 10628064 [Ljava.lang.Object;
10: 178479 7139160 java.util.LinkedHashMap$Entry
# /usr/local/java8/bin/jmap -histo:live 753
live排前列的没有java.util.ArrayList$Itr
[user@hostname ~]$ /usr/local/java8/bin/jmap -histo:live 753|more
num #instances #bytes class name
----------------------------------------------
1: 406510 44790016 [C
2: 84266 10119320 [Ljava.lang.Object;
3: 403347 9680328 java.lang.String
4: 300603 9619296 java.util.concurrent.ConcurrentHashMap$Node
5: 18768 8838376 [B
6: 94751 8338088 java.lang.reflect.Method
7: 15598 5545056 [I
8: 36284 4222656 java.lang.Class
9: 96948 3877920 com.xx.ConfigValue$StatisticsItem
10: 93357 3734280 java.util.LinkedHashMap$Entry
dump日志查看
- dump 因为ygc很频繁,可能dump时刚走了gc,多dump几次,取文件比较大的
# /usr/local/java8/bin/jmap -dump:format=b,file=wws-heap.hprof 50334
- 下载到本地
# python -m SimpleHTTPServer 8231
电脑访问:
http://remoteIp:8231
- MAT查看
n多个不可达的迭代器ArrayList$Itr对象,地址引用高频连续。迭代器引用的外部List是同一个地址,内部元素也是同一个地址。
第三方组件排查
- 发现可能是第三方组件的问题,升级版本后仍然ygc,并继续精简问题代码
@Override
public void moveXX(XxBO xxBO) {
self.doJob("mytest");
}
@第三方依赖注解(domain = "domainName", retryFor = Exception.class)
@Override
public void doJob(String str) {
try {
TimeUnit.SECONDS.sleep(40);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
继续看gc情况,发现确实只要该任务耗时长,就会频繁ygc。
- 查看频繁生成ArrayList$Itr 的线程信息 找最占用CPU的线程
1. 查看进程id
# jps -l
2. 查看线程id,按H键 查看所有线程
# top -p 47445
3. 转16进制
# printf '%x\n' 47980
4. 查看线程信息
# /usr/local/java8/bin/jstack 47445 | grep -A 20 `printf '%x\n' 47980`
[user@hostname ~]$ /usr/local/java8/bin/jstack 47445 | grep -A 20 bb6c
"retry_task_scanner-0" #404 daemon prio=5 os_prio=0 tid=0x00007fefa805cd60 nid=0xbb6c runnable [0x00007fef17158000]
java.lang.Thread.State: RUNNABLE
at com.xx.executor.ScanTask.waitTasksComplete(ScanTask.java:111)
at com.xx.executor.ScanTask.run(ScanTask.java:73)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
- 排查第三方组件耗cpu线程的那段代码
protected void waitTasksComplete(List<Future<?>> tasks) {
for (; ; ) {
int finishCount = 0;
for (Future<?> futureTask : tasks) {
if (futureTask.isDone() || futureTask.isCancelled()) {
finishCount += 1;
}
}
if (finishCount == tasks.size()) {
break;
}
}
}
这里外层有无限循环,内层有迭代器ArrayList$Itr。很符合上述gc排查,即迭代器实例数涨势很猛的情况。
增加睡眠时间代码,让外层循环慢一点,发现果然没有ygc了。证明确实是这块问题。
protected void waitTasksComplete(List<Future<?>> tasks) {
for (; ; ) {
int finishCount = 0;
for (Future<?> futureTask : tasks) {
if (futureTask.isDone() || futureTask.isCancelled()) {
finishCount += 1;
}
}
if (finishCount == tasks.size()) {
break;
} else {// 增加睡眠,让外层循环慢一点
try {
TimeUnit.MILLISECONDS.sleep(200);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
- 实例验证迭代器这样用法导致ygc
List<Long> lists = Lists.newArrayList(1L);
for (; ; ) {
for (Long p : lists) {
if (p.equals(2)) {
System.out.println(1);
}
}
}
ygc查看猛的一批。dump日志查看,很多迭代器不可达对象和以上案例一样。
替换成fori 循环
List<Long> lists = Lists.newArrayList(1L);
for (; ; ) {
for (int i = 0; i < lists.size(); i++) {
if (lists.get(i).equals(2)) {
System.out.println(1);
}
}
}
没有gc情况出现。
- 第三方组件代码也替换成for i循环验证。
验证方式:建同样的包名和类,覆盖第三方类。类寻找优先。
protected void waitTasksComplete(List<Future<?>> tasks) {
for (; ; ) {
int finishCount = 0;
for (int i = 0; i < tasks.size(); i++) {
if (tasks.get(i).isDone() || tasks.get(i).isCancelled()) {
finishCount += 1;
}
}
if (finishCount == tasks.size()) {
break;
}
}
}