
Alibaba Cloud OS Console: Process Hotspot Tracing for Efficiently Resolving Performance Bottlenecks and Jitters

This article provides efficient solutions for business pain points like "service exceptions caused by process performance bottlenecks" along with practical cases.

By Shuyi Cheng

Note: SysOM is an O&M component of the Alibaba Cloud OS console.

1. Background

Process hotspots refer to specific processes or certain parts of a process (such as functions, code segments, and threads) that occupy significant system resources (such as CPU time, memory, and disk I/O) or execute at an exceptionally high frequency, thereby becoming major sources of performance bottlenecks or resource consumption. The concept is central to performance analysis and optimization because it helps developers and O&M personnel quickly locate critical problem areas in the system.

Process hotspot tracing is a key technique in performance analysis. Combined with visualizations such as flame graphs, performance analysis tools can quickly locate performance bottlenecks and resource consumption hotspots in the system, providing strong support for optimization and troubleshooting. In modern, complex system environments, mastering process hotspot analysis is essential for improving system performance and stability.

2. Business Pain Point Analysis

2.1 Pain Point 1: Service Exceptions Caused by Process Performance Bottlenecks

In modern complex cloud-native and containerized environments, business systems face numerous challenges caused by performance bottlenecks. On one hand, process performance bottlenecks may lead to a significant increase in system response time, especially in high-concurrency scenarios, where slowed request processing directly affects user experience. On the other hand, some processes occupy a large amount of system resources (such as CPU and memory), resulting in high system loads and even service unavailability. Process performance bottlenecks are undoubtedly one of the key factors behind service exceptions. Process hotspot tracing technology effectively locates and solves these issues. By generating call graphs and performing hotspot analysis, this technology visually highlights the areas with high resource consumption, helping developers and O&M personnel quickly pinpoint performance bottlenecks.

2.2 Pain Point 2: Sporadic Jitters with Difficulty to Trace Root Causes

Online issues are often sporadic or transient. By the time you manually run diagnostic tools such as perf, the optimal diagnostic window may already have passed, making it impossible to capture the exact system state during the incident. To address this, we have implemented continuous, always-on collection of process hotspot data through a series of technical measures. This collection mechanism consumes very few system resources and has virtually no impact on normal business operations. With the SysOM process hotspot tracing feature, when jitters or other exceptions occurred in the past, you can view the running status of processes during those periods at any time, without logging on to the server or monitoring it in real time. This helps you quickly locate the root causes of problems and improve O&M efficiency.

2.3 Pain Point 3: Prolonged Troubleshooting with Persistent Underlying Issues

In the face of business issues, the lack of effective analysis tools often forces teams to implement temporary "band-aid" solutions that merely alleviate symptoms while leaving root causes unaddressed. This unresolved state can lead to prolonged system instability and may push the system to the brink of failure. The SysOM process hotspot tracing feature provides a variety of analysis capabilities. It not only covers traditional methods such as trace analysis and comparative analysis, but also introduces AI-powered intelligent diagnosis. Drawing on the experience and data accumulated from past problems, it quickly locates the root cause and presents users with clear diagnosis results, helping teams fundamentally resolve business problems and improve system stability and reliability.

3. Solution: Diagnosis via the OS Console

Process hotspot tracing is performed in the following three steps:

1. Stack backtracking: Obtain detailed call stack information in kernel mode and user mode, which initially includes trace and address information.

2. Symbol resolution: Resolve the kernel-mode and user-mode addresses in the call stack into function names that are easy for users to understand.

3. Flame graph generation: Visually present call stack data in the form of a flame graph.

[Figure 1]
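To make step 3 concrete: most flame graph tooling (for example, Brendan Gregg's flamegraph.pl) consumes call stacks in a "folded" text format, one line per unique stack followed by a sample count. The minimal C sketch below is an illustration rather than SysOM code, and the function names in the samples are hypothetical; it only shows how sampled stacks can be merged into that format.

```c
/* Minimal sketch (not SysOM code): collapse sampled call stacks into the
 * "folded" text format used by flame graph renderers, i.e. one line per
 * unique stack: "frame1;frame2;frame3 count". */
#include <stdio.h>
#include <string.h>

#define MAX_STACKS 1024

struct folded {
    char stack[256];        /* semicolon-joined frames, leaf last */
    unsigned long count;    /* number of samples that hit this stack */
};

static struct folded table[MAX_STACKS];
static int nstacks;

/* Record one sampled stack; identical stacks are merged by summing counts. */
static void record_stack(const char *frames[], int depth)
{
    char key[256] = "";
    for (int i = 0; i < depth; i++) {
        if (i)
            strncat(key, ";", sizeof(key) - strlen(key) - 1);
        strncat(key, frames[i], sizeof(key) - strlen(key) - 1);
    }
    for (int i = 0; i < nstacks; i++) {
        if (strcmp(table[i].stack, key) == 0) {
            table[i].count++;
            return;
        }
    }
    if (nstacks < MAX_STACKS) {
        strcpy(table[nstacks].stack, key);
        table[nstacks].count = 1;
        nstacks++;
    }
}

int main(void)
{
    /* Hypothetical samples that a profiler might have already symbolized. */
    const char *s1[] = { "main", "do_request", "parse_json" };
    const char *s2[] = { "main", "do_request", "parse_json" };
    const char *s3[] = { "main", "do_request", "write_log" };

    record_stack(s1, 3);
    record_stack(s2, 3);
    record_stack(s3, 3);

    /* Each printed line corresponds to one tower in the rendered flame graph. */
    for (int i = 0; i < nstacks; i++)
        printf("%s %lu\n", table[i].stack, table[i].count);
    return 0;
}
```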

Next, we will delve into the strategies of stack backtracking and symbol resolution, and elaborate on the specific solutions adopted by SysOM.


3.1 Solution Overview

(1) Analysis of stack backtracking solution

The main purpose of stack backtracking is to obtain the complete call stack of the current program. As the first and most critical step in generating a flame graph, this process presents two main technical challenges.

1. Applications without the frame pointer (FP): To optimize performance, many applications choose not to preserve the frame pointer. This prevents us from relying on traditional FP-based backtracking (see the sketch after this list) to obtain call stack information, so we have to switch to a more complex stack backtracking technique based on DWARF.

2. Stack backtracking for interpreted languages such as Java and Python: Each of these interpreted or high-level programming languages has a unique stack frame structure. Therefore, the key is to identify the currently running program and accurately parse its specific stack frame information.
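For reference, the traditional FP-based walk mentioned in challenge 1 can be sketched in a few lines of C. This is a simplified illustration for x86-64 binaries built with frame pointers kept, not production unwinder code.

```c
/* Minimal sketch of classic frame-pointer (FP) based backtracking on x86-64:
 * each frame stores the caller's saved RBP at [rbp] and the return address at
 * [rbp+8], so walking the chain recovers the call stack. When a binary is
 * built with -fomit-frame-pointer this chain is broken, which is why
 * DWARF-based unwinding is needed instead.
 * Compile with: gcc -O0 -fno-omit-frame-pointer fpwalk.c */
#include <stdint.h>
#include <stdio.h>

static void backtrace_fp(void)
{
    uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);

    /* Walk at most a few frames; a real unwinder validates every pointer. */
    for (int depth = 0; fp && depth < 8; depth++) {
        uintptr_t ret = fp[1];           /* return address of this frame */
        if (!ret)
            break;
        printf("#%d  pc=0x%lx\n", depth, (unsigned long)ret);
        fp = (uintptr_t *)fp[0];         /* saved RBP -> caller's frame */
    }
}

static void level2(void) { backtrace_fp(); }
static void level1(void) { level2(); }

int main(void)
{
    level1();
    return 0;
}
```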

To address these two challenges, SysOM leverages the programming flexibility of eBPF to support stack backtracking both for applications without frame pointers and for interpreted languages. In addition to eBPF, other mainstream stack backtracking solutions include perf and language-level interfaces such as JVM TI provided by Java. The following is a comparative analysis of these three stack backtracking solutions.

| Stack backtracking solution | Without FP | Interpreted languages | Kernel version limits | Stability | Resource overhead |
| --- | --- | --- | --- | --- | --- |
| perf | Supported, but with high overhead | Not supported | None | High | Medium |
| eBPF | Supported (eBPF programmability) | Supported | ≥ 4.19 | High | Low |
| Language-level interfaces | \ | Supported | None | Medium | Low |

The above table presents the respective advantages and limitations of each solution.

① perf, as a long-established performance analysis tool, supports all kernel versions but is weak at handling dynamic languages. In addition, DWARF-based stack backtracking with perf incurs significant resource consumption because the entire user-mode stack needs to be copied out to the user-mode analysis program.

② The programmability of eBPF opens up vast possibilities for new stack backtracking solutions, especially for dynamic languages. By analyzing runtime code information, eBPF can fully resolve the call stacks of dynamic languages such as Java and Python, fully demonstrating its flexibility (a minimal sampling sketch follows this comparison). The only limitation is its dependency on specific kernel versions.

③ As for language-level sampling tools, such as async-profiler, they can use the interfaces provided by the JVM to collect stack information. They do not depend on the kernel version and consume fewer resources. However, being intrusive sampling tools at the process level, they carry a small risk of crashing business applications, so they fall short in terms of stability.
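To make the eBPF approach in ② concrete, here is a minimal kernel-side sketch of perf_event sampling with bpf_get_stackid(). It is an illustration under simplifying assumptions (frame-pointer user stacks, libbpf-style map definitions, illustrative map and function names), not the actual Coolbpf profiler code; DWARF-based and interpreter-aware unwinding require additional logic.

```c
// Minimal eBPF sketch: count (pid, kernel stack, user stack) tuples sampled
// by a perf_event program. A user-space loader attaches this to a sampling
// perf event and later reads the maps to build flame graphs.
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

#define MAX_STACK_DEPTH 127

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, MAX_STACK_DEPTH * sizeof(__u64));
} stackmap SEC(".maps");

struct stack_key {
    __u32 pid;
    __s64 kernel_stack_id;
    __s64 user_stack_id;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, struct stack_key);
    __type(value, __u64);
} counts SEC(".maps");

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    struct stack_key key = {};
    __u64 one = 1, *val;

    key.pid = bpf_get_current_pid_tgid() >> 32;
    key.kernel_stack_id = bpf_get_stackid(ctx, &stackmap, 0);
    key.user_stack_id = bpf_get_stackid(ctx, &stackmap, BPF_F_USER_STACK);

    /* Count how many samples landed on this (pid, kernel stack, user stack). */
    val = bpf_map_lookup_elem(&counts, &key);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```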

For SysOM, the design objective is continuous operation in production environments. Therefore, on the basis of ensuring feature completeness, stability is regarded as the most important factor. To achieve this, SysOM integrates the three solutions to provide comprehensive coverage across various scenarios. In the underlying decision logic, eBPF has the top priority, followed by perf, and finally language-level interfaces. The chart below shows the stack backtracking solution chosen based on different programming languages and kernel versions.

[Figure 2]

(2) Analysis of symbol resolution solution

Symbol resolution primarily involves converting addresses in the call stack into function names. In general, for applications written in compiled languages, this can be done by looking up the symbol table in the ELF file, while for interpreted languages, symbols need to be read from the process memory. These are the technical problems that symbol resolution needs to solve. In terms of architecture, there are two options:

| Solution | Deployment dependencies | Memory usage | Symbol accuracy |
| --- | --- | --- | --- |
| Local resolution | Few | High | Medium |
| Remote resolution | Multiple (dependency on network) | Low | High |

As you can see from the above table, local symbol resolution exhibits fewer dependencies but requires symbol caching to accelerate lookup speed, resulting in higher memory usage. In addition, as most business applications do not include the debuginfo package when deployed in the production environment, symbols may be missing, which in turn compromises the accuracy of symbols. In contrast, remote symbol resolution has more deployment dependencies. For example, it relies on the network to transmit call stack information. However, it eliminates the need for caching symbols on the service machine, so the memory usage is lower. Meanwhile, remote resolution can download the debuginfo package of the application from repositories similar to the YUM repository to obtain more complete symbol information.

Local resolution is more suitable for performance profiling of a single machine, while remote resolution better suits cluster environments and large-scale deployment, which can significantly reduce the overall overhead. For example, if the same version of the MySQL application is deployed in the cluster, you only need to establish a global symbol cache to reduce resource consumption. Given the advantages of local and remote symbol resolution, SysOM supports both solutions.
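As a minimal illustration of the address-to-name step in local symbol resolution for compiled languages, the sketch below maps an in-process code address back to a function name with dladdr(). A real profiler resolves addresses of other processes by parsing their ELF symbol tables and debuginfo packages instead; this is only a simplified demonstration, and the function name is hypothetical.

```c
/* Minimal sketch (not SysOM code): resolve an in-process code address to a
 * function name via the dynamic symbol table.
 * Compile with: gcc -rdynamic symdemo.c -ldl
 * (-rdynamic makes the executable's own global symbols visible to dladdr) */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void sample_function(void) { }   /* global so it appears in .dynsym */

int main(void)
{
    Dl_info info;
    void *addr = (void *)sample_function;

    /* dladdr looks up the nearest dynamic symbol covering the address. */
    if (dladdr(addr, &info) && info.dli_sname)
        printf("%p -> %s (in %s)\n", addr, info.dli_sname, info.dli_fname);
    else
        printf("%p -> [unknown]\n", addr);
    return 0;
}
```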

3.2 Overall Architecture

The following figure shows the complete system structure from the underlying components (Coolbpf profiler) to the front end. It consists of three main components: SysOM frontend, SysOM Agent, and Coolbpf profiler. SysOM is an intelligent O&M platform, Coolbpf serves as the eBPF collection tool, and SysOM Agent is responsible for enabling Coolbpf functions and data communication.

[Figure 3]

Here is a detailed description of each part:

1) SysOM frontend: This is the interface for users to interact with the system, providing visualization capabilities for performance analysis. It contains three main functional modules:

  • Hotspot analysis: analyzes and displays the hotspots of performance bottlenecks in applications.
  • Hotspot comparison: allows you to compare the changes in performance hotspots across different instances, time periods, or conditions.
  • CPU & GPU heat map: provides CPU and GPU performance heat maps to assist you in identifying GPU-related performance issues.

2) SysOM Agent: As the middle layer, SysOM Agent is responsible for collecting and processing performance data and transmitting the results to the front end. It incorporates four hotspot modules:

  • OnCpu hotspot: detects hotspot problems on the CPU.
  • OffCpu hotspot: diagnoses why a process is blocked.
  • Memory hotspot: identifies hotspot areas in memory usage.
  • Lock hotspot: analyzes and reports performance issues caused by lock contention.

3) Coolbpf profiler: This is the underlying general-purpose performance analysis library that provides support for SysOM Agent. It contains two main parts:

  • eBPF & perf stack backtracking:

① uses eBPF technology to capture and analyze call stacks in kernel mode, supporting various programming languages, such as C/C++/Rust/GoLang and Java/Python/LuaJIT;

② uses the perf tool for call stack capture, which includes native code symbol resolution and perf-based call stack analysis for C/C++/Rust/GoLang.

  • User-mode symbol resolution: processes symbol information for user-mode programs, including symbol tables for compiled programs and runtime symbols of interpreted or high-level languages.
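To illustrate how these pieces fit together, the following is a purely hypothetical sketch of the kind of record an agent could assemble after stack backtracking and symbol resolution, covering the four hotspot types listed above. The field names and layout are assumptions for illustration only and do not reflect the actual SysOM or Coolbpf data format.

```c
/* Hypothetical illustration only: one possible shape of a resolved hotspot
 * sample before it is shipped from an agent to a frontend. */
#include <stdint.h>

enum hotspot_type {           /* the four hotspot modules named above */
    HOTSPOT_ONCPU,
    HOTSPOT_OFFCPU,
    HOTSPOT_MEMORY,
    HOTSPOT_LOCK,
};

struct hotspot_sample {
    uint64_t timestamp_ns;    /* when the sample was taken */
    uint32_t pid;             /* process the sample belongs to */
    char comm[16];            /* process name */
    enum hotspot_type type;   /* which hotspot module produced it */
    uint32_t nframes;         /* number of resolved frames */
    const char *frames[128];  /* symbolized call stack, leaf first */
    uint64_t value;           /* samples, blocked ns, bytes, or wait ns */
};
```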

3.3 Frontend Showcase

This section describes how to use the frontend interfaces for hotspot analysis and hotspot comparison.

(1) Hotspot analysis

You can follow these steps to perform hotspot analysis:

1. Parameter selection: Select Instance ID, Process Name, Hotspot Type, and Time Range in sequence. Finally, click the "Perform Hotspot Tracing" button. It should be noted that the hotspot type is dynamic, that is, it will be rendered based on the hotspot types the process contains in the current time period. For example, if it only contains OnCpu hotspots, then the drop-down list displays only the ONCPU option.

[Figure 4]

2. OnCpu hotspot: After we select ONCPU, the system will immediately render the corresponding flame graph as shown in the figure below.

[Figure 5]

3. Memory hotspot: If we select "Memory", the system will immediately render the flame graph of memory usage as illustrated below.

[Figure 6]

(2) Hotspot comparison

Hotspot comparison is a powerful tool for analyzing differences between normal and abnormal environments, enabling precise identification of root causes. Procedure:

1. Parameter selection: Unlike hotspot analysis which requires only one machine instance, hotspot comparison requires two machine instances. After the parameters are configured, click the "Execute Comparative Analysis" button to generate a comparative flame graph.

[Figure 7]

2. Memory differential flame graph: The following figure is a differential flame graph for the memory hotspot type. Since we selected identical machine instances, processes, and time periods, the differential flame graph appears in gray, indicating that there is no hotspot difference.

[Figure 8]
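Conceptually, a differential flame graph compares the per-stack sample counts of two captures; frames whose counts do not change are rendered in gray, as in the figure above. The sketch below uses hypothetical stacks and counts (it is not the SysOM implementation) to show that comparison over the folded format introduced earlier.

```c
/* Minimal sketch: per-stack count deltas between a "baseline" and a "current"
 * folded-stack capture. Zero deltas correspond to the gray (unchanged) frames
 * of a differential flame graph. Stacks present only in the baseline are
 * omitted here for brevity. */
#include <stdio.h>
#include <string.h>

struct folded {
    const char *stack;   /* semicolon-joined frames */
    long count;          /* sample count */
};

/* Hypothetical captures from two time periods. */
static struct folded baseline[] = {
    { "main;do_request;parse_json", 90 },
    { "main;do_request;write_log",  10 },
};
static struct folded current[] = {
    { "main;do_request;parse_json", 90 },
    { "main;do_request;write_log",  40 },
};

static long lookup(struct folded *t, int n, const char *stack)
{
    for (int i = 0; i < n; i++)
        if (strcmp(t[i].stack, stack) == 0)
            return t[i].count;
    return 0;
}

int main(void)
{
    int nb = sizeof(baseline) / sizeof(baseline[0]);
    int nc = sizeof(current) / sizeof(current[0]);

    for (int i = 0; i < nc; i++) {
        long delta = current[i].count - lookup(baseline, nb, current[i].stack);
        printf("%-40s delta=%+ld\n", current[i].stack, delta);
    }
    return 0;
}
```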

3.4 Case Studies

Next, we will use three practical cases to introduce how SysOM process hotspot tracing can quickly help us locate online issues.

Case 1: High load

High load is a common issue frequently encountered online, and high load problems come in many forms. This case focuses on locating a high load problem caused by high sys CPU usage. Recently, a customer experienced periodic load spikes, over ten times higher than normal. Let's take a look at how to quickly locate the problem through SysOM process hotspot tracing.

From the CPU hotspot distribution graph below, we can see that there is a sudden increase around 14:15, which is basically consistent with the time point of high load.

[Figure 9]

Zoom out on the timeline and select the time period with the highest CPU hotspots. We can then see the following heat map.

[Figure 10]

From the hotspot flame graph, we can see that the function with the highest hotspot is native_queued_spin_lock_slowpath, which we preliminarily judge to be lock waiting; in other words, what we see here are the victims, not the lock holder. Click the hotspot function to highlight it in the flame graph.

[Figure 11]

The graph reveals that the vast majority of processes are waiting for locks. So how do we find the lock holder from these hotspots?

First, let's examine what locks these processes are waiting for in the graph - are they waiting for the same lock?

By clicking on the hotspot block in the graph, we can zoom in on a call stack.

[Figure 12]

By analyzing the do_wait function, we find that the process is waiting for the tasklist_lock read lock.

[Figure 13]

Analysis of the do_exit function reveals that the process is waiting for the tasklist_lock write lock.

[Figure 14]

Following the same approach, we randomly select several more functions, and find that they are all waiting for the tasklist_lock lock - some for the read lock, others for the write lock.
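To see why so many victims pile up behind one holder, consider a user-space analogy with a POSIX read-write lock (the kernel's tasklist_lock is an rwlock_t, not a pthread lock, so this is only an analogy): a single thread holding the lock stalls every later waiter.

```c
/* User-space analogy: one holder of the write lock stalls all later readers,
 * which mirrors the "many victims, one culprit" pattern in the flame graph.
 * Compile with: gcc -pthread rwdemo.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *reader(void *arg)
{
    pthread_rwlock_rdlock(&lock);      /* blocks while the writer holds it */
    printf("reader %ld got the lock\n", (long)arg);
    pthread_rwlock_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t readers[4];

    pthread_rwlock_wrlock(&lock);      /* the "culprit" holds the write lock */
    for (long i = 0; i < 4; i++)
        pthread_create(&readers[i], NULL, reader, (void *)i);

    sleep(1);                          /* all readers are now queued (victims) */
    pthread_rwlock_unlock(&lock);      /* release: the victims proceed */

    for (int i = 0; i < 4; i++)
        pthread_join(readers[i], NULL);
    return 0;
}
```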

Next, we need to determine whether process hotspot tracing can quickly identify the lock holder.

The tasklist_lock is designed as a read-write lock that is taken with interrupts disabled. In theory, it is held by a kernel thread that does not release the lock until it exits the critical section. Based on this principle, we can look in the flame graph for a pillar that exclusively holds this lock. Using the function call relationships provided by liveTrace, we examine the call graphs of the top hotspot functions one by one.

[Figure 15]

Selecting the top 1, we can view the information about the processes that are waiting for the tasklist_lock lock.

[Figure 16]

Selecting the top 2, we find process exit functions, most of whose hotspots are concentrated on waiting for the tasklist_lock.

[Figure 17]

Selecting the top 3, and analyzing the copy_process function, we find that it is waiting for the tasklist_lock write lock.

[Figure 18]

Selecting the top 4, we see that _cond_resched is mainly called by do_exit and copy_process.

[Figure 19]

Selecting the top 5, we reach the key moment: the expected bare pillar finally appears. Combined with the kernel's tg_rt_schedulable code, we confirm that this call stack is holding the tasklist_lock read lock. Thus, the key call stack that holds the lock is found.

Case 2: Network timeout

A customer reported intermittent network timeouts in their services. Process hotspot tracing showed that the hotspot was in the nft_do_chain function. The following figure shows the function table and flame graph captured during a network timeout.

[Figure 20]

The nft_do_chain function processes netfilter rules. It is apparent that too many netfilter rules slow down the network protocol stack. Checking with the nft list ruleset command, we find that there are more than 12,000 rules.

[Figure 21]

Case 3: High CPU usage of processes

The CPU usage of shell scripts on a customer machine was high. The following figure shows a captured process hotspot flame graph. From the flame graph, we can see that the hotspots are mainly concentrated in shell_execve, that is, in parsing and executing shell commands. We can therefore speculate that the shell script may be stuck in an abnormal infinite loop that keeps the interpreter parsing and executing commands. The next step is to locate the abnormal command in the shell script.

[Figure 22]

We learn that shell scripts execute shell commands through the execve system call, so we use strace to track the invocation of execve. As shown in the graph below, the script is repeatedly executing the ps and awk commands. A search for these commands in the shell script source code quickly leads us to the problematic source code.

[Figure 23]
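For context on why tracing execve works here: a shell runs each external command by forking and then calling execve, so a script stuck in a loop shows up as an endless stream of execve calls. The sketch below is a simplified illustration of that mechanism (it assumes ps is installed at /bin/ps); it is not the customer's script.

```c
/* Minimal sketch of the mechanism strace relies on here: each iteration forks
 * and calls execve() for an external command, just like a looping shell
 * script repeatedly spawning ps. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (int i = 0; i < 3; i++) {        /* stand-in for a script's loop */
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: replace itself with the external command. */
            char *argv[] = { "/bin/ps", "-e", NULL };
            char *envp[] = { NULL };
            execve("/bin/ps", argv, envp);
            perror("execve");            /* only reached if exec fails */
            _exit(127);
        }
        waitpid(pid, NULL, 0);           /* parent waits, then loops again */
    }
    return 0;
}
```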

Through SysOM process hotspot tracing, we preliminarily evaluate the running status of the shell interpreter and determine that it has entered an infinite loop repeatedly executing an abnormal command. Subsequently, we use the strace command to locate the abnormal shell command and trace it back to the problematic code in the script source code.

PC client of the Alibaba Cloud OS console: https://alinux.console.aliyun.com/
