gdb: start debugging core dump from a pointer

Author: viczhang, Tencent CSIG background development engineer

problem

Recently I reproduced a problem in which the IO of ceph-mds, the background daemon of Ceph file storage, gets stuck. In theory, and according to the implementation, stuck IO should resume as soon as the cluster returns to normal. In practice the process stayed stuck and was eventually kicked out by the monitor component. Because the process remained in an unhealthy state, the high availability of the cluster was reduced. We wanted to know why it got stuck, and this article records the general approach that came out of the investigation.

reproduce

Attempts to reproduce the problem from its symptoms and the limited logs did not succeed. Fortunately, the problem then occurred in a development and test environment. Unfortunately:

  • Because the problem was not triggered deliberately, the log level was too low to be useful, so I immediately used gcore to take a core dump of the process for post-mortem analysis.
  • The process was running a development build that was meant to be a debug version, but this version had no debug symbols. After finding this out, we fixed the build pipeline immediately.

The common belief is that reading memory from a process core dump requires the debug symbols that describe its data structures. A core dump without debug symbols is like a treasure chest without the key to open it.

start with a pointer

In addition to the core dump of the process, there is also a useful log:

2022-10-10 12:25:02.233 7f57d70bf700 1 []-- [v2:192.168.56.102:6824/2753553827,v1:192.168.56.102:6825/2753553827] --> v1:192.168.56.103:6801/1057529 -- osd_op(unknown.0.17:3466 1.22 1:44de85d6:::client_192.168.56.101:head [omap-get-vals-by-keys] snapc 0=[] ondisk+read+known_if_redirected+full_force e132) v8-- 0x7f57f457d600 con 0x7f57f44f2000

According to this log, when the problem occurred, ceph-mds used the con object (0x7f57f44f2000) to send an IO request to an OSD. This pointer address is the key that opens the treasure.
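As a small illustration of pulling that pointer out of the log, the sketch below extracts the two trailing hexadecimal values with a regular expression. The function name and the regex are illustrative and not part of the author's script; the first value is presumably the address of the message object, which the article does not use.

```python
import re

# The log line quoted in the article, split only for readability.
LOG_LINE = (
    "2022-10-10 12:25:02.233 7f57d70bf700 1 []-- "
    "[v2:192.168.56.102:6824/2753553827,v1:192.168.56.102:6825/2753553827] "
    "--> v1:192.168.56.103:6801/1057529 -- "
    "osd_op(unknown.0.17:3466 1.22 1:44de85d6:::client_192.168.56.101:head "
    "[omap-get-vals-by-keys] snapc 0=[] "
    "ondisk+read+known_if_redirected+full_force e132) "
    "v8-- 0x7f57f457d600 con 0x7f57f44f2000"
)

# Matches the trailing "<pointer> con <pointer>" pair at the end of the line.
CON_RE = re.compile(r"(0x[0-9a-fA-F]+)\s+con\s+(0x[0-9a-fA-F]+)\s*$")


def extract_pointers(line):
    """Return (first_ptr, con_ptr) as integers, or None if no pair is found."""
    match = CON_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


first_ptr, con_ptr = extract_pointers(LOG_LINE)
print(hex(con_ptr))
```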

approach

Most C++ programs still follow the object-oriented programming paradigm: a class groups related variables, and object instances hold references to one another. The pointer 0x7f57f44f2000 mentioned above is an instance of AsyncConnection.

For example, suppose I want to know which nodes the current ceph-mds has established network connections with. I only need the address of async_msgr, because all network connections can be reached from it, and that address is stored in a member variable of the AsyncConnection instance.

The following is an excerpt from the Python plugin script for gdb:

    class AsyncConnection:
        def __init__(self, addr):
            self.addr = addr
            # Offset 96 was recorded when the symbol table was analyzed.
            self.msgr = read_ptr(addr + 96)

    class AsyncMessenger:
        def __init__(self, addr):
            self.addr = addr
            # Offset 1792 was obtained the same way.
            conns_addrs = UnorderedMap(addr + 1792, allocator_sz=0)

So AsyncConnection(0x7f57f44f2000) gives the AsyncMessenger address 0x7f57f365a000, and from there all the connection information can be read.
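The script relies on a read_ptr helper that the article does not show. A minimal sketch of what such a helper might look like follows, assuming a little-endian 64-bit target. Inside gdb it reads through gdb.selected_inferior().read_memory; the FakeMemory class is a hypothetical stand-in so the example also runs outside gdb, and the byte layout it holds is synthetic.

```python
import struct

try:
    import gdb  # only importable when the script runs inside gdb
except ImportError:
    gdb = None


def read_ptr(addr, read_memory=None):
    """Read one 8-byte little-endian pointer stored at addr."""
    if read_memory is None:
        # Inside gdb: read from the core dump loaded as the current inferior.
        read_memory = gdb.selected_inferior().read_memory
    return struct.unpack("<Q", bytes(read_memory(addr, 8)))[0]


class FakeMemory:
    """Hypothetical stand-in for a core dump: one byte range starting at base."""

    def __init__(self, base, size):
        self.base = base
        self.data = bytearray(size)

    def write_ptr(self, addr, value):
        struct.pack_into("<Q", self.data, addr - self.base, value)

    def read(self, addr, length):
        start = addr - self.base
        return bytes(self.data[start:start + length])


CON_ADDR = 0x7f57f44f2000   # con pointer from the log
MSGR_ADDR = 0x7f57f365a000  # AsyncMessenger address reported in the article

# Synthetic layout: place the messenger pointer at offset 96 of the connection.
mem = FakeMemory(CON_ADDR, 256)
mem.write_ptr(CON_ADDR + 96, MSGR_ADDR)

msgr = read_ptr(CON_ADDR + 96, mem.read)
print(hex(msgr))
```

Passing the reader as a parameter keeps the offset arithmetic testable without a core dump; inside gdb the second argument can simply be omitted.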

greater difficulty

What I really want is the address of the mds object (class MDSDaemon), because once I have it, essentially all of the process's data structures can be inspected.

    class MDSDaemon : public Dispatcher {
        AsyncMessenger *messenger;  // messenger is at offset 768
    }

Now that I know the AsyncMessenger address 0x7f57f365a000, how can I work backwards to the address addr of the MDSDaemon instance? It must satisfy the following equation:

    read_ptr(addr + 768) = 0x7f57f365a000

Note: read_ptr reads 8 bytes from a memory address.

The way to solve this equation is to traverse the memory.
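In its most naive form, solving the equation means checking every pointer-aligned address in a memory region. The sketch below does this over a plain byte buffer; the function name, the 8-byte alignment, and the synthetic region are assumptions made for illustration, not the author's code.

```python
import struct

MESSENGER_OFFSET = 768  # offset of MDSDaemon::messenger, from the article


def find_owners(region_base, region_bytes, target,
                offset=MESSENGER_OFFSET, align=8):
    """Return every addr in the region where the pointer at addr + offset equals target."""
    hits = []
    # Last relative position whose 8-byte field at +offset still fits in the region.
    last = len(region_bytes) - offset - 8
    for rel in range(0, last + 1, align):
        (value,) = struct.unpack_from("<Q", region_bytes, rel + offset)
        if value == target:
            hits.append(region_base + rel)
    return hits


# Synthetic region: 2 KiB of zeros with the target pointer planted so that
# the owning object starts 0x100 bytes into the region.
TARGET = 0x7f57f365a000
BASE = 0x7f57f365a800
region = bytearray(2048)
struct.pack_into("<Q", region, 0x100 + MESSENGER_OFFSET, TARGET)

hits = find_owners(BASE, bytes(region), TARGET)
print([hex(a) for a in hits])
```

A flat scan like this can report false positives, since any 8 bytes that happen to hold the value will match, which is one reason to restrict the search to allocator-managed blocks of a plausible size.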

traverse memory

The ceph-mds process uses the tcmalloc library to manage memory allocation, so what we need to do is analyze tcmalloc's data structures. tcmalloc is an open source component, and its debug symbols are easy to obtain.

I will not go into tcmalloc itself here, since there are many resources on the Internet. The basic logic of the code is as follows:

  • Use the gdb command info variables pageheap_ to get the address of pageheap_

  • Traverse the allocated spans in pageheap_

  • In each span, look for memory blocks of a suitable size

  • In each block that qualifies, check whether the value at offset 768 is 0x7f57f365a000

Result of the run: 0x7f57f365a900 is the MDSDaemon address we are looking for. We now have the key that opens the treasure door.
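The span-based search described in the steps above can be sketched as follows. This is a simplified model and not tcmalloc's real layout: the Span tuple, its fields, the minimum object size, and the dictionary used as memory are all hypothetical. In a real script the spans would come from walking pageheap_ and read_ptr would read from the core dump.

```python
from collections import namedtuple

# Hypothetical, simplified view of an allocated span. Real tcmalloc spans
# record a start page, a page count and a size class instead.
Span = namedtuple("Span", ["start", "length", "object_size"])


def candidate_objects(spans, min_size):
    """Yield the start address of every object slot of at least min_size bytes."""
    for span in spans:
        if span.object_size < min_size:
            continue  # slots in this span are too small to hold the object
        end = span.start + span.length - span.object_size
        for addr in range(span.start, end + 1, span.object_size):
            yield addr


def find_by_member(spans, read_ptr, min_size, offset, target):
    """Return the slots whose pointer-sized field at offset equals target."""
    return [addr for addr in candidate_objects(spans, min_size)
            if read_ptr(addr + offset) == target]


TARGET = 0x7f57f365a000
OFFSET = 768

# Synthetic memory: address -> 8-byte value; unlisted addresses read as 0.
memory = {
    0x11000 + OFFSET: TARGET,  # the object we want, in a large-object span
    0x1000 + OFFSET: TARGET,   # decoy inside a span of 64-byte objects
}
spans = [
    Span(start=0x1000, length=0x1000, object_size=64),
    Span(start=0x10000, length=0x4000, object_size=0x1000),
]

found = find_by_member(spans, lambda a: memory.get(a, 0),
                       min_size=OFFSET + 8, offset=OFFSET, target=TARGET)
print([hex(a) for a in found])
```

Checking only slot start addresses in spans whose object size can hold the structure both shrinks the search and drops matches such as the decoy above.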

In the ToB (enterprise-facing) field, the difficulty is usually locating a problem quickly with limited information, and a core dump is undoubtedly the richest source of information. With the gdb Python API, core dump analysis becomes more systematic and effective, and the analysis process is easy to preserve in the form of scripts.
