The Day after Patreon Imploded: scraped recap of my posts
Bye bye Patreon. All updates to RemedyBG will now be made here. Here is a recap of progress made throughout 2017:
I made a good bit of progress on the PE/PDB parser used for symbol lookup, file/line number information, and so forth. At a high level this included:
* Extracted necessary information from the PE header to determine starting address (e.g., WinMain) along with section information
* Completed reading the containing multi-stream file format used by PDBs
* Reading the PDB header stream (PDB signature, age, version information, and so forth)
* Began reading the debug information stream for obtaining a mapping between relative virtual address and the contributing module. The initial goal here is to have a look-up from current instruction pointer address to source and line number and tie that back into the debugging engine.
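For anyone curious what the PE side looks like, here is a minimal sketch of pulling the entry point RVA and section table location out of a PE image that has been read into memory. The `pe_info` struct and `pe_parse` function are hypothetical names for illustration, not RemedyBG's actual code; the field offsets follow the published PE/COFF layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for illustration only. */
typedef struct {
    uint32_t entry_point_rva;   /* AddressOfEntryPoint from the optional header */
    uint16_t section_count;     /* NumberOfSections from the COFF file header */
    size_t   section_table_off; /* file offset of the first section header */
} pe_info;

/* Little-endian reads that do not depend on host alignment. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 on success, 0 if the buffer does not look like a PE image. */
static int pe_parse(const uint8_t *file, size_t size, pe_info *out)
{
    assert(file != NULL && out != NULL);

    /* DOS header: "MZ" magic, with e_lfanew stored at offset 0x3C. */
    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z') return 0;
    uint32_t pe_off = read_u32(file + 0x3C);

    /* Need room for the 4-byte signature plus the 20-byte COFF header. */
    if (pe_off > size || size - pe_off < 4 + 20) return 0;
    if (memcmp(file + pe_off, "PE\0\0", 4) != 0) return 0;

    const uint8_t *coff = file + pe_off + 4;
    uint16_t section_count = read_u16(coff + 2);   /* NumberOfSections */
    uint16_t opt_size      = read_u16(coff + 16);  /* SizeOfOptionalHeader */

    /* AddressOfEntryPoint sits at offset 16 in both PE32 and PE32+. */
    size_t opt_off = (size_t)pe_off + 4 + 20;
    if (opt_size < 20 || size - opt_off < opt_size) return 0;

    out->entry_point_rva   = read_u32(file + opt_off + 16);
    out->section_count     = section_count;
    out->section_table_off = opt_off + opt_size;
    return 1;
}
```

The section headers start right after the optional header, which is why the sketch records that offset rather than assuming a fixed header size.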
People have asked if RemedyBG will support Linux. The answer is no. Check out Lysa for this.
Others may ask, but thankfully have not, if RemedyBG will support .NET. The answer is hell no. I'm pretty sure .NET users are content with Visual Studio since it is "good enough" and runs "fast enough". So yeah, no. Friends don't let friends use .NET.
At a high level, I made progress towards having a fast address-to-file/line-number lookup using the handwritten PDB parser.
* Implemented first pass parser for the debug information stream. This stream contains information about the various object files that make up the binary. In particular, we can figure out the address space range of a module (relative to the load address). With that we can get to the appropriate module stream.
* Now reading line number information from a module stream.
* Now reading per-module as well as public symbol information from the associated PDB file.
* The symbol information is used to compute the entry point address and, optionally, add a break point there.
* Most of the time was spent reverse engineering a poorly documented piece of the PDB file format involving filename look-ups from a file id. By documentation, I am referring to the (partial) public repository that Microsoft published a few years ago on github. I've pretty much zeroed in on the key pieces that I need, though.
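To illustrate the address-to-module step, here is a rough sketch of looking up which module contributed a given relative virtual address. It assumes the per-module address ranges have already been pulled out of the debug information stream into a sorted, non-overlapping array; the `contrib_range` struct and `find_module` function are made-up names for this example and not the on-disk PDB layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one module's address range. */
typedef struct {
    uint32_t rva_begin;     /* inclusive, relative to the load address */
    uint32_t rva_end;       /* exclusive */
    uint16_t module_index;  /* which module (OBJ) contributed this range */
} contrib_range;

/*
 * Binary search over ranges sorted by rva_begin.
 * Returns the module index, or -1 if no range contains the address.
 */
static int find_module(const contrib_range *ranges, size_t count, uint32_t rva)
{
    assert(ranges != NULL || count == 0);

    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < ranges[mid].rva_begin) {
            hi = mid;                       /* address is below this range */
        } else if (rva >= ranges[mid].rva_end) {
            lo = mid + 1;                   /* address is above this range */
        } else {
            return ranges[mid].module_index;
        }
    }
    return -1;
}
```

With the module index in hand, the next step would be opening that module's stream to read its line number information.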
I cleaned up some key pieces of the API to the point where I'm happy with them. Added support for detaching a process from the debugger. Added callback for handling OutputDebugString. The ExitProcess callback now returns the exit code. Now returning an enum for the status codes with a number of more descriptive returns. A bunch of other cleanup.
I was planning on returning the status of loading the PDB from the exe in rdbg_DebugNewProcess but realized this was not a great decision. Instead, the PDB load status will be baked into the synthetic structure __modules as one of its fields.
We now have a fast and reliable way to perform function name lookups for adding breakpoints by name. The PDB file is made up of a number of different "streams" somewhat akin to a mini-file system. One of the main streams is the debug information stream which contains the overall symbol table for all of the modules (appears to be one module per OBJ file linked to create the binary). In this symbol table are procedure references that point into a module's symbol table. The PDB format offers this as an optimization to avoid having to traverse each module. In any case, I decided to use the FNV-1a hash for hashing procedure names and am using a hash table sized at 2/3 load, resolving collisions via linear probing. I've only computed the stats for the hash table on a few PDBs so these choices may have to be tweaked in the future. In any case, breakpoints by name are finally back after ditching Microsoft's dbghelp library. A lot of other stuff is falling out due to this work so things should start moving a bit faster.
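As a rough illustration of the scheme described above, here is a small sketch of a name table using 32-bit FNV-1a with linear probing and a 2/3 load cap. The `proc_table` layout, the function names, the fixed capacity, and the choice of the 32-bit variant are all assumptions made for the example; the actual table in the debugger may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical slot: a NULL name marks an empty slot. */
typedef struct {
    const char *name;
    uint32_t    value;   /* e.g. an offset into a module's symbol table */
} proc_slot;

typedef struct {
    proc_slot *slots;
    uint32_t   capacity; /* must be a power of two */
    uint32_t   count;
} proc_table;

/* Standard 32-bit FNV-1a: xor in each byte, then multiply by the prime. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; ++s) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 on insert or update, 0 if a new entry would exceed 2/3 load. */
static int proc_table_insert(proc_table *t, const char *name, uint32_t value)
{
    assert(t->capacity > 0 && (t->capacity & (t->capacity - 1)) == 0);

    uint32_t mask = t->capacity - 1;
    uint32_t i = fnv1a(name) & mask;
    while (t->slots[i].name != NULL) {
        if (strcmp(t->slots[i].name, name) == 0) {
            t->slots[i].value = value;      /* existing key: update in place */
            return 1;
        }
        i = (i + 1) & mask;                 /* linear probe, wrapping */
    }
    if ((uint64_t)(t->count + 1) * 3 > (uint64_t)t->capacity * 2) return 0;

    t->slots[i].name = name;
    t->slots[i].value = value;
    t->count++;
    return 1;
}

/* Returns 1 and writes *value if the name is present, 0 otherwise. */
static int proc_table_lookup(const proc_table *t, const char *name, uint32_t *value)
{
    uint32_t mask = t->capacity - 1;
    uint32_t i = fnv1a(name) & mask;
    while (t->slots[i].name != NULL) {
        if (strcmp(t->slots[i].name, name) == 0) {
            *value = t->slots[i].value;
            return 1;
        }
        i = (i + 1) & mask;
    }
    return 0;
}
```

Capping the load at 2/3 guarantees there is always an empty slot, so every probe sequence terminates without a separate loop counter.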
Removed the handful of memory allocations that were using malloc/free. All memory allocations, including scratch/temporary memory, now go through the memory arena. The memory arena is set up using 8 MB blocks (not set in stone just yet), with additional blocks added dynamically as necessary.
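Here is a minimal sketch of a block-based arena along these lines. The names, the 16-byte alignment, and the oversized-request handling are assumptions for illustration. Note that the sketch backs each block with `malloc` purely to stay portable; a Windows-only debugger would more likely reserve blocks straight from the OS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_BLOCK_SIZE ((size_t)8 * 1024 * 1024)  /* 8 MB per block */

/* Hypothetical block header; the payload follows it in the same allocation. */
typedef struct arena_block {
    struct arena_block *prev;
    size_t used;
    size_t capacity;
} arena_block;

/* Header size rounded up so the payload keeps 16-byte alignment. */
#define ARENA_HEADER ((sizeof(arena_block) + 15) & ~(size_t)15)

typedef struct {
    arena_block *current;
    size_t       block_count;
} arena;

/* Bump-allocates from the current block, adding a new block when it is full. */
static void *arena_alloc(arena *a, size_t size)
{
    assert(a != NULL);

    size_t aligned = (size + 15) & ~(size_t)15;
    if (aligned < size) return NULL;            /* size overflowed */

    if (!a->current || a->current->capacity - a->current->used < aligned) {
        /* Oversized requests get a block of their own (an assumption). */
        size_t cap = aligned > ARENA_BLOCK_SIZE ? aligned : ARENA_BLOCK_SIZE;
        arena_block *b = malloc(ARENA_HEADER + cap);  /* stand-in backing store */
        if (!b) return NULL;
        b->prev = a->current;
        b->used = 0;
        b->capacity = cap;
        a->current = b;
        a->block_count++;
    }

    void *p = (uint8_t *)a->current + ARENA_HEADER + a->current->used;
    a->current->used += aligned;
    return p;
}

/* Frees every block at once; individual allocations are never freed. */
static void arena_release(arena *a)
{
    while (a->current) {
        arena_block *prev = a->current->prev;
        free(a->current);
        a->current = prev;
    }
    a->block_count = 0;
}
```

Scratch memory fits this model well since everything allocated for a temporary task can be dropped in one call.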
It did sadden me some to realize I needed to bring in a third-party library. I decided to use Intel's XED for decoding instructions for the assembly output (and for the step-in functionality which scans forward for the next 'call' instruction). Luckily, it doesn't suck. It uses only a handful of C runtime library functions (like 7) and makes zero memory allocations. After messing with writing a decoder myself some I realized a few things: a) it would be super fun to write but b) it would take me way too long to support all the various processors, instruction set extensions, and so forth. Alas.
Today I've wrapped up functionality required for adding breakpoints. This includes adding breakpoints:
* from a function name (places a breakpoint at the first line of the function)
* from a source location (filename along with a line number)
* from an address in memory
Adding a breakpoint given a source location requires a bit of maneuvering through the PDB file. I'll briefly explain how this works in case you are interested.
To avoid duplicating file names throughout individual translation units, strings are stored in the global string table that is part of the PDB file. Recall a PDB file is split into multiple streams -- sort of like a mini filesystem. To obtain the stream number containing the global string table we need to look at the PDB header. In there lives a data structure that contains the names of some auxiliary streams. We locate the offset in this data structure of the stream named "/names" which is the name of the global string table stream. With this offset, we look in another, different data structure that maps name offsets to stream numbers. We then use this to determine the actual stream number in the PDB that we need.
Now that we have the stream number we can open the global string table stream. This stream contains two things: a buffer of strings followed by a hash table. Our goal for this part will be to retrieve an offset into the buffer of strings of the filename we are looking for. The hash table contains a list of byte offsets into the string buffer and collisions for this hash table are resolved using open addressing. To find a string we first hash the given string S modulo the size of the hash table using a simple hash function specific to the PDB format to obtain an index into the table. Next, we read the offset contained in the hash table, and then do a string compare with S and the value in the string buffer at this offset. If the strings don't match, then linearly probe forward until an empty slot is reached (wrapping if necessary) or the string is found.
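The probing described above might look roughly like the following sketch. A few loud assumptions: `stand_in_hash` is a placeholder and NOT the PDB format's real string hash, a slot value of 0 is taken to mean "empty", and the string buffer and slot array are assumed to be already loaded into memory. The function name `names_find` is made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder hash (sum of bytes); swap in the PDB-specific hash for real use. */
static uint32_t stand_in_hash(const char *s)
{
    uint32_t h = 0;
    for (; *s; ++s) h += (uint8_t)*s;
    return h;
}

/*
 * strings:    buffer of NUL-terminated strings
 * slots:      hash table of byte offsets into strings (0 assumed to mean empty)
 * slot_count: number of entries in slots
 * Returns the offset of s in the string buffer, or -1 if it is not present.
 */
static int64_t names_find(const char *strings, const uint32_t *slots,
                          uint32_t slot_count, const char *s)
{
    assert(slot_count > 0);

    uint32_t start = stand_in_hash(s) % slot_count;
    uint32_t i = start;
    do {
        uint32_t off = slots[i];
        if (off == 0) return -1;                      /* empty slot: not found */
        if (strcmp(strings + off, s) == 0) return off;
        i = (i + 1) % slot_count;                     /* probe forward, wrapping */
    } while (i != start);                             /* full table, no match */

    return -1;
}
```

The returned offset is what gets compared against the file checksum entries in the next step, so no further string compares are needed after this point.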
Next, we need to determine which section and offset this string (the filename) is contained in (if any). Each set of debug information (one for each binary loaded in the system) contains a set of data structures that can have line number information associated with them. The line number information has a couple of pieces that we need for this task: i) a set of variable length structures that contain file checksums along with associated offsets into the global string table and ii) line number information that references these file checksums. The offset into the global string table that we found above is used to search the file checksums to see if the specified file lives in that module. If we find a match, then the offset into the file checksum list (note that this is yet another, completely different offset) can be used to search the line number information. The line number info contains the starting line number and the offset into the section, which is what we need.
Now that we've found the section and offset into that section we can compute the actual address and at long last add a breakpoint at that address. This is done by overwriting the first byte of the instruction at this address in the debugged process with a single byte 'int 3' (0xCC) instruction. The main debugger loop receives a debug event when this breakpoint is hit and we handle it as necessary.
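A minimal sketch of the patching step, under some assumptions: the debuggee's memory is simulated here with a local byte buffer so the example runs anywhere, whereas a real Windows debugger would go through ReadProcessMemory and WriteProcessMemory (followed by FlushInstructionCache). The `breakpoint` struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC

/* Hypothetical bookkeeping for one breakpoint. */
typedef struct {
    uint64_t address;     /* absolute address in the debuggee */
    uint8_t  saved_byte;  /* original byte replaced by int 3 */
    int      enabled;
} breakpoint;

/* Absolute address = load address + section start (RVA) + offset in section. */
static uint64_t resolve_address(uint64_t image_base, uint32_t section_rva,
                                uint32_t offset)
{
    return image_base + section_rva + offset;
}

/* memory stands in for debuggee memory beginning at address mem_base. */
static void breakpoint_enable(uint8_t *memory, uint64_t mem_base, breakpoint *bp)
{
    assert(bp->address >= mem_base);
    if (bp->enabled) return;

    uint8_t *target = memory + (bp->address - mem_base);
    bp->saved_byte = *target;   /* remember the original first byte */
    *target = INT3_OPCODE;      /* patch in the single-byte int 3 */
    bp->enabled = 1;
}

/* Restores the original byte so the instruction can execute normally. */
static void breakpoint_disable(uint8_t *memory, uint64_t mem_base, breakpoint *bp)
{
    assert(bp->address >= mem_base);
    if (!bp->enabled) return;

    memory[bp->address - mem_base] = bp->saved_byte;
    bp->enabled = 0;
}
```

Saving the original byte matters because it has to be restored before the patched instruction can run again once the breakpoint has been hit.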
With breakpoints "complete" (I realize there is additional functionality to be added at some point) I plan on moving away from my trusty command-line debugger tool and begin to integrate new functionality into Vim.
The re-integration with Vim went well. The process control, adding and visualizing breakpoints, displaying the current line, and displaying OutputDebugString messages are now complete.
For the past few days I've been back on the core debugger engine working on step-out functionality. This involves using function records contained in the PE header along with the associated unwind op-codes to determine the calling function (for non-leaf functions at least). Pretty interesting stuff. Need to determine whether the instruction pointer is in the epilogue, prologue, or body of a function and then simulate the values (forward or backwards) of the registers appropriately.
Firstly, I wanted to mention that I streamed a live RemedyBG programming session today (for the first time ever). There were precisely zero viewers but at the very least it helped to talk things through. I was hunting down the last bug I had in the step-out code. Although I didn't solve the bug on the stream I did get it nailed in short order after a break. Turns out it was a bug I introduced when starting the step-out code that borked the reading of the section headers. Because of that, the user breakpoint address, when set via function name, ended up getting resolved incorrectly. A couple of lines of code later and all was well.
There are still some performance tweaks I want to implement before moving on (primarily an instruction cache to reduce calls to ReadProcessMemory during an unwind) but I'm ready to move on to the (much easier) step-over and step-into functionality.