Skip to content

Experimental debugger API #9604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 26, 2025
Merged

Experimental debugger API #9604

merged 7 commits into from
Apr 26, 2025

Conversation

jcpetruzza
Copy link
Contributor

This PR is the complement to #9334 (adding the debug_line instruction) and together with #8670 (pausing proc timers) add the support on the VM that drives the new edb debugger.

These are main parts of the PR:

  • A new +D emulator flag is added. When given, the VM becomes "debuggable", which means that when modules are loaded, the debug_line instruction is instrumented similar to what is currently done for tracing-breakpoints, so that one can enable and disable breakpoints on that particular lines.

  • An experimental erl_debugger module is added, with a new debugging API. Essentially, it allows a single, local, process to be registered as the "debugger" process for the node. This process is the one that will receive messages notifying that a process hit a breakpoint. This way, we can decouple debugger implementations that consume this API (like edb) from OTP itself.

  • The erl_debugger module also exposes new BIFs to inspect X and Y registers of a suspended process. Together with the new code-information BIFs added in Add support for debug information for a native debugger #9334, this let's a debugger show the values of variables in scope for a suspended process.

Copy link
Contributor

github-actions bot commented Mar 18, 2025

CT Test Results

    38 files   1 003 suites   7h 26m 30s ⏱️
12 420 tests 11 713 ✅   703 💤 4 ❌
26 816 runs  24 824 ✅ 1 988 💤 4 ❌

For more details on these failures, see this check.

Results for commit 1fb9829.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@bjorng bjorng added team:VM Assigned to OTP team VM feature labels Mar 19, 2025
@bjorng
Copy link
Contributor

bjorng commented Mar 25, 2025

I've pushed a fixup commit with mainly documentation fixes.

@jcpetruzza jcpetruzza force-pushed the erl_debugger branch 2 times, most recently from 4921f0a to 93f4ca9 Compare April 1, 2025 10:57
@jcpetruzza
Copy link
Contributor Author

  • Applied comment fixes
  • Rebased and updated code to use the new version of the debug_line format
  • Fixed beam_debug_info_SUITE
  • Updated comment style

facebook-github-bot pushed a commit to WhatsApp/edb that referenced this pull request Apr 2, 2025
Summary: This imports the latest version of [#9604](erlang/otp#9604). In particular, it is rebased on top of [#9646](erlang/otp#9646), that extended the `debug_line` instruction with an atom to distinguish the "entry" marker and the actual lines.

Reviewed By: TD5

Differential Revision: D72258205

fbshipit-source-id: 1bdd29b37ce867b6c2b6b330b6c1a865b046baf0
facebook-github-bot pushed a commit to WhatsApp/edb that referenced this pull request Apr 3, 2025
Summary:
We update the toolchain to the latest version of [#9604](erlang/otp#9604), which fixes the build for ARM.

Diff of changes [here](https://github.com/jcpetruzza/otp/compare/edb-2025-04-03..edb-2025-04-01)

Reviewed By: TD5

Differential Revision: D72394516

fbshipit-source-id: 755f9afa91a95099fd9037d6a51d36f003196e52
@bjorng bjorng requested review from jhogberg and garazdawi April 4, 2025 07:23
@bjorng
Copy link
Contributor

bjorng commented Apr 4, 2025

For security, we have a policy that only members of the OTP team are allowed to add BEAM files to primary bootstrap or erts/preloaded/ebin. Therefore, I've rebuilt erts_internal.beam and force-pushed to your branch.

@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Apr 4, 2025
@bjorng
Copy link
Contributor

bjorng commented Apr 4, 2025

I've revised the last commit to adhere to our comment conventions and force-pushed.

@garazdawi garazdawi removed the testing currently being tested, tag is used by OTP internal CI label Apr 4, 2025
@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Apr 4, 2025
@bjorng
Copy link
Contributor

bjorng commented Apr 4, 2025

I've force-pushed a new version, where I have:

  • Run make format to update the formatting for C++ files in the JIT (and .exs files for the doc build).
  • Updated the headers.
  • Fixed a broken link in the documentation. (Added back ticks around t:stack_frame/0.)

We add a new erl_debugger module to the kernel
application, that will provides an API for writing
debuggers. Here we set up the general interface,
with a stub for "setting a breakpoint" as the only
debugging primitive. In the next commits we first
implement the "breakpoint" primitive, and later
add additional ones.

About the general design:

- A node needs to have debugging support enabled
  on startup with the `+D` option and this can't be
  enabled afterwards. This gives admins a way to
  prevent processes from blocking on prod due to an
  accidental debugger usage, etc.
- Debugging primitives based on code instrumentation
  can be individually enabled/disabled, etc.
- A debugger is an Erlang process; there can be at most
  one debugger process per node.
- The system will communicate with the debugger process
  via `debugger_events` messages. For now we introduce
  only one message for notifying that a process hit
  a breakpoint
On module loading, `debug_line` are currently ignored.  Now, when
debugger support is enabled (and the line-breakpoint instrumentation
enabled), we emit a `i_debug_line` instruction. For now, we only use
it to add entries to the line-table, the actual breakpoint
instrumentation is introduced next.
The logic was duplicated and out of sync. We want to add another internal
function for dealing with breakpoints, so better to have all in one place
@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Apr 4, 2025
@bjorng
Copy link
Contributor

bjorng commented Apr 4, 2025

Added to our daily builds.

@bjorng bjorng removed the testing currently being tested, tag is used by OTP internal CI label Apr 7, 2025
@bjorng
Copy link
Contributor

bjorng commented Apr 7, 2025

@jcpetruzza

This PR does not build on Windows (see the unsuccessful tests above).

It seems to be because of this new line in beam_asm.h:

void beamasm_emit_align(void *instance, uint32_t alignment);

The culprit seems to be the use of uint32_t. It is unclear to me why it does not work here; it seems to work in some other places in the runtime system.

I suggest that you replace all your new uses of uint32_t with Uint32, which is our own type that we use for all internal APIs.

@jcpetruzza
Copy link
Contributor Author

@bjorng Interesting, I'll fix (need to get some windows box where I can test it first)

@bjorng
Copy link
Contributor

bjorng commented Apr 8, 2025

It turns out that beamasm_emit_align() was neither defined nor used, so I removed its prototype, and now this PR builds on Windows. I've added it to our daily builds.

Regarding the uint32_t type on Windows, I think that it is defined when compiling C++ files, but not when compiling C files. Therefore, it should never be used in a .h file.

@jcpetruzza
Copy link
Contributor Author

@bjorng Thanks for the quick fix! beamasm_emit_align() was a remnant of a previous version that I failed to remove from the header file. Re uint32_t, in C it is available via <stdint.h>; it is likely we were getting this include transitively in some platforms but not others.

@bjorng
Copy link
Contributor

bjorng commented Apr 9, 2025

So now it seems to build on all platforms.

Unfortunately, we had multiple core dumps in our daily builds. Here is one of them:

Thread 1 (Thread 0x7fbf84fc0700 (LWP 7510)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fbfc687b859 in __GI_abort () at abort.c:79
#2  0x000055d2758e6579 in erl_assert_error (expr=expr@entry=0x55d275cb0958 "checked_cp_val((sp[0]),\"beam/erl_debugger.c\",627) == ((void *)0) || sp < (Eterm*)checked_cp_val((sp[0]),\"beam/erl_debugger.c\",627)", func=func@entry=0x55d275cb0a80 <__func__.24841> "erl_debugger_stack_frames_2", file=file@entry=0x55d275cb09db "beam/erl_debugger.c", line=line@entry=627) at sys/unix/sys.c:1082
#3  0x000055d2757cc4b9 in erl_debugger_stack_frames_2 (A__p=0x55d2779d2ab0, BIF__ARGS=, A__I=) at beam/erl_debugger.c:630
#4  0x000055d27568fd08 in debug_call_light_bif (c_p=0x55d2779d2ab0, reg=0x7fbf84fbbd40, I=0x7fbf750df61c , vbf=0x55d2757cc1a4 ) at beam/jit/x86/instr_bif.cpp:394
#5  0x00007fbf7400068b in global::call_light_bif_shared ()

This happens for a debug-enabled runtime with perf support. The layout of the stack is different when perf-support is enabled.

@jcpetruzza
Copy link
Contributor Author

I'll look into this, thanks!

@bjorng
Copy link
Contributor

bjorng commented Apr 10, 2025

I've now examined another core dump. It turns out that erl_debugger_SUITE leaves some processes in an inconsistent state, and if a later test suite calls erlang:system_info(procs) the runtime system could crash.

I've pushed a fixup commit that calls erlang:system_info(procs) at the end of each group. That makes it possible to trigger the crash by running only erl_debugger_SUITE. The runtime system needs to be built in the following way:

make TYPE=debug FLAVOR=emu

The runtime system needs to be started in the following way:

$ERL_TOP/bin/cerl -emu_type debug -emu_flavor emu

If running on a Mac (and possibly when compiling the runtime system with clang on Linux), it is also necessary to apply the following patch to avoid falsely triggering an assertion:

iff --git a/erts/emulator/beam/emu/beam_emu.c b/erts/emulator/beam/emu/beam_emu.c
index 892ef2cdce..2d34882fee 100644
--- a/erts/emulator/beam/emu/beam_emu.c
+++ b/erts/emulator/beam/emu/beam_emu.c
@@ -88,9 +88,7 @@
 #ifdef NO_JUMP_TABLE
 #  define VALID_INSTR(IP) (BeamCodeAddr(IP) < (NUMBER_OF_OPCODES*2+10))
 #else
-#  define VALID_INSTR(IP) \
-    ((BeamInstr)LabelAddr(emulator_loop) <= BeamCodeAddr(IP) && \
-     BeamCodeAddr(IP) < (BeamInstr)LabelAddr(end_emulator_loop))
+#  define VALID_INSTR(IP) 1
 #endif /* NO_JUMP_TABLE */
 
 #define SET_I(ip) \

@garazdawi garazdawi added the full-build-and-check Run a more thorough testing of this PR label Apr 10, 2025
@jcpetruzza
Copy link
Contributor Author

@bjorng Re the erlang:system_info(procs) issue, it looks like this is a pre-existent problem that is exposed by the erl_debugger_SUITE. I've boiled it down to this, that also fails on master. Examining the stack-trace, the process with the issue has the same pid printed by the test:

-module(repro_SUITE).

-export([all/0, groups/0]).
-export([init_per_group/2, end_per_group/2]).

-export([test_stuff/1]).

-include_lib("common_test/include/ct.hrl").


all() ->
    [{group, g}].

groups() ->
    [{g, [test_stuff]}].

init_per_group(_Group, Config) ->
    Config.

end_per_group(_Group, _Config) ->
    true = is_binary(erlang:system_info(procs)),
    ok.

test_stuff(Config) ->
    Mod = dyn_module,
    compile_and_load_dyn_module(Config, []),

    erlang:register(?MODULE, self()),

    P = erlang:spawn(Mod, sync_and_hibernate, []),
    ct:pal("HIBERNATING PROCESS ~p~n", [P]),
    receive {sync, P} -> ok end,

    compile_and_load_dyn_module(Config, []),
    compile_and_load_dyn_module(Config, []),
    ok.

compile_and_load_dyn_module(Config, Opts) when is_list(Opts) ->
    PrivDir = proplists:get_value(priv_dir, Config),
    File = filename:join(PrivDir, "dyn_module.erl"),
    file:write_file(File, dyn_module_source()),

    {ok, Mod, Code} = compile:file(File, [binary, report | Opts]),
    {module, Mod} = code:load_binary(Mod, "", Code),
    ok.

dyn_module_source() ->
    ~"""
    -module(dyn_module).
    -export([wake_up/1, sync_and_hibernate/0]).

    wake_up(X) ->
        X + 1.

    sync_and_hibernate() ->
        repro_SUITE ! {sync, self()},
        erlang:hibernate(?MODULE, wake_up, [10]).
    """.

Running it as follows, crashes on 90% of the runs for me:

> git rev-parse HEAD                                    
72903d0f471da9f49000e7376c3c59562a76b0e9

> $ERL_TOP/bin/erl -emu_type debug -emu_flavor emu -eval 'ct:run(".", "repro_SUITE"), erlang:halt().'
Erlang/OTP 28 [RELEASE CANDIDATE 2] [erts-15.2.5] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [type-assertions] [debug-compiled] [lock-checking]

Eshell V15.2.5 (press Ctrl+G to abort, type help(). for help)

Common Test: Running make in test directories...

CWD set to: "/home/danielgo/repos/otp/foo/[email protected]_15.22.29"

TEST INFO: 1 test(s), 1 case(s) in 1 suite(s)

Testing otp.foo.repro_SUITE: Starting test, 1 test cases

----------------------------------------------------
2025-04-10 15:22:29.745
HIBERNATING PROCESS <0.179.0>


1> beam/erl_printf_term.c:402:checked_is_not_list() Assertion failed: TYPE ASSERTION: !is_header(x)
                                                                                                   [1]    1071937 IOT instruction (core dumped)  $ERL_TOP/bin/erl -emu_type debug -emu_flavor emu -eval 

Looking into the perf one now

@jcpetruzza
Copy link
Contributor Author

Added two fixups: 1) cleanup in a testcase to workaround the hibernation bug above, 2) handle FP with perf enabled for inspecting the stack. Breakpoints are not working correctly with perf enabled, looking into that now

@jcpetruzza
Copy link
Contributor Author

Pushed fix for line-breakpoints with perf enabled

@jcpetruzza
Copy link
Contributor Author

Pushed further fixes to ensure erl_debugger_SUITE passes on emu flavor as well

@jcpetruzza
Copy link
Contributor Author

One more fixup: handle correctly setting/clearing breakpoints on a deleted module

facebook-github-bot pushed a commit to WhatsApp/edb that referenced this pull request Apr 15, 2025
Summary:
We bumpt the toolchain to the latest version of the [#9604](erlang/otp#9604)

Main changes:

* Formatting changes
* Fixes build on windows
* Workaround issue in VM related to hibernated processes and code purging
* Fix issues when running with perf support
* Fix issues when running with flavor `emu`
* Fix issue setting/clearing breakpoints on deleted module

Full list of changes [here](https://github.com/jcpetruzza/otp/compare/edb-2025-04-03..edb-2025-04-14)

Reviewed By: alanz

Differential Revision: D72969015

fbshipit-source-id: 0bca3514ae30a07b91b37ed41a089439a67e34b7
@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Apr 17, 2025
@bjorng
Copy link
Contributor

bjorng commented Apr 17, 2025

Thanks! Added to our daily builds.

We add trampolines on each debug_line instruction, similar
to the ones used on function entry. The main difference is that
we need to save all live X-registers on the stack before calling
`erts_internal:breakpoint/4`, and thus may require a GC.
We want to reuse the logic currently use to flip beakpoints for tracing, etc.
However, at the moment, the functions where this happens not only change the
code, but also set some flags value that are specific to those cases...

We factor out this part of the code to their own functions. In doing so, we observe
that for x86-64, the code was referring to `BEAM_ASM_FUNC_PROLOGUE_SIZE` to determine
the offset to use when "disabling" a breakpoint that doesn't seem relevant. The value
is indeed 6 (see, e.g., the assertion on the "enabling" case), and this corresponds to:
  - 1 byte for the NOP 0x90 due to call alignment
  - 1 byte for the opcode of a `CALL`
  - 4 bytes for the 32-bit offset of the call target
@bjorng bjorng removed the testing currently being tested, tag is used by OTP internal CI label Apr 26, 2025
@bjorng bjorng merged commit 0a3e258 into erlang:master Apr 26, 2025
58 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature full-build-and-check Run a more thorough testing of this PR team:VM Assigned to OTP team VM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants