11---
2- .title = "Towards optimal an optimal debugging library framework ",
2+ .title = "Towards optimal debugging and related system design. ",
33.author = "Jan Philipp Hafer",
44.date = @date("2024-06-28:00:00:00"),
55.layout = "optimal_debugging.shtml",
88---
99
1010[]($section.id("intro"))
11- This article is intended as overview of software based debugging techniques and motivation for
12- uniform execution representation and setup to efficiently mix and match the
13- appropriate technique for system level debugging with focus on statically
14- optimizing compiler languages to keep complexity and scope limited.
11+ This article is intended as an overview of software-based debugging techniques to
12+ efficiently mix and match the appropriate technique for system-level debugging,
13+ with a focus on statically optimizing compiler languages to keep complexity and scope limited.
1514The reader may notice that there are several documented deficits
1615across platforms and tooling on documentation or functionality, which will be improved.
1716The author accepts the irony of such statements by "C having no ABI"/many systems in
@@ -21,21 +20,21 @@ for brevity and sanity.
2120Section 1 (theory) feels complete aside from simulation and hard-/software replacement
2221techniques and contains good first drafts on bugs, debugging and the debugging process.
2322Section 2 (practical) is tailored towards non micro Kernels, which are based
24- on process abstraction, but is currently missing content and scalability numbers
25- for tooling.
23+ on process abstraction, but is currently missing some content and
24+ scalability numbers for tooling.
2625The idea is to provide understanding and numbers to estimate for system design,
27261 if formal proof of correctness is feasible and on what parts,
28272 problems and methods applicable for dynamic system analysis.
29- Section 3 (future) will be on speculative and more advanced ideas, which should
30- be feasible based on numbers. They are planned to be about how to design
28+ Section 3 (future) will wrap up the practical problems of what is currently not
29+ practical or possible with the methods of Section 2 and speculate about more advanced ideas,
30+ for brevity without numbers.
31+ Those ideas are planned to cover how to design
3132systems for rewriting and debugging using formal methods, compilers and
3233code synthesis.
3334
3435- 1.[Theory of debugging](#theory)
3536- 2.[Practical methods with trade-offs](#practice)
36- - 3.[Uniform execution representation](#uniform_execution_representation)
37- - 4.[Abstraction problems during problem isolation](#abstraction_problems)
38- - 5.[Possible implementations](#possible_implementations)
37+ - 3.[Wrap-up and future](#wrapup_future)
3938
4039[]($section.id("theory"))
4140### Theory of debugging
@@ -157,8 +156,10 @@ Formal methods, **Specification**, (software) system synthesis and **Formal Veri
157156
158157(Highly) safety-critical systems or hardware are typically created from formal **Specification**
159158by (software) system synthesis or, when (full) synthesis is unfeasible, implementations are formally verified.
160- To my knowledge no standards for (highly) security-critical systems exist,
161- which require formal **Specification** and **Formal Verification** or synthesis (2025-05-16).
159+ Standards for (highly) security-critical systems (like Common Criteria Evaluation Assurance Levels)
160+ provide customer assurances of the security policy according to the specification
161+ and are to my knowledge typically realized via **Specification** and **Formal Verification**
162+ without synthesis (2025-09-28).
162163
163164For non safety- or security-critical or hardware (sub)systems, usually
164165semantics are not "set into stone", so **Formal Verification** or (software) system
@@ -228,62 +229,126 @@ source code adjustments or use 3 tooling that use kernel APIs to trace and optio
228229Kernels further may simplify access to information, for example the `proc` file
229230system simplifies access to process information.
230231
232+ TODO proper benchmarks
233+
231234**Testing** is very context and use-case dependent with
232235typical separations being between pure/impure, time-invariant/variant,
233- accurate/approximate, hardware/software (sub)system separation from simple
236+ accurate/approximate, hardware/simulation/software (sub)system separation from simple
234237unit tests up to integration and end to end tests based on
235238statistical/probability analysis and system intuition on deterministic expected
236239behavior based on explicit or implicit requirements.
240+
237241TODO tools, hardware, software, mixed hw/sw examples
238242
239243**Stepping**
240- * TODO time costs, sync options, etc
241-
242- **Logging**
243- * TODO
244-
245- **Tracing**
246- * TODO
247- - [ ] "Debugging And Profiling .NET Core Apps on Linux"
248- - [ ] https://github.com/goldshtn/linux-tracing-workshop
249- - [ ] CPU sampling linux perf, bcc; win ETW; macos; macos instruments dtrace
250- - [ ] dynamic tracing linux perf, systemtap, bcc; win nothing; macos dtrace
251- - [ ] static tracing linux LTTng, win ETW, macos nothing
252- - [ ] dump gen linux core_pattern, gcore; win procdump, WER; macos kern.corefile, gcore
253- - [ ] dump analysis gdb,lldb; visual studio, windbg, gdb,lldb
254- - [ ] lwn.net Unifying kernel tracing
255- - [ ] https://github.com/goldshtn/linux-tracing-workshop
256- - [ ] babeltrace https://babeltrace.org/
257- - [ ] There are no "works for all kernels" and "trace specific (group of) processes" solutions,
258- - [ ] so one has to do specific queries to constrain what data should be collected.
259- - [ ] For low latency overhead analysis, dtrace or inspired systems like bpftrace,
260- - [ ] bcc and systemtap can be used.
261- - [ ] ETW allows complete user-space captures
262- - [ ] Most related solutions use dtrace or
263- - [ ] TODO
264- - [ ] * list standard Kernel tracing tooling,
265- - [ ] * focus on dtrace and drawback of no "works for all kernels" "trace processes"
266- - [ ] * standard tooling for checking traced information
267- - [ ] * Tracers: dtrace, bpftrace, bcc, systemtap, ETW, darwin/macos?, other posix tools?
268- - [ ] - TODO memory/runtime/latency overhead etc
244+ Stepping is generally based on temporarily substituting instructions of the debug target
245+ with breakpoint interrupt instructions (`INT3` on x86).
246+ Simplifying for brevity, the Kernel then typically switches control
247+ to the debugger, which executes the interrupt logic such as conditional
248+ breakpoints, other logical checks, querying registers and variables based on
249+ debug information, resuming execution or dumping the complete program state.
250+ However, Kernels abstract access, typically restrict each debuggee process
251+ to one debugger, add custom events and make things much slower
252+ due to Interrupt Routine execution and Kernel logic execution for data flow.
253+ Faster alternatives are either read/write buffers with asynchronous execution done from within
254+ the debuggee and debugger as fast path (also called non-stop debugging) or
255+ instruction emulation for tracing use cases.
256+ Fast paths may be realized via "soft interrupts" at user-specified program states and/or on timeouts
257+ or cycle detection.
258+ Customization (for user-implemented **Recording** etc.), visualization and
259+ automation of the control logic and information are in the process of being implemented
260+ by RAD Debugger, without tackling the core bottlenecks yet (2025-09-27).
261+ Other implementations like gdb or lldb prioritize functionality, like remote debugging,
262+ portability and utilities (record and replay, etc.), over performance.
263+
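The substitution mechanism can be illustrated with a toy model. The following is a minimal sketch in Python, not how any real debugger is implemented: the tuple-based instruction set, `TRAP`, `run`, `set_breakpoint` and `debugger` are hypothetical names invented for illustration, and the breakpoint is one-shot (a real debugger would typically single-step and re-insert it).

```python
# Toy model of a software breakpoint: the "debugger" temporarily replaces one
# instruction of the target with a trap, handles the trap, restores the
# original instruction and resumes. All names here are hypothetical.

TRAP = ("trap",)  # stands in for a breakpoint instruction such as INT3


def run(program, regs, pc=0, on_trap=None):
    """Execute a tiny register program until it ends; return the final pc."""
    while pc < len(program):
        instr = program[pc]
        if instr == TRAP:
            # Control "switches to the debugger"; the handler must restore
            # the original instruction, otherwise the trap would hit again.
            on_trap(program, regs, pc)
            continue  # re-execute the restored instruction at the same pc
        op, reg, value = instr
        if op == "set":
            regs[reg] = value
        elif op == "add":
            regs[reg] += value
        pc += 1
    return pc


def set_breakpoint(program, addr):
    """Substitute the instruction at addr with a trap; return the original."""
    original = program[addr]
    program[addr] = TRAP
    return original


program = [("set", "a", 1), ("add", "a", 2), ("add", "a", 10)]
saved = set_breakpoint(program, 2)
observed = []


def debugger(prog, regs, pc):
    observed.append((pc, dict(regs)))  # query registers at the breakpoint
    prog[pc] = saved                   # restore the original instruction


regs = {}
run(program, regs, on_trap=debugger)
```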
264+ TODO potential hardware improvements based on simulation
265+
266+ **Logging and Tracing**
267+ Logging is typically applied to resolve problems of long-running and (intentionally)
268+ hard-to-introspect systems and is used via persistent or temporary storage.
269+ Logging typically follows a log level convention with compile-time and/or
270+ run-time configuration.
271+ Tracers are used where more user control or logic is needed to track down
272+ problematic behavior, and for short-running and (intentionally) easy-to-introspect systems.
273+ dtrace is closest to being a cross-platform tracing solution via binary instrumentation
274+ based on debug information, but does not handle virtualization use cases yet.
275+ babeltrace is closest to being a unified (Linux) Kernel tracing solution.
276+ Accurate hardware-based tracing can be done via CPU sampling as used by Linux
277+ perf, Windows ETW and macOS dtrace, or on bare metal via frequency control and executing
278+ the respective assembly instructions.
279+ General Kernel space tracing solutions (with less overhead or more flexibility), like
280+ systemtap, bcc and bpftrace, are inspired by dtrace, and Kernels have lots of specialized
281+ tracing solutions to observe specific subsystems efficiently with a variety
282+ of application interfaces.
283+ OpenTelemetry can be used for logging, tracing and metrics of (cloud) distributed
284+ applications when storage, performance and network bandwidth are not a concern, since its
285+ (very) verbose JSON without compression offers neither human readability nor
286+ high information density.
287+ To my knowledge, no structured encoding of system logs, traces or metrics via ontologies
288+ or based on time synchronization models (for distributed systems) exists (2025-09-25).
289+
290+ TODO proof read tooling, + typical memory,runtime,latency overhead
291+ https://www.blackhat.com/presentations/bh-europe-08/Beauchamp-Weston/Presentation/bh-eu-08-beauchamp-weston.pdf
269292
270293**Recording**
271- * TODO requirements: eliminate non-deterministic choices for replaying, others
294+ Recording is typically applied to investigate and eliminate problem causes
295+ of a system and is realized via 1 state snapshots based on an upper bound of reachable states
296+ in case of non-determinism and/or 2 elimination of non-determinism via 2.1 logging
297+ non-deterministic choices and/or 2.2 logging/pre-selection of choices.
298+ Typical examples are user input recording (GUI, keyboard)
299+ and Kernel input/output recording (rr, time travel debugging).
300+ One excellent example, which utilizes recording, incremental compilation and live patching, is the
301+ [Tomorrow Corporation Tech Demo](https://www.youtube.com/watch?v=72y2EC5fkcE).
272302
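Logging non-deterministic choices for a later replay can be sketched as follows. This is a minimal illustration in Python under simplifying assumptions: the only source of non-determinism is a random choice, and `Recorder`, `Replayer` and `simulate` are hypothetical names, not the interface of rr or any other tool.

```python
import random


class Recorder:
    """Passes choices through from a real source and logs them."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def choose(self, options):
        choice = self.source.choice(options)
        self.log.append(choice)
        return choice


class Replayer:
    """Returns previously logged choices in order instead of choosing anew."""

    def __init__(self, log):
        self._it = iter(log)

    def choose(self, options):
        choice = next(self._it)
        if choice not in options:
            # The replayed run asked a different question than the recording.
            raise RuntimeError("replay diverged from recording")
        return choice


def simulate(chooser, steps=5):
    """System under investigation: a walk driven by non-deterministic choices."""
    position = 0
    trace = []
    for _ in range(steps):
        position += chooser.choose([-1, 1])
        trace.append(position)
    return trace


recorder = Recorder(random.Random())  # unseeded, so each recording differs
original_trace = simulate(recorder)
replayed_trace = simulate(Replayer(recorder.log))  # deterministic replay
```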
273303**Scheduling**
274- * TODO requirements: simplification methods, practicality
304+ Scheduling to debug requires sufficient control over the scheduler and typically
305+ simplification methods, meaning extending the time duration of synchronization areas,
306+ simplifying state, like testing a sub-system with edge cases, and/or
307+ using artificial synchronization between operations and/or extracting or specifying
308+ synchronization and timing relations based on scheduler configuration, hardware
309+ and empirical observations.
310+ Debuggers like gdb, lldb and WinDbg provide very clumsy and insufficiently fast ways
311+ for such functionality.
312+ To my knowledge, there exist no models or standards for synchronization, timing relations
313+ or scheduler configuration, no project attempting a type 1 hypervisor similar
314+ to what a SPS allows with an API for debugging purposes, and no project to annotate and
315+ extract synchronization and timing relations between tasks for optimizing scheduler
316+ decisions and (formal) model generation (2025-09-27).
275317
276318**Reversal computing**
277- * TODO how and when to write bijective code to simplify debugging
319+ Reversal computing is a typical explicit tactic on program error paths to undo the
320+ operation and is usually fairly simple without Kernel/external input/output.
321+ When Kernel/external input/output is involved, high-performance code uses batching
322+ and users of more "safety"-aware languages typically utilize the type system
323+ (linear/affine types in C++/Rust) or verify cleanup (Frama-C in C),
324+ but usually this only covers memory and not other effects.
325+
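The simple case without Kernel/external input/output can be sketched as an undo stack that is unwound on the error path. This is a minimal illustration in Python; `apply_with_undo`, `make_step` and `fail` are hypothetical names, and only in-memory state is reverted, which matches the limitation stated above.

```python
def apply_with_undo(steps):
    """Run (do, undo) pairs in order; on failure undo completed steps in reverse."""
    undo_stack = []
    try:
        for do, undo in steps:
            do()
            undo_stack.append(undo)  # only completed steps get undone
    except Exception:
        while undo_stack:
            undo_stack.pop()()  # last completed step is reverted first
        raise


state, undone = [], []


def make_step(name):
    """Build a (do, undo) pair that adds and removes one item of state."""
    def do():
        state.append(name)

    def undo():
        state.remove(name)
        undone.append(name)

    return do, undo


def fail():
    raise RuntimeError("simulated failure")


steps = [make_step("a"), make_step("b"), (fail, lambda: None)]

try:
    apply_with_undo(steps)
    outcome = "completed"
except RuntimeError:
    outcome = "rolled back"
```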
326+ TODO check database integrity + kernel/database security (integrity) strategies
327+ before making baseless claims. also check "let it crash"/actor systems
328+ To my knowledge, no widely known strategy of "in-between cleanups"
329+ besides controlled shutdown via linear setup and teardown has been proposed
330+ (2025-09-27).
331+
332+ TODO complexity comparison
333+ * how to get to snapshot design + testing
334+ * error path system reset: how to test that the error path resets the system correctly?
335+ * how to do distributed system sync for reversal computing? shared log + log ops with second log?
278336
279337**Time-reversal computing**
280- * TODO use cases
338+ - time capturing during computing
339+ - assembly time capturing during computing, must ensure no data stalls may happen
340+ - FPGA or ASIC are likely candidates
341+
342+ TODO
343+ * how to get to snapshot design + testing
344+ * error path system reset: how to test that the error path resets the system correctly?
345+ * how to do distributed system sync for reversal computing?
281346
282347The following is a list of typical problems with simple solution tactics.
283348To keep analysis simple, no virtual machine/emulator and simulation approaches are given.
284349
285- []($section.id("uniform_execution_representation "))
286- ### Uniform execution representation
350+ []($section.id("wrapup_future"))
351+ ### Wrap-up and future
287352
288353As it was shown before, modern languages simplify detection or elimination of
289354memory problems and runtime detectable undefined behavior. So far undetectable
@@ -304,18 +369,8 @@ Tracing platform solutions will always have trade-offs.
304369Complete solution tracing user process and related kernel logic is only
305370available as dtrace with non-optimal performance.
306371
307- TODO: (currently unused) what they have in common + motivation
308- TODO: Uniform execution representation and queries over program execution.
309-
310- []($section.id("abstraction_problems"))
311- ### Abstraction problems during problem isolation
312-
313- TODO: origin detection, isolation and abstraction
314-
315- []($section.id("possible_implementations"))
316- ### Possible implementations
317-
318- TODO: (currently unused)
319- query system data vs modify the system vs other to validate approaches;
320- Program modification and validation language, query language and alternatives.
321-
372+ TODO check
373+ * query system data vs modify the system vs other to validate approaches;
374+ * Program modification and validation language, query language and alternatives.
375+ * Uniform execution representation and queries over program execution.
376+ * origin detection, isolation and abstraction