@sam0044 sam0044 commented Aug 20, 2025

qa/clyso/upgrade: adds variable placeholders in the workflows instead of hardcoding them
Signed-off-by: Sam Goyal <sam.goyal@clyso.com>

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

ronen-fr and others added 30 commits August 6, 2025 06:46
build_pg_dicts() is used to construct a set of dictionaries
(PG to Primary OSD, PG to Acting OSDs, etc.)
from the output of 'ceph pg dump'. The original code
was inefficient: when used in a new test that creates
a large cluster, its run time was prohibitively long.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
…es-have-incorrect-info-in-gui

mgr/dashboard: Fixed incorrect snapshot scheduled date for rbd block in GUI
…-in-grafana

mgr/dashboard: 72409 : Fixed parsing error in grafana for host overall performance iframe
rgw: check all JWKS for STS

Reviewed-by: Pritha Srivastava <prsrivas@redhat.com>
osd/scrub: do not limit operator-initiated repairs

Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/72421

Signed-off-by: Ankush Behl <cloudbehl@gmail.com>
rgw/multisite: Fix lifetime issues

Reviewed-by: Casey Bodley <cbodley@redhat.com>
Fixes: https://tracker.ceph.com/issues/70882
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
I had been thinking of list and trim as purely internal interfaces,
but they are called through HTTP and thus need to be prepared for bad
input.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Didn't include `associated_cancellation_slot.hpp`.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Asio does not have nearly as many actual explicit concepts one can use
as one might like.

And there's no reason we might not want our own asynchrony-related concepts.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
If a path is provided, use it in statfs. Replumb the internal statfs
as an internal-only helper so that both ll_statfs and statfs can use it.

Fixes: https://tracker.ceph.com/issues/72355
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
qa/suites/krbd: use a standard fixed-1 cluster in unmap subsuite

Reviewed-by: Ramana Raja <rraja@redhat.com>
Reimplement with `initiate` rather than the old style. This
necessitates getting rid of the old `async::Completion` in anything
that was calling it, and other changes.

Also, use disposition for error handling.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
A convenience function for turning coroutines that return values and
use exceptions, `error_code`, or similar into `int`-returning
functions that take references to out parameters.
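
As a rough illustration of the idea in this commit message, a synchronous analogue can be sketched as follows. `call_to_int` is a hypothetical name, and the sketch wraps a plain callable rather than a coroutine, so it is not the function this commit adds; it only shows the shape of turning "return a value or throw" into "return an `int` and fill an out parameter".

```cpp
#include <cassert>
#include <cerrno>
#include <system_error>
#include <utility>

// Hypothetical helper (not the one added by the commit): calls `f`, stores
// its result in `out`, and maps failures to a negative errno-style int.
// The real function adapts coroutines; this sketch is synchronous.
template <typename F, typename T>
int call_to_int(F&& f, T& out) {
  try {
    out = std::forward<F>(f)();
    return 0;
  } catch (const std::system_error& e) {
    // Exceptions carrying an error_code become a negative error number.
    return -e.code().value();
  } catch (...) {
    // Anything else is reported as a generic I/O error (an assumption).
    return -EIO;
  }
}
```

On failure the out parameter is left untouched, so callers can keep a default value in it.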

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
This avoids having two entry points, each with its own error checking,
preparation, etc., that can get out of sync or miss a fix.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Easier to debug.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
monitoring: Add per share metrics to SMB dashboard

Reviewed-by: Pedro Gonzalez <pegonzal@redhat.com>
…operations-erasure-code-profile-tr72436

doc/rados: Fix broken links

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
doc/install: Linkify mention of ceph.conf and use ref for links
…-troubleshooting-stuck-during-recovery

doc/cephfs: edit troubleshooting.rst

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Follow up on comments made by Anthony D'Atri in
ceph#64832 and make other small changes to
increase the ease of reading this text.

Signed-off-by: Zac Dover <zac.dover@proton.me>
Edit "Avoiding Recovery Roadblocks" in the "Stuck During Recovery"
section of doc/cephfs/troubleshooting.rst.

This commit follows ceph#64854.

Signed-off-by: Zac Dover <zac.dover@proton.me>
…-troubleshooting

doc/cephfs: edit troubleshooting.rst

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Avoids severe slowdowns with detect_stack_use_after_return=1.
The root cause is unclear, but ASan's fake stack GC behavior is
suspected. Tuning the UAR (Use-After-Return) fake stack size
(reduced from 64KB–1MB to 64KB) helped delay the onset of the
performance degradation.

Fixes: https://tracker.ceph.com/issues/71704

Signed-off-by: Chanyoung Park <chaney.p@kakaoenterprise.com>
Fixes: https://tracker.ceph.com/issues/70254

Signed-off-by: Chanyoung Park <chaney.p@kakaoenterprise.com>
…equence_10_bug_fix

test/osd: Fix pack for minor issues in ceph_test_rados_io_sequence

Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
…-calls

mgr/cephadm: limit calls to list_servers

Reviewed-by: John Mulligan <jmulligan@redhat.com>
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sam0044 sam0044 closed this Aug 20, 2025
JoshuaGabriel pushed a commit that referenced this pull request Sep 27, 2025
Previously, run-cli-tests ignored all environment variables from the parent
process to ensure a clean test environment. However, this also dropped
sanitizer settings (ASAN_OPTIONS and LSAN_OPTIONS) needed when AddressSanitizer
is enabled.

This causes test failures with TCMalloc due to false-positive leak reports
from TCMalloc's internal objects, which is a known issue documented in
Google's C++ style guide. While recent gperftools releases have fixed this,
Ubuntu Jammy still ships with an older version that triggers these warnings.

This change preserves sanitizer environment variables while maintaining
the clean test environment for other variables.

Note: Once we upgrade to newer gperftools, we can remove the related
suppression rule in qa/lsan.supp.

The test failure with TCMalloc looks like:

```
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/cli/ceph-kvstore-tool/help.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests/src/test/cli/ceph-kvstore-tool/help.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests/src/test/cli/ceph-kvstore-tool/help.t.err
@@ -21,3 +21,19 @@
     stats
     histogram [prefix]

+
+  =================================================================
+  ==87908==ERROR: LeakSanitizer: detected memory leaks
+
+  Direct leak of 45 byte(s) in 1 object(s) allocated from:
+      #0 0x562fd797265d in operator new(unsigned long) (/home/jenkins-build/build/workspace/ceph-pull-requests/build/bin/ceph-kvstore-tool+0xe5e65d) (BuildId: 7eb56077b615aeb3c7aedafa0818ad89fdf3702d)
+      #1 0x562fd79815c8 in std::__new_allocator<char>::allocate(unsigned long, void const*) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/new_allocator.h:137:27
+      #2 0x562fd7981520 in std::allocator<char>::allocate(unsigned long) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/allocator.h:188:32
+      #3 0x562fd7981520 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/alloc_traits.h:464:20
+      #4 0x562fd798115a in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_M_create(unsigned long&, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/basic_string.tcc:155:14
+      #5 0x562fd798787f in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_M_mutate(unsigned long, unsigned long, char const*, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/basic_string.tcc:328:21
+      #6 0x562fd79876a7 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_M_append(char const*, unsigned long) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/basic_string.tcc:420:8
+      #7 0x7fa1aa0286f0 in MallocExtension::Initialize() (/lib/x86_64-linux-gnu/libtcmalloc.so.4+0x2a6f0) (BuildId: eeef3d1257388a806e122398dbce3157ee568ef4)
+
+  SUMMARY: AddressSanitizer: 45 byte(s) leaked in 1 allocation(s).
```

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
JoshuaGabriel pushed a commit that referenced this pull request Sep 27, 2025
Fix a memory leak in ErasureCodePluginExample when plugin registration
fails. The allocated ErasureCodePluginExample instance was not being
freed if ErasureCodePluginRegistry::add() failed, which occurs in tests
that intentionally register duplicate plugins.

ASan detected the leak:

```
Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x7f4501321a2d in operator new(unsigned long) /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_new_delete.cpp:86
    #1 0x7f4501a5914d in __erasure_code_init /home/kefu/dev/ceph/src/test/erasure-code/ErasureCodePluginExample.cc:44
    #2 0x5589985be68d in ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*) /home/kefu/dev/ceph/src/erasure-code/ErasureCodePlugin.cc:149
    #3 0x5589984984ee in ErasureCodePluginRegistryTest_all_Test::TestBody() /home/kefu/dev/ceph/src/test/erasure-code/TestErasureCodePlugin.cc:116
```

Use unique_ptr to manage the plugin instance lifecycle, following the
pattern used by other erasure code plugins. The instance is now
automatically destroyed if registry addition fails.
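
The ownership pattern described above can be sketched roughly as follows. `Plugin`, `Registry`, `register_plugin`, and the `live_plugins` counter are hypothetical stand-ins, not the real Ceph classes, and the registry is assumed to take ownership of the raw pointer only when `add()` succeeds.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <memory>
#include <string>

// Counts live Plugin instances so a leak would be visible (illustration only).
int live_plugins = 0;

// Hypothetical stand-in for an erasure code plugin.
struct Plugin {
  Plugin() { ++live_plugins; }
  ~Plugin() { --live_plugins; }
};

// Hypothetical stand-in for the plugin registry: owns plugins it accepted.
class Registry {
  std::map<std::string, Plugin*> plugins;
public:
  Registry() = default;
  Registry(const Registry&) = delete;
  Registry& operator=(const Registry&) = delete;
  ~Registry() {
    for (auto& kv : plugins) delete kv.second;
  }
  // Rejects duplicates; takes ownership of `plugin` only on success.
  int add(const std::string& name, Plugin* plugin) {
    if (plugins.count(name)) return -EEXIST;
    plugins[name] = plugin;
    return 0;
  }
};

int register_plugin(Registry& registry, const std::string& name) {
  auto plugin = std::make_unique<Plugin>();
  int r = registry.add(name, plugin.get());
  if (r == 0) {
    plugin.release();  // the registry owns the instance now
  }
  return r;  // on failure, unique_ptr destroys the instance here
}
```

Registering the same name twice then fails the second time without leaving an orphaned instance behind, which is the case the duplicate-registration tests exercise.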

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
JoshuaGabriel pushed a commit that referenced this pull request Sep 27, 2025
Replace unsafe string construction with bufferlist::length() to avoid
reading beyond buffer boundaries.

In commit 92ccbff, we introduced a bug when checking if a digest was
empty by constructing a std::string from bufferlist:

```
std::string(digest.second.c_str()).empty()
```

This is unsafe because bufferlist data is not guaranteed to be null-
terminated. The std::string constructor searches for a null terminator
and may read beyond the bufferlist's allocated memory, causing a
heap-buffer-overflow detected by AddressSanitizer:

```
==66092==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7e0c65215004 at pc 0x7fbc6e27c597 bp 0x7ffe29fb6100 sp 0x7ffe29fb58b8
READ of size 5 at 0x7e0c65215004 thread T0
    #0 0x7fbc6e27c596 in strlen /usr/src/debug/gcc/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:425
    #1 0x562c75fad91a in std::char_traits<char>::length(char const*) /usr/include/c++/15.2.1/bits/char_traits.h:393
    #2 0x562c75fb4222 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&) /usr/include/c++/15.2.1/bits/basic_string.h:713
    #3 0x562c761b81ae in operator() /home/kefu/dev/ceph/src/osd/scrubber/scrub_backend.cc:1300
    #4 0x562c761d7d53 in operator()<mini_flat_map<shard_id_t, ceph::buffer::v15_2_0::list, signed char>::_iterator<false> > /usr/include/c++/15.2.1/bits/predefined_ops.h:318
    #5 0x562c761d789c in __find_if<mini_flat_map<shard_id_t, ceph::buffer::v15_2_0::list, signed char>::_iterator<false>, __gnu_cxx::__ops::_Iter_pred<ScrubBackend::match_in_shards(const hobject_t&, auth_selection_t&, inconsistent_obj_wrapper&, std::stringstream&)::<lambda(const std::pair<const shard_id_t, ceph::buffer::v15_2_0::list&>&)> > > /usr/include/c++/15.2.1/bits/stl_algobase.h:2095
    #6 0x562c761d72b2 in find_if<mini_flat_map<shard_id_t, ceph::buffer::v15_2_0::list, signed char>::_iterator<false>, ScrubBackend::match_in_shards(const hobject_t&, auth_selection_t&, inconsistent_obj_wrapper&, std::stringstream&)::<lambda(const std::pair<const shard_id_t, ceph::buffer::v15_2_0::list&>&)> > /usr/include/c++/15.2.1/bits/stl_algo.h:3921
    #7 0x562c761d5f6f in none_of<mini_flat_map<shard_id_t, ceph::buffer::v15_2_0::list, signed char>::_iterator<false>, ScrubBackend::match_in_shards(const hobject_t&, auth_selection_t&, inconsistent_obj_wrapper&, std::stringstream&)::<lambda(const std::pair<const shard_id_t, ceph::buffer::v15_2_0::list&>&)> > /usr/include/c++/15.2.1/bits/stl_algo.h:431
    #8 0x562c761d4a50 in any_of<mini_flat_map<shard_id_t, ceph::buffer::v15_2_0::list, signed char>::_iterator<false>, ScrubBackend::match_in_shards(const hobject_t&, auth_selection_t&, inconsistent_obj_wrapper&, std::stringstream&)::<lambda(const std::pair<const shard_id_t, ceph::buffer::v15_2_0::list&>&)> > /usr/include/c++/15.2.1/bits/stl_algo.h:450
    #9 0x562c761bb84b in ScrubBackend::match_in_shards(hobject_t const&, auth_selection_t&, inconsistent_obj_wrapper&, std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&) /home/kefu/dev/ceph/src/osd/scrubber/scrub_backend.cc:1297
    #10 0x562c761b4282 in ScrubBackend::compare_obj_in_maps[abi:cxx11](hobject_t const&) /home/kefu/dev/ceph/src/osd/scrubber/scrub_backend.cc:941
    #11 0x562c761d44af in operator()<hobject_t> /home/kefu/dev/ceph/src/osd/scrubber/scrub_backend.cc:887
    #12 0x562c761d4836 in for_each<std::_Rb_tree_const_iterator<hobject_t>, ScrubBackend::compare_smaps()::<lambda(const auto:422&)> > /usr/include/c++/15.2.1/bits/stl_algo.h:3798
    #13 0x562c761b3259 in ScrubBackend::compare_smaps() /home/kefu/dev/ceph/src/osd/scrubber/scrub_backend.cc:884
    #14 0x562c761a478d in ScrubBackend::update_authoritative() /home/kefu/dev/ceph/src/osd/scrubber/scrub_backend.cc:315
```

Fix by using bufferlist::length() which tells if the given buffer is
empty instead of converting the buffer content to a string.
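
A minimal sketch of the difference, assuming a simplified buffer type: `Buffer` below is a hypothetical stand-in for `ceph::bufferlist` (the real class has a different interface), and `digest_is_empty` is an illustrative name. The point is only that a length query never touches the bytes, while a `strlen`-based check depends on a terminator that may not exist.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ceph::bufferlist: it stores raw bytes and does
// not guarantee a trailing '\0' after them.
class Buffer {
  std::vector<char> bytes;
public:
  Buffer() = default;
  Buffer(const char* data, std::size_t len) : bytes(data, data + len) {}
  const char* c_str() const { return bytes.data(); }
  std::size_t length() const { return bytes.size(); }
};

// Unsafe pattern from the commit message, shown only as a comment:
//   std::string(digest.c_str()).empty()
// The std::string constructor calls strlen(), which may scan past the end
// of the stored bytes while looking for '\0'.

// Safe check: ask the buffer for its length instead of scanning its content.
bool digest_is_empty(const Buffer& digest) {
  return digest.length() == 0;
}
```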

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
JoshuaGabriel pushed a commit that referenced this pull request Nov 6, 2025
…ives

Add suppression rules for three categories of false positive warnings
encountered during ASan-enabled testing:

1. PyModule_ExecDef memory leaks: ASan incorrectly interprets Python's
   module loading behavior as memory leaks when the interpreter loads
   extension modules.

2. __cxa_throw interception failures: ASan's interceptor cannot properly
   intercept exception handling when libstdc++.so is loaded after the
   ASan shared library, causing CHECK failures.

3. ErasureCodePluginRegistry::load:
   `ceph::ErasureCodePluginRegistry::load()` is known to leak, as we
   don't free the memory allocated by the EC plugins that are
   registered in the `ErasureCodePluginRegistry` singleton. This is a
   known issue, but since the `ErasureCodePluginRegistry` instance is a
   singleton, we can live with it. In this change, we add a rule to
   suppress the leak report from LeakSanitizer. This rule also exists in
   qa/valgrind.supp.

All warnings are confirmed false positives that should be suppressed
to reduce noise in test output.

Example warnings:

```
Direct leak of 3264 byte(s) in 1 object(s) allocated from:
    #0 0x7f6027d20cb5 in malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:67
    #1 0x7f60277557ad  (/usr/lib/libpython3.13.so.1.0+0x1557ad) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #2 0x7f6027756067  (/usr/lib/libpython3.13.so.1.0+0x156067) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #3 0x7f60278471a0  (/usr/lib/libpython3.13.so.1.0+0x2471a0) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #4 0x7f602774d031  (/usr/lib/libpython3.13.so.1.0+0x14d031) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #5 0x7b60234093bb in __Pyx_modinit_type_init_code.constprop.0 /home/kefu/dev/ceph/build/src/pybind/rados/rados.c:82066
    #6 0x7b602340a826 in __pyx_pymod_exec_rados /home/kefu/dev/ceph/build/src/pybind/rados/rados.c:82755
    #7 0x7f6027856777 in PyModule_ExecDef (/usr/lib/libpython3.13.so.1.0+0x256777) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #8 0x7f602785baa3  (/usr/lib/libpython3.13.so.1.0+0x25baa3) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #9 0x7f6027793df2  (/usr/lib/libpython3.13.so.1.0+0x193df2) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #10 0x7f6027777cbe in _PyEval_EvalFrameDefault (/usr/lib/libpython3.13.so.1.0+0x177cbe) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #11 0x7f60277957de  (/usr/lib/libpython3.13.so.1.0+0x1957de) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #12 0x7f60277d11b9 in PyObject_CallMethodObjArgs (/usr/lib/libpython3.13.so.1.0+0x1d11b9) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #13 0x7f60277d0ee4 in PyImport_ImportModuleLevelObject (/usr/lib/libpython3.13.so.1.0+0x1d0ee4) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #14 0x7f6027779c0c in _PyEval_EvalFrameDefault (/usr/lib/libpython3.13.so.1.0+0x179c0c) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #15 0x7f602784e2c8 in PyEval_EvalCode (/usr/lib/libpython3.13.so.1.0+0x24e2c8) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #16 0x7f602788c88b  (/usr/lib/libpython3.13.so.1.0+0x28c88b) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #17 0x7f602788985c  (/usr/lib/libpython3.13.so.1.0+0x28985c) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #18 0x7f6027886f57  (/usr/lib/libpython3.13.so.1.0+0x286f57) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #19 0x7f6027886211  (/usr/lib/libpython3.13.so.1.0+0x286211) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #20 0x7f6027885b82  (/usr/lib/libpython3.13.so.1.0+0x285b82) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #21 0x7f6027883e50 in Py_RunMain (/usr/lib/libpython3.13.so.1.0+0x283e50) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #22 0x7f602783bbea in Py_BytesMain (/usr/lib/libpython3.13.so.1.0+0x23bbea) (BuildId: bea05fc2c8bd66145b159f10dcd810ebe813af39)
    #23 0x7f6027227674  (/usr/lib/libc.so.6+0x27674) (BuildId: 4fe011c94a88e8aeb6f2201b9eb369f42b4a1e9e)
    #24 0x7f6027227728 in __libc_start_main (/usr/lib/libc.so.6+0x27728) (BuildId: 4fe011c94a88e8aeb6f2201b9eb369f42b4a1e9e)
    #25 0x55dae17e6044 in _start (/usr/bin/python3.13+0x1044) (BuildId: 8c0dc848f5b978c56ebeb07255bb332b4b37ae4e)
```

```
AddressSanitizer: CHECK failed: asan_interceptors.cpp:335 "((__interception::real___cxa_throw)) != (0)" (0x0, 0x0) (tid=3246455)
    #0 0x7f345ea81979 in CheckUnwind ../../../../src/libsanitizer/asan/asan_rtl.cpp:69
    #1 0x7f345eaa790d in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7f345e9e1d54 in __interceptor___cxa_throw ../../../../src/libsanitizer/asan/asan_interceptors.cpp:335
    #3 0x7f345e9e1d54 in __interceptor___cxa_throw ../../../../src/libsanitizer/asan/asan_interceptors.cpp:334
    #4 0x7f3458623def in void boost::throw_exception<boost::bad_lexical_cast>(boost::bad_lexical_cast const&) /opt/ceph/include/boost/throw_exception.hpp:165
    #5 0x7f345997ad3b in void boost::conversion::detail::throw_bad_cast<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long>() /opt/ceph/include/boost/lexical_cast/bad_lexical_cast.hpp:93
    #6 0x7f3459979d35 in unsigned long boost::lexical_cast<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /opt/ceph/include/boost/lexical_cast.hpp:43
```

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
JoshuaGabriel pushed a commit that referenced this pull request Dec 11, 2025
The static std::map max_prio_map was defined in the osd_types.h header
file, causing every translation unit that included this header to get
its own copy of the variable. This led to One Definition Rule (ODR)
violations where multiple instances of the same variable existed at
runtime.

During program cleanup, destructors for these multiple instances would
attempt to free the same memory regions, resulting in segmentation
faults in tcmalloc/memory allocator as seen with ceph-dencoder.

This issue surfaced after a not-yet-merged change which converts erasure_code
and json_spirit to OBJECT libraries. Before that change, these were
STATIC libraries that were linked via target_link_libraries. The
incorrect linkage meant their object files (and thus their copies of
max_prio_map) were kept separate and didn't conflict at runtime.

After converting to OBJECT libraries and properly incorporating them
into libceph-common.so (commit 8b0e3fb2c23), the multiple copies of
max_prio_map from different translation units all ended up in the same
shared library, exposing the ODR violation. During program exit, the
dynamic linker attempted to run destructors for all instances, leading
to double-free crashes.

Fix by moving the map into a static helper function in PeeringState.cc
(the only file that uses it). The map is now a function-local static
const variable, ensuring a single instance that is properly initialized
and destructed.
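
The shape of this fix can be sketched as follows. The keys and values are made up for illustration (the real contents of max_prio_map are not shown in this message), and `max_prio_for` is a hypothetical lookup helper; only the function-local static pattern is the point.

```cpp
#include <cassert>
#include <map>

// Lives in a single .cc file instead of a header, so only one translation
// unit defines the map. Keys and values here are hypothetical.
static const std::map<int, int>& max_prio_map() {
  // Initialized once, on first call; initialization of function-local
  // statics is thread-safe since C++11.
  static const std::map<int, int> prio_map = {
    {1, 10},
    {2, 20},
  };
  return prio_map;
}

// Hypothetical helper: look up the cap for a priority class, falling back
// to a caller-supplied default when the class is unknown.
static int max_prio_for(int prio_class, int fallback) {
  const auto& m = max_prio_map();
  auto it = m.find(prio_class);
  return it == m.end() ? fallback : it->second;
}
```

Every caller gets a reference to the same object, so there is exactly one map to construct and one to destroy at exit.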

Backtrace before fix:
```
    #0  0x00007ffff7dbb1a0 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int) () from /lib/x86_64-linux-gnu/libtcmalloc.so.4
    #1  0x00007ffff7dbb57f in tcmalloc::ThreadCache::Scavenge() () from /lib/x86_64-linux-gnu/libtcmalloc.so.4
    #2  0x00007ffff6bc8aa2 in std::__new_allocator<std::_Rb_tree_node<std::pair<int const, int> > >::deallocate (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890, __n=1)
    #3  0x00007ffff6bc89f9 in std::allocator<std::_Rb_tree_node<std::pair<int const, int> > >::deallocate (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890, __n=1)
    #4  std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<int const, int> > > >::deallocate (__a=..., __p=0x555555f43890, __n=1)
    #5  std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_put_node (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890)
    #6  0x00007ffff6bc892e in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_drop_node (this=0x7ffff7d48f78 <max_prio_map>, __p=0x555555f43890)
    #7  0x00007ffff6bc886e in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_erase (this=0x7ffff7d48f78 <max_prio_map>, __x=0x555555f43890)
    #8  0x00007ffff6bc8854 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_erase (this=0x7ffff7d48f78 <max_prio_map>, __x=0x555555f43cb0)
    #9  0x00007ffff6bc8854 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_erase (this=0x7ffff7d48f78 <max_prio_map>, __x=0x555555f43ad0)
    #10 0x00007ffff6bc8805 in std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::~_Rb_tree (this=0x7ffff7d48f78 <max_prio_map>)
    #11 0x00007ffff6bc7345 in std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >::~map (this=0x7ffff7d48f78 <max_prio_map>)
    #12 0x00007ffff484bd51 in __cxa_finalize (d=0x7ffff7d3f440) at ./stdlib/cxa_finalize.c:97
    #13 0x00007ffff6af9487 in __do_global_dtors_aux () from /home/kefu/dev/ceph/build/lib/libceph-common.so.2
    #14 0x00007ffff7fbfd20 in ?? ()
    #15 0x00007ffff7fc8fc2 in _dl_call_fini (closure_map=0x7fffffffd0f0, closure_map@entry=0x7ffff7fbfd20) at ./elf/dl-call_fini.c:43
    #16 0x00007ffff7fcbe72 in _dl_fini () at ./elf/dl-fini.c:120
    #17 0x00007ffff484c291 in __run_exit_handlers (status=0, listp=0x7ffff49f1680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:118
    #18 0x00007ffff484c35a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:148
    #19 0x00007ffff4833caf in __libc_start_call_main (main=main@entry=0x55555556cd90 <main(int, char const**)>, argc=argc@entry=2, argv=argv@entry=0x7fffffffd488) at ../sysdeps/nptl/libc_start_call_main.h:74
    #20 0x00007ffff4833d65 in __libc_start_main_impl (main=0x55555556cd90 <main(int, char const**)>, argc=2, argv=0x7fffffffd488, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd478) at ../csu/libc-start.c:360
    #21 0x00005555555695e1 in _start ()
```

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
JoshuaGabriel pushed a commit that referenced this pull request Dec 11, 2025
See comment:
```
  //TODO: should be changed to return future<> once all calls
  //	  to refresh are through co_await. We return LBAMapping
  //	  for now to avoid mandating the callers to make sure
  //	  the life of the lba mapping survives the refresh.
```

For now, introduce co_refresh and mark the existing refresh as
deprecated. Follow-up work will audit all the existing users of
refresh and move them to the new method. This change is not trivial,
so I prefer to follow up on this in a separate PR.

This should help avoid UAR (use-after-return) at suspension points:
```
==103588==ERROR: AddressSanitizer: stack-use-after-return on address 0xffff80197e90 at pc 0xaaaacb941b24 bp 0xffff7e48dd80 sp 0xffff7e48dd78
READ of size 8 at 0xffff80197e90 thread T1
    #0 0xaaaacb941b20 in boost::intrusive_ptr<crimson::os::seastore::LBACursor>::swap(boost::intrusive_ptr<crimson::os::seastore::LBACursor>&) /opt/ceph/include/boost/smart_ptr/intrusive_ptr.hpp:172:18
    #1 0xaaaacb941998 in boost::intrusive_ptr<crimson::os::seastore::LBACursor>::operator=(boost::intrusive_ptr<crimson::os::seastore::LBACursor>&&) /opt/ceph/include/boost/smart_ptr/intrusive_ptr.hpp:93:61
    #2 0xaaaacb933758 in crimson::os::seastore::LBAMapping::operator=(crimson::os::seastore::LBAMapping&&) /ceph/src/crimson/os/seastore/lba_mapping.h:46:48
    #3 0xaaaacde2fa54 in ... crimson::os::seastore::LBAMapping&&, std::array<crimson::os::seastore::LBAManager::remap_entry_t, 1ul>) (.resume) /ceph/src/crimson/os/seastore/transaction_manager.h:1282:11
```

The deprecation is commented out because make check would otherwise fail.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>