Skip to content

Conversation

@yvyw
Copy link

@yvyw yvyw commented Jan 8, 2026

Fix #636. Should fix the error "Error occurred - packet doesn't match either side of the connection!!", if it is caused by hash collision.

@yvyw yvyw requested a review from seladb as a code owner January 8, 2026 13:09
@yvyw yvyw force-pushed the reassembly_fnv_hash_collision branch from e10839a to 34b0323 Compare January 8, 2026 16:52
@seladb
Copy link
Owner

seladb commented Jan 9, 2026

@yvyw thanks for working on this PR!
It contains quite a lot of changes, can you please add some description on what these changes are? Do you think it makese sense to break it down to smaller PRs?
Also - please notice the CI fails...

Should fix the error:
"Error occurred - packet doesn't match either side of the connection!!"
@yvyw yvyw force-pushed the reassembly_fnv_hash_collision branch from 34b0323 to d25bacd Compare January 9, 2026 09:16
@yvyw
Copy link
Author

yvyw commented Jan 9, 2026

Hi, @seladb! Thank you for PcapPlusPlus, it's very useful!

The changes are intended to replace flowKey with the real connection object in order to eliminate collisions.
It could be implemented by replacing the flowKey with an object that contains all connection information and fits the requirements for a key in an unordered_map.
Connection objects are represented by the Packet and the ConnectionData.
Objects that can be used as a key in an unordered_map should be hashable and equality comparable. Also a key object is saved in an unordered_map on addition along with a value as a std::pair.
So, the ConnectionHashable class containing only a raw pointer was added to be used as a key in unordered_map.
The IHashableConnectionInfo interface was introduced to calculate a hash and compare any objects that can provide connection information.
The IHashableConnectionInfo adapter for a Packet and the interface implementation for a ConnectionData were added.
The hash5Tuple was adapted to calculate the hash of objects that implement the IHashableConnectionInfo interface.
The fnvHash and hash5Tuple functions were updated to calculate a hash with the size of the size_t type, as required by std::hash.
flowKey replaced with ConnectionHashable.
The tests and examples were updated to work with the changes.

Backward compatibility is not preserved because the flowKey was declared as uint32_t and the TcpReassembly::getConnectionInformation() returns a raw reference to an std::map<uint32_t,ConnectionData>.
Maybe it's a good idea for modifiability to replace flowKey type with some using and raw references to internal structures with some functions returning iterators or references.

About breaking into a smaller PR's. I think I can, but it will take time. Also, PRs would look strange without knowing the final purpose. Maybe it would be better to leave it as it is or split it into commits without creating PRs. What do you think?

Regarding CI. The project passes the tests on my machine. I will fix the errors as they appear in CI.

@seladb
Copy link
Owner

seladb commented Jan 10, 2026

@yvyw replacing uint32_t with an object as the key of the map will make the map calculations slower. It will also increase the memory consumed by the map because the key object is bigger.

Sicne it's used in TCP Reassembly it should support high throughput and scale, and I think this implementation might cause degradation of both. Please let me know what you think.

@yvyw
Copy link
Author

yvyw commented Jan 10, 2026

@seladb, your concerns are valid. Here's what I think:

On calculations slowdown.
A hash is calculated on search, insertion (for the object being searched/inserted) and rehashing (for all objects in the map).
Therefore, it should not slow down once the connection is added to the m_ConnectionList.
At new connection creation, the hash would be calculated twice, first at the search stage and again at the insertion stage. Although it's not as efficient as a single calculation, the speed won't deteriorate significantly unless there are a large number of short connections.
And all hashes will be recalculated when rehashing occurs. Insertion in unordered_map should still have amortized complexity of O(1), but it could introduce unnecessary delays. Although it would be better to benchmark this, but we can act preemptively by adding a method to IHashableConnectionInfo that saves the hash in the object for the future use as we are currently doing with the flowKey.
There is also a difference in the hash size for 64-bit systems. Although it's better to benchmark too, I think it would not add significant slowdown since the machine natively supports 64-bit registers.
Should I add caching for the hash?

On memory consumption.
The ConnectionHashable is the same size as a pointer. Therefore, it is the same size as the flowKey on 32-bit systems and 32 bits larger on 64-bit systems. And an additional pointer to the vptr was added to the ConnectionData.
This patch also removes the duplication of ConnectionData in TcpReassemblyData, saving 104–8 bytes on x64 systems. And removes key from the m_ConnectionList, which saves an additional 4 bytes.
So net memory consumption will even decrease after the patch.
Link to play around with the structure packing: https://godbolt.org/z/MzYKbPfMs
Furthermore, memory consumption could be reduced by replacing the IPAddress structure with a std::variant<IPv4Address,IPv6Address> and replacing the timeval in the ConnectionData with an API, that converts time_point to timeval. Even further m_ConnectionInfo could be defined as a std::forward_list, saving an additional 8 bytes per connection.

@Dimi1010
Copy link
Collaborator

Dimi1010 commented Jan 10, 2026

Furthermore, memory consumption could be reduced by replacing the IPAddress structure with a std::variant<IPv4Address,IPv6Address>

I have thought about doing this, unfortunately std::variant is a C++17 construct, and we currently must support C++14.
We can reimplement it or use a backport library, but I don't think it would be worth it, compared to refactoring it when we bump the minimum supported standard.

@yvyw
Copy link
Author

yvyw commented Jan 10, 2026

We can reimplement it or use a backport library, but I don't think it would be worth it, compared to refactoring it when we bump the minimum supported standard.

I agree. I think this is premature optimization according to Knuth, and it's better to just wait for C++17. In the comment, I wanted to say that there is some more room for memory optimization if it's required. I think projects with extremely high loads will encounter problems anyway and end up tailoring their code with a profiler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants