Link clustering v4 #90

sporksmith · 2023-02-21T16:37:03Z

An updated version of #84. Keeping the former around for the moment for reference, but this one is rebased on the current tornettools head, and changed to cluster /8's instead of /16's.
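For readers unfamiliar with the prefix notation, here is a minimal sketch of how widening from /16 to /8 merges more addresses into one cluster. The `cluster_key` helper is hypothetical and is not the key function used in this PR (which, per the commit list, also incorporates country); it only illustrates the prefix arithmetic using the standard `ipaddress` module.

```python
import ipaddress


def cluster_key(addr, prefix_len):
    """Hypothetical helper: the network prefix an IPv4 address falls into."""
    # strict=False lets us pass a host address and have the host bits masked off
    network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    return str(network)


a = "203.0.113.7"
b = "203.45.1.1"

# Under /16 the two addresses land in different clusters...
same_16 = cluster_key(a, 16) == cluster_key(b, 16)
# ...but under /8 they share a cluster.
same_8 = cluster_key(a, 8) == cluster_key(b, 8)
```

Fewer, larger clusters mean each "cluster representative relay" stands in for more real relays.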

robgjansen

Found a bug while looking, so I'm posting it now so I don't forget it during the regular review.

robgjansen · 2023-03-06T16:03:43Z

tornettools/stage.py

+    for relay in relays.values():
+        for fingerprint in relay.fingerprints:
+            cluster_bandwidths = []
+            bandwidth = bandwidths.get(fingerprint)
+            if bandwidth is not None:
+                cluster_bandwidths.append({
+                    'bandwidth_capacity': int(bandwidth.max_obs_bw),
+                    'bandwidth_rate': int(median(bandwidth.bw_rates)) if len(bandwidth.bw_rates) > 0 else 0,
+                    'bandwidth_burst': int(median(bandwidth.bw_bursts)) if len(bandwidth.bw_bursts) > 0 else 0,
+                })
+            if len(cluster_bandwidths) > 0:
+                relay.bandwidth_capacity = max(b['bandwidth_capacity'] for b in cluster_bandwidths)
+                relay.bandwidth_rate = max(b['bandwidth_rate'] for b in cluster_bandwidths)
+                relay.bandwidth_burst = max(b['bandwidth_burst'] for b in cluster_bandwidths)
+                found_bandwidths += 1
+            else:
+                relay.bandwidth_capacity = 0
+                relay.bandwidth_rate = 0
+                relay.bandwidth_burst = 0


This looks wrong to me, because we reset the cluster bandwidths list as we iterate items in the cluster. The "cluster representative relay" will always get the bandwidth info of the last iterated relay in the cluster, rather than computing the bandwidth as the max over all relays in the cluster.

I think we want this instead:

diff --git a/tornettools/stage.py b/tornettools/stage.py
index e11fe3c..298cdb1 100644
--- a/tornettools/stage.py
+++ b/tornettools/stage.py
@@ -136,9 +136,11 @@ def stage_relays(args):
     bandwidths = bandwidths_from_serverdescs(serverdescs)
     found_bandwidths = 0
+    # each 'relay' may actually be many relays clustered into one
     for relay in relays.values():
+        # first we want to collect all bandwidth info we have from everyone in the cluster
+        cluster_bandwidths = []
         for fingerprint in relay.fingerprints:
-            cluster_bandwidths = []
             bandwidth = bandwidths.get(fingerprint)
             if bandwidth is not None:
                 cluster_bandwidths.append({
@@ -146,15 +148,16 @@ def stage_relays(args):
                     'bandwidth_rate': int(median(bandwidth.bw_rates)) if len(bandwidth.bw_rates) > 0 else 0,
                     'bandwidth_burst': int(median(bandwidth.bw_bursts)) if len(bandwidth.bw_bursts) > 0 else 0,
                 })
-            if len(cluster_bandwidths) > 0:
-                relay.bandwidth_capacity = max(b['bandwidth_capacity'] for b in cluster_bandwidths)
-                relay.bandwidth_rate = max(b['bandwidth_rate'] for b in cluster_bandwidths)
-                relay.bandwidth_burst = max(b['bandwidth_burst'] for b in cluster_bandwidths)
-                found_bandwidths += 1
-            else:
-                relay.bandwidth_capacity = 0
-                relay.bandwidth_rate = 0
-                relay.bandwidth_burst = 0
+        # now flatten the bandwidth info for the cluster down into a single bandwidth for the cluster representative
+        if len(cluster_bandwidths) > 0:
+            relay.bandwidth_capacity = max(b['bandwidth_capacity'] for b in cluster_bandwidths)
+            relay.bandwidth_rate = max(b['bandwidth_rate'] for b in cluster_bandwidths)
+            relay.bandwidth_burst = max(b['bandwidth_burst'] for b in cluster_bandwidths)
+            found_bandwidths += 1
+        else:
+            relay.bandwidth_capacity = 0
+            relay.bandwidth_rate = 0
+            relay.bandwidth_burst = 0
     logging.info("We found bandwidth information for {} of {} relays".format(found_bandwidths, len(relays)))
     # for (k, v) in sorted(relays.items(), key=lambda kv: kv[1].bandwidths.max_obs_bw):
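To make the difference concrete, here is a minimal standalone sketch of the two loop shapes. It is not the tornettools code: `flatten_buggy`, `flatten_fixed`, and the sample data are made up for illustration, and only the capacity field is tracked to keep it short.

```python
from types import SimpleNamespace


def flatten_buggy(fingerprints, bandwidths):
    """Mimics the reviewed loop: list reset and flattening happen per fingerprint."""
    capacity = 0
    found = 0
    for fingerprint in fingerprints:
        cluster = []  # reset on every iteration, so earlier members are forgotten
        bw = bandwidths.get(fingerprint)
        if bw is not None:
            cluster.append(int(bw.max_obs_bw))
        if cluster:
            capacity = max(cluster)  # max over at most one element
            found += 1               # counted once per fingerprint, not per cluster
        else:
            capacity = 0             # a member without data wipes the result
    return capacity, found


def flatten_fixed(fingerprints, bandwidths):
    """Mimics the proposed loop: collect over the whole cluster, then flatten once."""
    cluster = []
    for fingerprint in fingerprints:
        bw = bandwidths.get(fingerprint)
        if bw is not None:
            cluster.append(int(bw.max_obs_bw))
    if cluster:
        return max(cluster), 1
    return 0, 0


# Hypothetical cluster of three relays; "C" has no server descriptor.
bandwidths = {
    "A": SimpleNamespace(max_obs_bw=900),
    "B": SimpleNamespace(max_obs_bw=100),
}
fingerprints = ["A", "B", "C"]

buggy = flatten_buggy(fingerprints, bandwidths)  # (0, 2)
fixed = flatten_fixed(fingerprints, bandwidths)  # (900, 1)
```

In this example the original loop also increments the found counter once per fingerprint rather than once per cluster, so the logged "found bandwidth information for N of M relays" count would be inflated as well.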

robgjansen · 2023-03-08T17:03:01Z

Pushed fix, and experiments with the fixed version:
#94

sporksmith added 7 commits February 21, 2023 10:35

combine_parsed_consensus_results: split into parts

efd979c

combine_parsed_serverdesc_results -> bandwidths_from_serverdescs

c83e70a

do clustering

7761064

Make family calculations a bit more robust

c232e38

Incorporate country in cluster keys

b2a5b41

record nicknames for debugging

8778984

cluster /8's instead of /16's

03a0d5f

robgjansen requested changes Mar 6, 2023

robgjansen closed this Mar 8, 2023
