How do we merge clusters over different training splits?

One of the main reasons we couldn't use s2spy during the Lorenz workshop was that the labels of the precursor regions weren't aligned over the training splits. 
When using RGDR to identify precursor regions of interest, we follow this procedure:
1. Find correlating grid cells ([rgdr.get_correlation](https://github.com/AI4S2S/s2spy/blob/7e228d1e10cef6eac65dde91532b901c7a075145/s2spy/rgdr/rgdr.py#L357)). This gives us the correlation value and the p-value.
2. Use scikit-learn's DBSCAN ([rgdr.get_clusters](https://github.com/AI4S2S/s2spy/blob/7e228d1e10cef6eac65dde91532b901c7a075145/s2spy/rgdr/rgdr.py#L377)) to identify clusters of significantly correlating grid cells that lie in each other's vicinity. We can set the alpha value and tweak DBSCAN parameters such as distance_eps and min_area_km2.
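
To make the discussion concrete, here is a minimal sketch of those two steps. It is not the actual s2spy code: the helper name `cluster_significant_cells`, the `eps_km` argument, and the flat 1-D input arrays are assumptions for illustration, and the minimum-area filter is left out. Only `sklearn.cluster.DBSCAN` and its haversine metric are real library features.

```python
import numpy as np
from sklearn.cluster import DBSCAN

RADIUS_EARTH_KM = 6371.0  # mean Earth radius, used to convert km to radians


def cluster_significant_cells(lat, lon, corr, p_val, alpha=0.05, eps_km=600.0):
    """Label significantly correlating grid cells (hypothetical helper).

    lat, lon, corr and p_val are 1-D arrays with one entry per grid cell.
    Returns one integer label per cell: 0 for non-significant cells,
    positive labels for positively correlating clusters and negative
    labels for negatively correlating clusters.
    """
    lat, lon, corr, p_val = map(np.asarray, (lat, lon, corr, p_val))
    labels = np.zeros(corr.shape, dtype=int)
    for sign in (1, -1):
        # Cluster positive and negative correlations separately.
        mask = (p_val < alpha) & (np.sign(corr) == sign)
        if not mask.any():
            continue
        # The haversine metric expects [lat, lon] pairs in radians.
        coords = np.radians(np.column_stack([lat[mask], lon[mask]]))
        db = DBSCAN(eps=eps_km / RADIUS_EARTH_KM, min_samples=1, metric="haversine")
        # With min_samples=1 there is no noise, so labels start at 0; shift to 1.
        labels[mask] = sign * (db.fit_predict(coords) + 1)
    return labels
```

Note that the label numbers only reflect the order in which DBSCAN encounters the cells, so the same physical region can receive a different number in each training split. That is exactly the alignment problem described here.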

Until now, we haven't thought about a way to align the areas across training splits. The example below shows that matching these areas is not trivial. Here I used 4 splits over the data in /tests to look at the clustered regions in each split. I adapted the plotting function in rgdr slightly to get the same colorbar for every figure.
Note:
- The resolution is quite low; a higher resolution might decrease the ambiguity.
- The splitting is not really 'random': we miss 1980-1989 in the first plot because it is the test data, 1990-1999 in the second plot, and so on. With fewer and more randomly chosen test years, the clusters found in the different splits can be expected to be more similar.
![image](https://user-images.githubusercontent.com/82503135/190191740-6e314517-425e-462c-81ae-490e9e8879ff.png)
![image](https://user-images.githubusercontent.com/82503135/190191795-301c9e1b-b35f-4741-9b89-4df1275532e5.png)

We could come up with some algorithm that mimics what we would identify as one cluster by eye. There are some things to consider:
- What do we use as a rule to determine whether areas in different splits are the same area? We have discussed distance-based rules and a rule based on overlapping areas, but I doubt whether either of them would work in the example above.
- The parameters of DBSCAN matter a lot for identifying the clusters.
- How do we communicate the decisions we make to the user?
- How much flexibility do we want to give the user? Do we go for one implementation that we know works most of the time, or do we let the user change clusters if the result is not satisfactory? I know Sem sometimes merges clusters that he knows, from expert knowledge, belong together.
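
As a starting point for discussion, the overlap-based rule mentioned above could be sketched as follows. This is only one possible approach and not an existing s2spy feature: the function names, the `min_overlap` threshold, and the representation of a cluster as a set of (lat, lon) grid cells are all assumptions. It matches clusters one-to-one against a reference split using the Jaccard overlap and gives unmatched clusters a fresh label.

```python
def jaccard(a, b):
    """Overlap of two sets of grid cells: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_labels(reference, split, min_overlap=0.3):
    """Map each cluster label in `split` onto a label of `reference`.

    Both arguments are dicts {label: set of (lat, lon) grid cells}.
    Clusters that do not overlap enough with any reference cluster
    get a new label that is not used in the reference.
    """
    # Score all pairs, then match greedily from best to worst overlap.
    pairs = sorted(
        (
            (jaccard(ref_cells, cells), ref_label, label)
            for ref_label, ref_cells in reference.items()
            for label, cells in split.items()
        ),
        reverse=True,
    )
    mapping, used_refs = {}, set()
    for score, ref_label, label in pairs:
        if score < min_overlap:
            break  # all remaining pairs overlap even less
        if label not in mapping and ref_label not in used_refs:
            mapping[label] = ref_label
            used_refs.add(ref_label)
    # Unmatched clusters get fresh labels above the highest reference label.
    next_label = max(reference, default=0) + 1
    for label in sorted(split):
        if label not in mapping:
            mapping[label] = next_label
            next_label += 1
    return mapping
```

Because the matching is one-to-one, this sketch cannot express that one region in the reference split corresponds to two smaller clusters in another split, which is likely the hard case in the figures above. The result also depends on which split is chosen as the reference, so the decision would need to be communicated to the user.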
