Conversation

Contributor

@CarloMariaProietti CarloMariaProietti commented Dec 11, 2025

Fix #1492
The idea is the following:
ValueColumnInternal is an interface for statistic values, which keeps them from being exposed publicly.
Implementations of ValueColumnInternal hold the actual cache.

Each stat (for the moment only max) needs two caches, because computing it may give different results depending on the skipNaN boolean parameter.

I implemented the solution by overloading aggregateSingleColumn. The overload wraps the original aggregateSingleColumn so that the caches can be used.

For the moment there is only max, but it would be easy to do the same for min, sum, mean and median.
Something similar could be done for percentile and std.
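
A minimal sketch of the "two caches per stat" idea, for illustration only. CachedStat, plainMax and CachingColumn are hypothetical names, not the PR's actual classes, and plainMax merely stands in for the uncached aggregateSingleColumn path:

```kotlin
// Hypothetical holder with one cache slot per skipNaN mode (not the PR's WrappedStatistic).
class CachedStat<R> {
    private var computedSkippingNaN = false
    private var computedNotSkippingNaN = false
    private var valueSkippingNaN: R? = null
    private var valueNotSkippingNaN: R? = null

    // Returns the cached value for this skipNaN mode, running compute at most once per mode.
    fun getOrCompute(skipNaN: Boolean, compute: () -> R?): R? {
        if (skipNaN) {
            if (!computedSkippingNaN) {
                valueSkippingNaN = compute()
                computedSkippingNaN = true
            }
            return valueSkippingNaN
        }
        if (!computedNotSkippingNaN) {
            valueNotSkippingNaN = compute()
            computedNotSkippingNaN = true
        }
        return valueNotSkippingNaN
    }
}

// Uncached max, standing in for the original aggregation.
// maxOrNull on doubles yields NaN when any element is NaN.
fun plainMax(values: List<Double>, skipNaN: Boolean): Double? {
    val candidates = if (skipNaN) values.filterNot { it.isNaN() } else values
    return candidates.maxOrNull()
}

// Hypothetical column wrapper: the cached overload delegates to the uncached one.
class CachingColumn(private val values: List<Double>) {
    private val maxCache = CachedStat<Double>()

    // Counts how often the uncached path actually ran.
    var computations = 0
        private set

    fun maxOrNull(skipNaN: Boolean = false): Double? =
        maxCache.getOrCompute(skipNaN) {
            computations++
            plainMax(values, skipNaN)
        }
}
```

The separate "computed" flags are what allow a null result (for example on an empty column) to be cached as well.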

import kotlin.reflect.KType
import kotlin.reflect.full.withNullability

public class WrappedStatistic(
Collaborator

this class should not be public, should it?


public fun <T : Comparable<T>> DataColumn<T?>.maxOrNull(skipNaN: Boolean = skipNaNDefault): T? =
Aggregators.max<T>(skipNaN).aggregateSingleColumn(this)
if (this is ValueColumnInternal<*>) {
Collaborator

I'm not so fond of this solution, as it requires a lot of refactoring in other functions, plus it does not work when you write df.max { myCol }, as I mentioned in #1492 (comment)

Collaborator

@Jolanrensen Jolanrensen Dec 12, 2025

Instead, I'd do this check inside the original aggregateSingleColumn(). Each Aggregator has a name which you could use to query the ValueColumnInternal for the right WrappedStatistic if they are stored in a Map<String, WrappedStatistic> in ValueColumnImpl.
Though I suppose each Aggregator will also need to store any other provided arguments like skipNaN: Boolean and percentile: Double when needed... In a Map<String, Any?> maybe?

That way we could store our "Statistics Cache" in ValueColumnImpl as a

Map<String, Map<Map<String, Any?>, Any?>>

so the result cache could look like:

{
    "max" : {
        { "skipNaN": true } : 312.4
    },
    "min" : {},
    "std" : {
        { "std": 0.9, "skipNaN": false } : Double.NaN,
        { "std": 0.9, "skipNaN": true } : 12.3
    }
}

The challenge may lie in doing this neatly ;P
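
For illustration, a rough sketch of a cache with the shape Map<String, Map<Map<String, Any?>, Any?>>. StatisticsCache and its put/get functions are hypothetical names, and the "percentile" parameter key is an assumption; nothing here is taken from the actual ValueColumnImpl:

```kotlin
// Hypothetical cache: aggregator name -> (parameter map -> cached result).
class StatisticsCache {
    private val cache = mutableMapOf<String, MutableMap<Map<String, Any?>, Any?>>()

    // Stores a result under the aggregator name and its parameters.
    fun put(aggregator: String, params: Map<String, Any?>, result: Any?) {
        val perAggregator = cache.getOrPut(aggregator) { mutableMapOf() }
        perAggregator[params] = result
    }

    // Returns the stored result, or null when nothing is stored for this key.
    fun get(aggregator: String, params: Map<String, Any?>): Any? =
        cache[aggregator]?.get(params)
}
```

Because Kotlin maps compare by their entries, a parameter map built in a different order still hits the same cache entry:

```kotlin
val cache = StatisticsCache()
cache.put("max", mapOf("skipNaN" to true), 312.4)
cache.put("percentile", mapOf("percentile" to 0.9, "skipNaN" to true), 12.3)

val sameKey = mapOf("skipNaN" to true, "percentile" to 0.9)
val hit = cache.get("percentile", sameKey)
```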

Contributor Author

@CarloMariaProietti CarloMariaProietti Dec 16, 2025

Thank you for reviewing!

Imo using Map<String, Map<Map<String, Any?>, Any?>> introduces a problem:
querying this structure does not tell you whether the statistic was already computed.
Computing the stat using aggregateSequence can yield null, so getting null from a query does not tell me whether the stat has been computed yet.

Maybe it could be Map<String, Map<Map<String, Any?>, WrappedStatistic>>,
where WrappedStatistic has two fields: wasComputed: Boolean and actualStatistic: Any?

Collaborator

If you're gonna put the statistic result in a wrapped class anyway, you can also make the value type of the map nullable :) Like

value class StatisticResult(val value: Any?) and Map<..., StatisticResult?>. Then if map[key] == null, we know it has not been calculated, else we can take map[key].value. Saves you a lot of boolean-juggling :)
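
A small sketch of how the wrapper separates "not computed yet" from "computed, and the result is null". ResultCache, lookup and store are hypothetical names. The sketch uses a plain class to stay self-contained; the value class suggested above would need the @JvmInline annotation on the JVM:

```kotlin
// Wrapper around a computed statistic; the wrapped value itself may be null.
class StatisticResult(val value: Any?)

// Hypothetical per-aggregator cache keyed by the aggregator's parameters.
class ResultCache {
    private val results = mutableMapOf<Map<String, Any?>, StatisticResult>()

    // null means "never computed"; a non-null wrapper means "computed", even if value is null.
    fun lookup(params: Map<String, Any?>): StatisticResult? = results[params]

    fun store(params: Map<String, Any?>, value: Any?) {
        results[params] = StatisticResult(value)
    }
}
```

With this shape a caller only needs one null check on the lookup result, and no separate wasComputed flag.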

import kotlin.reflect.KType
import kotlin.reflect.full.withNullability

public class WrappedStatistic(
Collaborator

this class should not be public, should it?

Also, I think if you make the other a var, this can be a data class with vars. It's a bit more Kotlin-like :)

public var wasComputedNotSkippingNaN: Boolean = false,
public var statisticComputedSkippingNaN: Any? = null,
public var statisticComputedNotSkippingNaN: Any? = null,
)
Collaborator

what about std and percentile that take extra arguments?

Contributor Author

CarloMariaProietti commented Dec 18, 2025

I have done some refactoring at the aggregator level. It should now work as described in #1636 (comment), and min now uses the cache as well.

ParameterValue is a class that overrides equals and hashCode, so that I can correctly query the statistics cache.
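
To illustrate why equals and hashCode matter for cache lookups, here is a sketch with made-up classes. ParameterValueSketch and IdentityParam are hypothetical and do not reproduce the PR's ParameterValue, whose fields are not shown in this thread:

```kotlin
// Hypothetical parameter wrapper with structural equality, usable inside a map key.
class ParameterValueSketch(val value: Any?) {
    override fun equals(other: Any?): Boolean =
        other is ParameterValueSketch && value == other.value

    override fun hashCode(): Int = value?.hashCode() ?: 0
}

// Same shape without the overrides: instances compare by identity only.
class IdentityParam(val value: Any?)

// Builds a one-entry cache and looks it up with a freshly built, equal key.
fun lookupWithFreshKey(): Any? {
    val cache = mutableMapOf<Map<String, ParameterValueSketch>, Any?>()
    cache[mapOf("skipNaN" to ParameterValueSketch(true))] = 312.4
    return cache[mapOf("skipNaN" to ParameterValueSketch(true))]
}
```

Without the overrides, two wrappers holding the same value are different keys, so a lookup with a freshly built key would miss the cache. A data class would generate equivalent equals and hashCode implementations.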

@CarloMariaProietti
Contributor Author

I have closed and reopened this PR because I thought the 'Files changed' section was 'bugged'. I was wrong :)

Development

Successfully merging this pull request may close these issues.

Lazy statistics for columns

2 participants