Skip to content

Add tests priorization to prevent testing all formats#155

Draft
Pierlou wants to merge 11 commits intomainfrom
refactor/linked-checks
Draft

Add tests priorization to prevent testing all formats#155
Pierlou wants to merge 11 commits intomainfrom
refactor/linked-checks

Conversation

@Pierlou
Copy link
Collaborator

@Pierlou Pierlou commented Aug 28, 2025

Instead of testing all formats for all columns, we introduce a parent-child relationship between tests, for instance: float > latitude_wgs > latitude_wgs_fr_metropole, with the PARENT attribute in the children's __init__.py files. This is then used as follows:

  • for each column, perform the tests that are not parent of another test
  • for each of these tests then go up to the parent and:
    • if the child's score is 0, perform the parent test
    • if the child's score is > 0, set the parent score to the child's score (because the label/header testing can prove the field testing wrong later on, we can't just leave the parent to 0)
    • and so on for all parents

This way if a column is latitude_wgs_fr_metropole we only perform a float test on the full column once (instead of three times).
This is not perfect: setting the child's score as the parent score can be misleading (there are many float numbers that are invalid latitudes), but it cuts much time and processing. A safer approach could be to start with parent tests, and only perform the children if the parent is successful

@Pierlou Pierlou requested review from bolinocroustibat and maudetes and removed request for maudetes August 28, 2025 15:26
Copy link
Contributor

@bolinocroustibat bolinocroustibat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

For the future, as a proposal, we could refactor in OOP, here are some organization ideas:

A parent class TestColumnType

  • with a class method using __subclasses__() to get all the childs
  • with a class method to get all the parents
    Something like :
class TestColumnType:

    ...

    @classmethod
    def get_all_parents(cls):
        return cls.__bases__ # Or we could return a sorted dict for parents tree?

    @classmethod
    def get_all_childs(cls):
        return set(cls.__subclasses__()).union(
            [s for c in cls.__subclasses__() for s in c.get_all_childs()])
      
     def value_is(val: Any) -> bool:
        return isinstance(val, str)

And children classes like:

class TestColumnInt(TestColumnType):

    def value_is(val: Any) -> bool:
        return isinstance(val, int)

class TestColumnCodePostal(TestColumnInt):

    def value_is(val: Any) -> bool:
        return _code_postal.is_valid(val)

    def header_is(header: str) -> float:
        words_combinations_list = [
            "code postal",
            "postal code",
            "postcode",
            "post code",
            "cp",
            "codes postaux",
            "location postcode",
        ]
        return header_score(header, words_combinations_list)

All child classes have:

  • a method value_is(value: Any) -> bool
  • a method header_is(header: str) -> float

@Pierlou
Copy link
Collaborator Author

Pierlou commented Aug 29, 2025

Agreed that a clean refactor using classes should be done in an upcoming PR, let's brainstorm on that in the coming weeks :)

@Pierlou Pierlou marked this pull request as draft March 12, 2026 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants