Conversation

@Joyakis
Contributor

@Joyakis Joyakis commented Nov 5, 2025

Fixes

Description

Adds a fetch script for Wikimedia Commons (WikiCommons) data, following the conventions of the other fetch scripts.

Tests

  • No API key is required for Wikimedia Commons.
    To run the script:
    pipenv run ./scripts/1-fetch/wikicommons_fetch.py
    
  • Verify that a CSV file is generated at: data/2025Q4/1-fetch/wikicommons_fetch.csv
  • Confirm the CSV file contains the headers: LICENSE_TYPE, FILE_COUNT, PAGE_COUNT
  • Ensure the file uses UTF-8 encoding and Unix newlines (\n).
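
The manual checks above could also be scripted. The following is a minimal sketch and is not part of this PR: `check_fetch_csv` is a hypothetical helper name, and only the expected headers, the UTF-8 encoding, and the Unix newlines are taken from the checklist.

```python
import csv
import io

# Headers listed in the Tests section of this PR description.
EXPECTED_HEADERS = ["LICENSE_TYPE", "FILE_COUNT", "PAGE_COUNT"]


def check_fetch_csv(path):
    """Return a list of problems found in the CSV at `path` (empty if none)."""
    # Read raw bytes so encoding and newline style can be checked directly.
    with open(path, "rb") as file_obj:
        raw = file_obj.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    problems = []
    # Any carriage return means the file does not use plain "\n" newlines.
    if "\r" in text:
        problems.append(
            "file contains carriage returns (expected Unix newlines)"
        )
    # newline="" lets the csv module do its own newline handling.
    header = next(csv.reader(io.StringIO(text, newline="")), [])
    if header != EXPECTED_HEADERS:
        problems.append(f"unexpected headers: {header}")
    return problems
```

Under these assumptions, calling `check_fetch_csv("data/2025Q4/1-fetch/wikicommons_fetch.csv")` after running the script should return an empty list.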

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Joyakis Joyakis requested review from a team as code owners November 5, 2025 14:37
@Joyakis Joyakis requested review from Shafiya-Heena and TimidRobot and removed request for a team November 5, 2025 14:37
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Nov 5, 2025
@Goziee-git

This comment was marked as outdated.

@TimidRobot TimidRobot changed the title Add Wikimedia automation Add Wikimedia fetch Nov 5, 2025
@TimidRobot TimidRobot self-assigned this Nov 5, 2025
@TimidRobot TimidRobot changed the title Add Wikimedia fetch Add WikiCommons fetch Nov 5, 2025
@TimidRobot

This comment was marked as outdated.

@Joyakis
Contributor Author

Joyakis commented Nov 6, 2025

  • I added or updated tests for the changes I made (if applicable).

@Joyakis I don't see any added or updated tests

I’ve now added the test instructions under the Tests section in the PR description.

@TimidRobot
Member

  • No API key is required for Wikimedia Commons.
    To run the script and save the output:
    pipenv run python scripts/1-fetch/wikicommons_fetch.py --enable-save

This is why you didn't initially make the file executable. This isn't how the documentation describes Running the scripts

@Joyakis
Contributor Author

Joyakis commented Nov 6, 2025

  • No API key is required for Wikimedia Commons.
    To run the script and save the output:
    pipenv run python scripts/1-fetch/wikicommons_fetch.py --enable-save

This is why you didn't initially make the file executable. This isn't how the documentation describes Running the scripts

Ah, I see. That explains it. Thank you for the clarification.
I have since updated it and will be more careful going forward.

Comment on lines 169 to 172
if count == 0:
    LOGGER.warning(f"Skipping {category} — 0 {label} found.")
else:
    LOGGER.info(f"Fetched {count} {label} for {category}.")
Member

Ah, I better understand what is going on here.

Please remove the warning (and keep the info).

Contributor Author

Hello. This block retrieves all subcategories for the current WikiCommons category, counts them, and logs the result before deciding whether to continue the recursive traversal. It first calls get_subcategories() to fetch the list, then labels the results as either “categories” (at the top level) or “subcategories” (at deeper levels). If none are found, the script logs that the category has zero subcategories and does not recurse further. If some are found, it logs how many were retrieved and then recursively processes each one.
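
The traversal described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `FAKE_TREE`, `traverse`, and the `visited` cycle guard are hypothetical, and `get_subcategories()` is stubbed with an in-memory dictionary instead of calling the API. Only the `count`, `label`, and `LOGGER.info` lines mirror the snippet under review.

```python
import logging

LOGGER = logging.getLogger("wikicommons_sketch")

# Hypothetical in-memory category tree standing in for the live API.
FAKE_TREE = {
    "CC-BY-4.0": ["CC-BY-4.0 images", "CC-BY-4.0 audio"],
    "CC-BY-4.0 images": ["CC-BY-4.0 maps"],
}


def get_subcategories(category):
    """Stub: return the direct subcategories of `category`."""
    return FAKE_TREE.get(category, [])


def traverse(category, depth=0, visited=None):
    """Return hierarchical paths for `category` and all of its descendants."""
    # Assumption: guard against cycles, since category graphs can loop.
    visited = set() if visited is None else visited
    if category in visited:
        return []
    visited.add(category)

    subcategories = get_subcategories(category)
    count = len(subcategories)
    label = "categories" if depth == 0 else "subcategories"
    if count > 0:
        LOGGER.info(f"Fetched {count} {label} for {category}.")

    paths = [category]
    # With zero subcategories the loop body never runs, so recursion stops.
    for subcategory in subcategories:
        for path in traverse(subcategory, depth + 1, visited):
            paths.append(f"{category}/{path}")
    return paths
```

Because an empty subcategory list simply ends the loop, no separate branch is needed for the zero case.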

Member

@TimidRobot TimidRobot Nov 24, 2025

Please reinstate the logic:

        if count > 0:
            LOGGER.info(f"Fetched {count} {label} for {category}.")

@TimidRobot
Member

The data fetched needs a little bit of cleanup. Please ensure LICENSE_TYPE is limited to legal tools. I suspect you can do this by ensuring each category is also an instance of "open license".

@TimidRobot
Member

Please update sources.md

@Joyakis
Contributor Author

Joyakis commented Nov 14, 2025

The data fetched needs a little bit of cleanup. Please ensure LICENSE_TYPE is limited to legal tools. I suspect you can do this by ensuring each category is also an instance of "open license".

Hello @TimidRobot. The LICENSE_TYPE column contains hierarchical paths for each Creative Commons license and its subcategories. Would you like only the top-level categories and not the subcategories?

@Joyakis
Contributor Author

Joyakis commented Nov 14, 2025

Please update sources.md

Okay, I have included it.

sources.md Outdated
Comment on lines 172 to 186
## Wikimedia Commons

**Description:** Wikimedia Commons is a repository of free-to-use media files. Its API allows users to query files, categories, metadata, and license information. You can retrieve statistics such as file counts, page counts, categories, and subcategories. The API runs on the MediaWiki Action API, similar to Wikipedia, and provides access to information about media files, licenses, and categories across Wikimedia projects.

**API documentation link:**
[WIKIMEDIA_BASE_URL documentation](https://en.wikipedia.org/w/api.php)
[WIKIMEDIA_BASE_URL reference page](https://www.mediawiki.org/wiki/API:Action_API)


**API information**

- No API key required
- Query limit: Rate-limited to prevent abuse
- Data available in XML or JSON format
- Can query file metadata, category members, and license types
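
As an illustration of the kind of query described above, the sketch below builds a MediaWiki Action API request for a category's `categoryinfo` and reads file and page counts from a response. It rests on several assumptions: the Commons endpoint `https://commons.wikimedia.org/w/api.php` is used here, the function names are hypothetical, and mapping the `files` and `pages` fields to FILE_COUNT and PAGE_COUNT is a guess at how the fetch script derives its columns. No network request is made.

```python
from urllib.parse import urlencode

# Assumption: the Wikimedia Commons Action API endpoint.
COMMONS_API_URL = "https://commons.wikimedia.org/w/api.php"


def build_category_info_url(category):
    """Build a query URL requesting categoryinfo for one category."""
    params = {
        "action": "query",
        "prop": "categoryinfo",
        "titles": f"Category:{category}",
        "format": "json",
    }
    return f"{COMMONS_API_URL}?{urlencode(params)}"


def parse_category_info(response):
    """Map each page title in a decoded JSON response to its counts."""
    counts = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("categoryinfo", {})
        counts[page.get("title", "")] = {
            "FILE_COUNT": info.get("files", 0),
            "PAGE_COUNT": info.get("pages", 0),
        }
    return counts


# Hand-written sample in the shape the API returns (values are made up).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "123": {
                "title": "Category:CC-BY-4.0",
                "categoryinfo": {
                    "size": 10, "pages": 2, "files": 7, "subcats": 1,
                },
            }
        }
    }
}
```

The URL from `build_category_info_url("CC-BY-4.0")` can be fetched with any HTTP client, and the decoded JSON passed to `parse_category_info`.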
Member

The sources are sorted (this should go before Wikipedia)

Contributor Author

Okay

@TimidRobot
Member

The data fetched needs a little bit of cleanup. Please ensure LICENSE_TYPE is limited to legal tools. I suspect you can do this by ensuring each category is also an instance of "open license".

Hello @TimidRobot. The LICENSE_TYPE column contains hierarchical paths for each Creative Commons license and its subcategories. Would you like only the top-level categories and not the subcategories?

@Joyakis Please review the data fetched. The following are not valid legal tools:

  • Free_Creative_Commons_licenses
  • CC-BY-SA-3.0-migrated
  • Photographs by Agencia Brasil
  • Agência Brasil related uploads affected by license change
  • Images by TV Brasil
  • etc.
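
One possible way to exclude entries like those listed above is a name-based filter. This is only a sketch: it swaps the "instance of open license" check suggested earlier for a simple regular expression, and both `LEGAL_TOOL_PATTERN` and `filter_legal_tools` are hypothetical names. The pattern assumes license categories are named `CC-<elements>-<version>` with an optional two-letter jurisdiction suffix, which may not cover every legal tool (for example, public domain tools).

```python
import re

# Assumption: legal-tool categories look like "CC-BY-SA-4.0" or
# "CC-BY-3.0-de"; anything else is treated as a non-license category.
LEGAL_TOOL_PATTERN = re.compile(
    r"CC-(BY|BY-SA|BY-NC|BY-ND|BY-NC-SA|BY-NC-ND)-\d\.\d(-[A-Za-z]{2})?"
)


def filter_legal_tools(categories):
    """Keep only category names that fully match the legal-tool pattern."""
    return [
        category
        for category in categories
        if LEGAL_TOOL_PATTERN.fullmatch(category)
    ]
```

Under this pattern, names such as `CC-BY-SA-3.0-migrated` and `Free_Creative_Commons_licenses` are dropped, while `CC-BY-4.0` is kept.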

@Joyakis
Contributor Author

Joyakis commented Nov 24, 2025

The data fetched needs a little bit of cleanup. Please ensure LICENSE_TYPE is limited to legal tools. I suspect you can do this by ensuring each category is also an instance of "open license".

Hello @TimidRobot. The LICENSE_TYPE column contains hierarchical paths for each Creative Commons license and its subcategories. Would you like only the top-level categories and not the subcategories?

@Joyakis Please review the data fetched. The following are not valid legal tools:

  • Free_Creative_Commons_licenses
  • CC-BY-SA-3.0-migrated
  • Photographs by Agencia Brasil
  • Agência Brasil related uploads affected by license change
  • Images by TV Brasil
  • etc.

I have pushed changes so that the data includes only the legal tools.


Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Add WikiCommons Data Source

3 participants