Add WikiCommons fetch #234
base: main
Conversation
This comment was marked as outdated.
I’ve now added the test instructions under the Tests section in the PR description.
This is why you didn't initially make the file executable. This isn't how the documentation describes Running the scripts.
Ah, I see... that explains it. Thank you for the clarification.
scripts/1-fetch/wikicommons_fetch.py
Outdated
    if count == 0:
        LOGGER.warning(f"Skipping {category} — 0 {label} found.")
    else:
        LOGGER.info(f"Fetched {count} {label} for {category}.")
Ah, I better understand what is going on here.
Please remove the warning (and keep the info).
Hello. This code retrieves all subcategories for the current WikiCommons category, counts how many were found, and logs that information before deciding whether to continue the recursive traversal. It first calls get_subcategories() to fetch the list, determines the number of results, and labels them as either “categories” (at the top level) or “subcategories” (for deeper levels). If none are found, the script logs that the category has zero subcategories and does not recurse further; if some are found, it logs how many were retrieved and then recursively processes each one.
Please reinstate the logic:
    if count > 0:
        LOGGER.info(f"Fetched {count} {label} for {category}.")
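The traversal discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/1-fetch/wikicommons_fetch.py: the function name `walk_categories`, the `max_depth` and `seen` parameters, and passing `get_subcategories` in as a callable are all assumptions made so the example runs without network access.

```python
import logging

LOGGER = logging.getLogger(__name__)


def walk_categories(category, get_subcategories, depth=0, max_depth=2, seen=None):
    """Recursively visit subcategories, logging only when some are found.

    `get_subcategories` is a hypothetical callable: category name -> list of
    subcategory names. `max_depth` and `seen` are assumptions added to keep
    the recursion bounded and to avoid revisiting a category.
    """
    seen = set() if seen is None else seen
    if category in seen or depth > max_depth:
        return seen
    seen.add(category)

    subcategories = get_subcategories(category)
    count = len(subcategories)
    # Top-level results are called "categories", deeper ones "subcategories".
    label = "categories" if depth == 0 else "subcategories"
    if count > 0:
        LOGGER.info(f"Fetched {count} {label} for {category}.")

    # With zero subcategories the loop body never runs, so recursion stops.
    for sub in subcategories:
        walk_categories(sub, get_subcategories, depth + 1, max_depth, seen)
    return seen


# Illustrative in-memory tree standing in for the API (made-up names).
FAKE_TREE = {"Root": ["A", "B"], "A": ["A1"], "B": [], "A1": []}
visited = walk_categories("Root", lambda name: FAKE_TREE.get(name, []))
```

Logging only when `count > 0` keeps the output quiet for leaf categories, which matches the reviewer's request to drop the warning and keep the info message.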
The data fetched needs a little bit of cleanup. Please ensure
Please update
Hello @TimidRobot. The LICENSE_TYPE column contains hierarchical paths for each Creative Commons license and its subcategories. Would you like only the top-level categories, without the subcategories?
Okay. I have included it.
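As a rough sketch of what "top-level only" could mean for the LICENSE_TYPE column: the example below assumes the hierarchical paths use "/" as a separator, which this thread does not confirm. The helper names and the sample rows are hypothetical and the counts are made up for illustration.

```python
def is_top_level(license_path, separator="/"):
    """Return True when the path has no nested component.

    The "/" separator is an assumption about how the paths are written.
    """
    return separator not in license_path


def keep_top_level(rows, separator="/"):
    """Keep only rows whose LICENSE_TYPE is a top-level category."""
    return [row for row in rows if is_top_level(row["LICENSE_TYPE"], separator)]


# Made-up rows using the CSV column names from the PR description.
sample_rows = [
    {"LICENSE_TYPE": "CC-BY-4.0", "FILE_COUNT": 10, "PAGE_COUNT": 2},
    {"LICENSE_TYPE": "CC-BY-4.0/Example subcategory", "FILE_COUNT": 3, "PAGE_COUNT": 1},
    {"LICENSE_TYPE": "CC-BY-SA-4.0", "FILE_COUNT": 7, "PAGE_COUNT": 0},
]
top_rows = keep_top_level(sample_rows)
```

Filtering, rather than summing child rows into their parent, avoids double counting if a parent category's counts already include its subcategories.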
sources.md
Outdated
## Wikimedia Commons

**Description:** Wikimedia Commons is a repository of free-to-use media files. Its API allows users to query files, categories, metadata, and license information. You can retrieve statistics such as file counts, page counts, categories, and subcategories. The API runs on the MediaWiki Action API, similar to Wikipedia, and provides access to information about media files, licenses, and categories across Wikimedia projects.

**API documentation link:**
[WIKIMEDIA_BASE_URL documentation](https://en.wikipedia.org/w/api.php)
[WIKIMEDIA_BASE_URL reference page](https://www.mediawiki.org/wiki/API:Action_API)

**API information**

- No API key required
- Query limit: Rate-limited to prevent abuse
- Data available in XML or JSON format
- Can query file metadata, category members, and license types
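For readers unfamiliar with the MediaWiki Action API, here is a minimal sketch of how file and page counts for one category might be requested and read. It assumes the Commons endpoint `https://commons.wikimedia.org/w/api.php` and the `prop=categoryinfo` query; whether the PR's script uses exactly this query is not shown in this conversation. The helper names and the sample payload values are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons (not taken from the PR itself).
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_categoryinfo_url(category):
    """Build an Action API URL asking for a category's counts as JSON."""
    params = {
        "action": "query",
        "prop": "categoryinfo",
        "titles": f"Category:{category}",
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def parse_categoryinfo(payload):
    """Extract file, page, and subcategory counts from a query response."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("categoryinfo", {})
        return {
            "files": info.get("files", 0),
            "pages": info.get("pages", 0),
            "subcats": info.get("subcats", 0),
        }
    # No page entry at all: report zeros rather than raising.
    return {"files": 0, "pages": 0, "subcats": 0}


url = build_categoryinfo_url("CC-BY-4.0")

# Hand-written payload in the shape of a default-format JSON response;
# the numbers are made up for illustration.
sample_payload = {
    "query": {
        "pages": {
            "123": {
                "title": "Category:CC-BY-4.0",
                "categoryinfo": {"size": 10, "pages": 2, "files": 7, "subcats": 1},
            }
        }
    }
}
counts = parse_categoryinfo(sample_payload)
```

Keeping URL construction and response parsing as separate pure functions makes both easy to check without touching the rate-limited API.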
The sources are sorted (this should go before Wikipedia)
Okay
@Joyakis Please review the data fetched. The following are not valid legal tools:
I have pushed the changes reflecting only the legal tools.
Fixes
Description
Adds fetching of WikiCommons data, following the conventions of the other fetch scripts.
Tests
To run the script:
data/2025Q4/1-fetch/wikicommons_fetch.csv
LICENSE_TYPE,FILE_COUNT,PAGE_COUNT\n).
Checklist
Update index.md).
main or master).
visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."