
Conversation

@asweet-confluent asweet-confluent commented Sep 29, 2025

Description

Refactors the remote module:

  • The entire RemoteEntry trait has been deleted.
  • URL construction logic has been moved into the provider implementations.
  • Removes the fetch, fetch_with_early_exit, get_entry, and get_entries_with_page functions. In their place there is a simple get_json that GETs the provided URL and parses the response JSON into the given type. That way, RemoteClient is just a dumb HTTP client until something implements the other functions in the trait (a rough sketch of the resulting shape follows this list).
  • Adds the commits_stream and pull_requests_stream function signatures to the RemoteClient trait, making commits and PRs first-class abstractions instead of generic, pageable resources the way RemoteEntry treated them.
  • Instead of returning a vector of results (or a single object that wraps one), the results are exposed as a stream. Consumers of RemoteClient objects no longer need to know how the API responses are structured - all they see is a stream of results. This resolves the shenanigans around fetch_with_early_exit, early_exit, etc.
  • Moves the response handling logic into the provider implementations instead of trying to do it generically in RemoteClient.
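
For a sense of the resulting shape, here's a rough sketch of the trait. This is illustrative only - the stream item and error types are stand-ins, not necessarily the exact signatures in this PR:

use futures::Stream;
use serde::de::DeserializeOwned;

pub trait RemoteClient {
    /// Dumb HTTP helper: GET the given URL and parse the response JSON.
    async fn get_json<T: DeserializeOwned>(&self, url: &str) -> Result<T>;

    /// Commits as a first-class stream; pagination stays an implementation
    /// detail of each provider.
    fn commits_stream(&self) -> impl Stream<Item = Result<Box<dyn RemoteCommit>>>;

    /// Pull requests as a first-class stream.
    fn pull_requests_stream(&self) -> impl Stream<Item = Result<Box<dyn RemotePullRequest>>>;
}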

Motivation and Context

I started out trying to address #422. There are a few ways to do that:

  • Implement rate limiting controls.
  • Try to be smarter about checking if someone has contributed before. For example, if you fetch a few pages of history and find that all the authors in the changelog have already contributed, you don't need to fetch the rest of the history.
  • Only fetch the complete history if the template actually references is_first_time.

In trying to figure out how to implement any of those cleanly, I found the current abstractions to be a bit of a straitjacket. It seems like the intention was to reuse code, but the result makes it difficult to modify behavior based on the different features each provider's API offers.

This should be a no-op: the provider implementations use the new stream functions, but they still collect the entire stream into a list like they did before. The refactor does add a marginal amount of code duplication, but hopefully it clears the way for further enhancements.
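
For instance, keeping the old behavior on top of the stream API is just a matter of draining it. A minimal sketch, assuming futures' TryStreamExt (not necessarily the exact code in the PR):

use futures::TryStreamExt;

// Collect the whole stream into a Vec, exactly like the old fetch path did.
let commits: Vec<Box<dyn RemoteCommit>> = client.commits_stream().try_collect().await?;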

How Has This Been Tested?

In addition to the existing tests, I ran a quick check with the GitHub integration and made sure that commit and PR count stayed the same after the refactor:

 DEBUG git_cliff_core::changelog   > Number of GitHub commits: 1519
 DEBUG git_cliff_core::changelog   > Number of GitHub pull requests: 925

Types of Changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (no code change)
  • Refactor (refactoring production code)
  • Other

Checklist:

  • My code follows the code style of this project.
  • I have updated the documentation accordingly.
  • I have formatted the code with rustfmt.
  • I checked the lints with clippy.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

welcome bot commented Sep 29, 2025

Thanks for opening this pull request! Please check out our contributing guidelines! ⛰️

codecov-commenter commented Sep 29, 2025

Codecov Report

❌ Patch coverage is 3.27103% with 207 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.31%. Comparing base (8270084) to head (71479fa).

Files with missing lines                   Patch %   Lines
git-cliff-core/src/remote/gitlab.rs        11.67%    53 Missing ⚠️
git-cliff-core/src/remote/bitbucket.rs      0.00%    49 Missing ⚠️
git-cliff-core/src/remote/gitea.rs          0.00%    49 Missing ⚠️
git-cliff-core/src/remote/github.rs         0.00%    49 Missing ⚠️
git-cliff-core/src/changelog.rs             0.00%     5 Missing ⚠️
git-cliff-core/src/remote/mod.rs            0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1272      +/-   ##
==========================================
- Coverage   44.32%   43.31%   -1.00%     
==========================================
  Files          22       22              
  Lines        2033     2085      +52     
==========================================
+ Hits          901      903       +2     
- Misses       1132     1182      +50     
Flag         Coverage Δ
unit-tests   43.31% <3.28%> (-1.00%) ⬇️

orhun commented Sep 30, 2025

Heya!

In trying to figure out how to implement any of those cleanly, I found the current abstractions to be a bit of a straitjacket.
[...]
The refactor does add a marginal amount of code duplication, but hopefully it clears the way for further enhancements.

I'm happy that someone was interested in tackling it! So thanks for working on this.

The changes look pretty good already and the code is in much better shape. I'll try to add a thorough review soon.

cc @ognis1205 if you're interested in taking a look at "this is how async is done"

@asweet-confluent
Contributor Author

I noticed there are still some unused function arguments floating around; I'll get those cleaned up.

@ognis1205
Contributor

cc @ognis1205 if you're interested in taking a look at "this is how async is done"

Hey @orhun, thanks for sharing! I'll take a look at this later as well.

@ognis1205 ognis1205 left a comment

I could have left comments directly on the relevant parts of the source code, but since the same methods are being added to each backend, I’m leaving a consolidated review comment here.

I started out trying to address #422. There are a few ways to do that:

  • Implement rate limiting controls.
  • Try to be smarter about checking if someone has contributed before. For example, if you fetch a few pages of history and find that all the authors in the changelog have already contributed, you don't need to fetch the rest of the history.
  • Only fetch the complete history if the template actually references is_first_time.

Given the characteristics of a CLI application (it issues multiple one-off requests to fetch the necessary data, rather than acting like a web app), there may be room to design a somewhat smarter API request strategy. That said, I believe the more robust and conventional approach is to recommend the use of PATs, because of:

  • Unpredictable request counts
  • Shared IP side effects
  • Limitations of caching
  • Limitations of wait/retry strategies

For these reasons, while it may be useful to implement some smart request strategies, I think recommending PAT usage is the most solid approach:

  • Significant relaxation of rate limits
  • More stable user experience
  • Consideration for the user

This should be a no-op: the provider implementations use the new stream functions, but they still collect the entire stream into a list like they did before. The refactor does add a marginal amount of code duplication, but hopefully it clears the way for further enhancements.

Although backend APIs are basically pagination-based REST APIs, I agree with handling them as streams in git-cliff. On the other hand:

  • Adds the commits_stream and pull_requests_stream function signatures to the RemoteClient trait, making commits and PRs first-class abstractions instead of abstracting around generic, pageable resources the way RemoteEntry did.

I think the trait should expose methods equivalent to get_commits and get_pull_requests, since from outside the module these are already being used as the primary surface. In the current implementation, the method signatures differ for each backend, but for example:

struct GitHubClientConfig {
    // ... parameters currently passed to get_commits / get_pull_requests
}

impl GitHubClient {
    fn configure(config: GitHubClientConfig) -> Self {
        // ...
    }
}

In this style, the parameters required for get_commits and get_pull_requests can be set when creating the client. This allows the trait to simply define get_commits(&self) and get_pull_requests(&self).

From a design perspective, I agree with adopting streams overall. As @orhun implicitly noted (and as @asweet-confluent described in the PR), simply switching to streams at this layer has no impact unless the downstream code also processes data in a streaming fashion. For example:

  • let data = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?
        .block_on(async {
            let (commits, pull_requests) = tokio::try_join!(
                github_client.get_commits(ref_name),
                github_client.get_pull_requests(ref_name),
            )?;
            log::debug!("Number of GitHub commits: {}", commits.len());
            log::debug!("Number of GitHub pull requests: {}", pull_requests.len());
            Ok((commits, pull_requests))
        });
    log::info!("{}", github::FINISHED_FETCHING_MSG);
    data
  • pub fn $fn(
        &mut self,
        mut commits: Vec<Box<dyn RemoteCommit>>,
        pull_requests: Vec<Box<dyn RemotePullRequest>>,
    ) -> Result<()> {

Refactoring downstream logic to stream-based implementations at these two points in particular would likely yield performance improvements, which I’d welcome. On the other hand, this increases both scope and implementation effort, so I suggest opening a tracking issue and dividing the refactoring into three parts:

  • Redesign of the remote client API
  • Stream adoption in downstream consumers of the client
  • Changes to progress bars and logging

That way, this PR can remain a no-op change and be reviewed as such.

To reiterate, as a suggestion for this PR, please reconsider the public methods of each backend client. Since get_commits / get_pull_requests and commits_stream / pull_requests_stream are almost identical in logic, the current API feels somewhat redundant. What do you think?

@asweet-confluent
Contributor Author

I think recommending PAT usage is the most solid approach:

By PAT, do you mean personal access token? I suppose that would raise the rate limits, but it's a band-aid. At least for GitHub, respecting the rate limit headers would go a long way towards addressing the problem. I do see some mentions of a "secondary rate limit", but it looks like it should be simple enough to manage for git-cliff's use case.

Personally, I haven't actually run into the rate limiting issue because I already use a PAT. My problem is that enabling the GitHub integration causes git-cliff HEAD~1..HEAD execution time to go from 0.1 seconds to 15 seconds. It's a little bit silly. I'm guessing that reducing the number of requests made will resolve the rate limiting issues for most changelogs. For others, we still have to manage rate limits. Anyway, there are lots of incremental improvements to be made. Recommending use of PATs is fine, but it doesn't resolve the underlying problems.
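
For what it's worth, respecting the primary limit could look something like this. A minimal sketch, assuming a reqwest::Response is at hand; x-ratelimit-remaining and x-ratelimit-reset are GitHub's documented rate limit headers:

use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Sketch: if the quota is exhausted, sleep until the window resets.
async fn wait_if_rate_limited(response: &reqwest::Response) {
    let header = |name: &str| {
        response
            .headers()
            .get(name)
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<u64>().ok())
    };
    if header("x-ratelimit-remaining") == Some(0) {
        if let Some(reset) = header("x-ratelimit-reset") {
            let now = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs();
            tokio::time::sleep(Duration::from_secs(reset.saturating_sub(now))).await;
        }
    }
}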

I think the trait should expose methods equivalent to get_commits and get_pull_requests, since from outside the module these are already being used as the primary surface.

If I'm understanding correctly, you're suggesting something like this?

pub trait RemoteClient {
    type Config; // associated type for configuration

    fn config(&self) -> &Self::Config;

    // These call config() and do something with it
    fn get_commits(&self) -> impl Stream;
    fn get_pull_requests(&self) -> impl Stream;
}

pub struct GitHubConfig {
    // ...
}

pub struct GitHubClient {
    config: GitHubConfig,
}

impl RemoteClient for GitHubClient {
    type Config = GitHubConfig;

    fn config(&self) -> &GitHubConfig {
        &self.config
    }

    fn get_commits(&self) -> impl Stream {
        // use config to determine which and how many commits should be fetched
    }
}

In this style, the parameters required for get_commits and get_pull_requests can be set when creating the client. This allows the trait to simply define get_commits(&self) and get_pull_requests(&self).

Is there some benefit to this indirection? It seems like it's basically just moving the function arguments into a separate struct that needs to be configured through a separate function. Why not something like:

impl GitHubClient {
    // Optionally filter commits by author. This could be used to reduce the
    // number of calls needed to determine is_first_time.
    fn get_commits(&self, author: Option<&str>) -> impl Stream {
        // ...
    }
}

impl BitbucketClient {
    // Filtering by author isn't supported in the Bitbucket API, so we'd
    // probably need to get the full list of commits.
    fn get_commits(&self) -> impl Stream {
        // ...
    }
}

I suggest opening a tracking issue and dividing the refactoring into three parts:

  • Redesign of the remote client API
  • Stream adoption in downstream consumers of the client
  • Changes to progress bars and logging

To be clear, this PR is the first entry you listed, right? Either way, I'm happy to break up the PR if needed.

Since get_commits / get_pull_requests and commits_stream / pull_requests_stream are almost identical in logic, the current API feels somewhat redundant. What do you think?

I agree; I think the commits_stream and pull_requests_stream methods in RemoteClient are providing little value. Would you be open to removing them from RemoteClient as I described above?

@ognis1205
Contributor

Thanks for the reply, @asweet-confluent!

To be clear, this PR is the first entry you listed, right? Either way, I'm happy to break up the PR if needed.

I agree with adopting the stream approach for handling remote metadata. And as you pointed out, I also agree with reviewing this PR as a redesign of the remote client API. Rather than breaking up this PR arbitrarily, I’d suggest creating a tracking issue with a checklist for the following items:

  • Redesign of the remote client API
  • Stream adoption in downstream consumers of the client
  • Changes to progress bars and logging

If I'm understanding correctly, you're suggesting something like this?
Is there some benefit to this indirection? It seems like it's basically just moving the function arguments into a separate struct that needs to be configured through a separate function. Why not something like:

If we adopt streams throughout all downstream processing, changes would likely be necessary in these two places:

(the same two snippets quoted in my review above: the block_on call that collects commits and pull requests, and the $fn signature that receives them as Vecs)

Ultimately, the arguments in the second snippet would likely just be the client instance or the stream itself. In that case, external modules are mainly interested in commits_stream and pull_requests_stream from the remote module; they generally don't care about the request details for each backend. In other words, when calling commits_stream or pull_requests_stream, it's desirable that all parameters required for the request are already set.

In this PR, external modules effectively call through get_commits and get_pull_requests. However, if we move toward a design where commits_stream and pull_requests_stream can be called directly, I think it makes sense to configure the necessary information when instantiating the client using a Config.

I agree; I think the commits_stream and pull_requests_stream methods in RemoteClient are providing little value. Would you be open to removing them from RemoteClient as I described above?

Building on what I mentioned earlier, after reviewing the changes in this PR more closely, it seems that we might not need a Config after all. In fact, leaving commits_stream and pull_requests_stream as-is could better serve the intention I outlined above.

@asweet-confluent
Contributor Author

I’d suggest creating a tracking issue with a checklist for the following items:

Will do.

In that case, external modules are mainly interested in commits_stream and pull_requests_stream from the remote module; they generally don't care about the request details for each backend. In other words, when calling commits_stream or pull_requests_stream, it’s desirable that all parameters required for the request are already set.

My hope is that we'll be able to take advantage of features offered by some providers to improve performance. For example, your comment earlier:

That said, one detail worth paying attention to is the calculation of first_contributors: since it relies on having complete remote information
To calculate the first contributor from a stream, should we create a counter map and refer to it?

Exactly! But the worst case scenario (a new contributor) still requires fetching the entire history. You can optimize even further by taking advantage of the author and until filters offered by GitHub and GitLab - then you should need to make, at most, one request per author in the changelog to determine whether someone is a new contributor.

You could even take that a step further: I would bet that most contributors in a given changelog are either frequent or semi-frequent contributors, and a smaller portion are infrequent or new contributors (this is just a hunch and it may depend on the project). If that's the case, then most authors for a changelog should appear within the first n commits of history just before the changelog. The optimal path may be to request, say, 3 pages of commits, then for any author that isn't found, make a request with the author filter set.
Anyway, my point is that not every provider supports the author filter - I believe only GitHub and GitLab do. So the code you linked here would have a different implementation for different providers, and would make different API calls with different filters.
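
To make that concrete, the GitHub/GitLab path could look roughly like this - both helper methods here are hypothetical, nothing that exists in git-cliff today:

use std::collections::HashSet;

// Hypothetical two-phase strategy for determining first-time contributors.
async fn first_time_authors(
    client: &GitHubClient,
    changelog_authors: &[String],
) -> Result<Vec<String>> {
    // Phase 1: fetch a few pages of history just before the changelog and
    // record every author seen (hypothetical helper).
    let seen: HashSet<String> = client.fetch_recent_author_names(3 /* pages */).await?;

    let mut first_timers = Vec::new();
    for author in changelog_authors {
        if seen.contains(author) {
            continue; // definitely contributed before
        }
        // Phase 2: one targeted request per remaining author, using the
        // provider's `author` and `until` filters (hypothetical helper).
        if !client.has_commit_before_changelog_by(author).await? {
            first_timers.push(author.clone());
        }
    }
    Ok(first_timers)
}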

Ultimately, the arguments in the second snippet would likely just be the client instance or the stream itself. In that case, external modules are mainly interested in commits_stream and pull_requests_stream from the remote module; they generally don't care about the request details for each backend. In other words, when calling commits_stream or pull_requests_stream, it's desirable that all parameters required for the request are already set.

Okay, I think I see the disconnect here. I honestly hadn't even considered exposing the stream at that point in the pipeline; I was more focused on clearing the path to reducing the number of API requests. I was going to let the get_<provider>_metadata functions in changelog.rs consume from the streams, filter things down to the relevant commits, and then collect the commits and PRs into Vecs as they do now. But I see the value in exposing a stream all the way down the consumer pipeline, and you're absolutely right: update_release_metadata itself should not have to know about all the nuances of the API - it should just get a plain list of commits and PRs. So it seems like we'd need two interfaces: a stream interface that directly wraps the HTTP client and exposes parameters specific to the provider's HTTP API, and a stream interface exposed outside of changelog.rs that provides plain get_commits/get_prs functions.
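
Something like this, with hypothetical names (the stream item type is a plain String stand-in so the sketch is self-contained):

use futures::{stream, Stream};

// Hypothetical inner layer: wraps the HTTP client and exposes
// provider-specific parameters such as GitHub's `author` filter.
struct GitHubClient;

impl GitHubClient {
    fn commits_stream(&self, _author: Option<&str>) -> impl Stream<Item = String> {
        stream::empty() // placeholder body for the sketch
    }
}

// Hypothetical outer layer: the plain interface exposed outside changelog.rs;
// it decides which filters to use and hides them from consumers like
// update_release_metadata.
fn get_commits(client: &GitHubClient) -> impl Stream<Item = String> {
    client.commits_stream(None)
}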

@ognis1205
Contributor

Ultimately, I believe @orhun will decide on the design direction, but I wanted to share my suggestion in case it's useful for this PR's discussion. Thank you @asweet-confluent for your thoughtful response — I've understood your points and intentions!

@asweet-confluent
Contributor Author

@orhun I cleaned it up a bit; please review at your leisure.

@orhun orhun left a comment

I had another look and it LGTM!

Hopefully there won't be any side effects. (If there is, get ready to be pinged :D)

@orhun orhun merged commit b82221a into orhun:main Nov 4, 2025
91 checks passed

welcome bot commented Nov 4, 2025

Congrats on merging your first pull request! ⛰️

amd989 added a commit to amd989/git-cliff that referenced this pull request Nov 4, 2025
@amd989 amd989 mentioned this pull request Nov 12, 2025
orhun added a commit that referenced this pull request Nov 21, 2025
* feat: ✨ add support for azure devops

Introducing integration with Azure DevOps to bridge the gap in obtaining commit information

* test: ✅ add tests for azure devops integration

Provide the necessary fixtures to test the functionality of azure devops

* style: 🚨 run clippy and rustfmt on the solution

* docs: 📝 add azure devops documentation

Adding md file for azure dev ops

* test: ✅ add cargo tests

adding azure devops unit tests

* test: ✅ add tests to improve code coverage

* chore: 🚨 run linters

* test: ✅ add more tests to improve code coverage

* fix: add executable permission to Azure DevOps fixture scripts

* test: ✅ add missing azure devops values in the context files

* docs: ✏️ remove the link to the example devops toml as it does not exist yet

* fix: implement changes from #1272

* chore: remove unnecessary AI-generated tests

* docs: update the arguments

---------

Co-authored-by: Orhun Parmaksız <orhunparmaksiz@gmail.com>
