refactor(remote): expose commits and PRs as streams #1272
Conversation
Thanks for opening this pull request! Please check out our contributing guidelines! ⛰️
Codecov Report

❌ Patch coverage is […]. Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1272      +/-   ##
==========================================
- Coverage   44.32%   43.31%    -1.00%
==========================================
  Files          22       22
  Lines        2033     2085      +52
==========================================
+ Hits          901      903       +2
- Misses       1132     1182      +50
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Heya!
I'm happy that someone was interested in tackling it! So thanks for working on this. The changes look pretty good already and the code is in much better shape. I'll try to add a thorough review soon. cc @ognis1205 if you're interested in taking a look at "this is how async is done"
I noticed there are still some unused function arguments floating around; I'll get those cleaned up.
Hey @orhun, thanks for sharing! I'll take a look at this later as well.
I could have left comments directly on the relevant parts of the source code, but since the same methods are being added to each backend, I’m leaving a consolidated review comment here.
> I started out trying to address #422. There are a few ways to do that:
> - Implement rate limiting controls.
> - Try to be smarter about checking if someone has contributed before. For example, if you fetch a few pages of history and find that all the authors in the changelog have already contributed, you don't need to fetch the rest of the history.
> - Only fetch the complete history if the template actually references is_first_time.
Given the characteristics of a CLI application, which issues a one-off batch of requests to fetch the data it needs rather than acting like a web app, there may be room to design a somewhat smarter API request strategy. That said, I believe the more robust and conventional approach is to recommend the use of PATs, since smarter strategies run into several limitations:
- Unpredictable request counts
- Shared IP side effects
- Limitations of caching
- Limitations of wait/retry strategies
For these reasons, while it may be useful to implement some smart request strategies, I think recommending PAT usage is the most solid approach:
- Significant relaxation of rate limits
- More stable user experience
- Consideration for the user
> This should be a no-op: the provider implementations use the new stream functions, but they still collect the entire stream into a list like they did before. The refactor does add a marginal amount of code duplication, but hopefully it clears the way for further enhancements.
Although backend APIs are basically pagination-based REST APIs, I agree with handling them as streams in git-cliff. On the other hand:
> - Adds the commits_stream and pull_requests_stream function signatures to the RemoteClient trait, making commits and PRs first-class abstractions instead of abstracting around generic, pageable resources the way RemoteEntry did.
I think the trait should expose methods equivalent to get_commits and get_pull_requests, since from outside the module these are already being used as the primary surface. In the current implementation, the method signatures differ for each backend, but for example:
```rust
struct GitHubClientConfig {
    // ...
}

GitHubClient::configure(config: GitHubClientConfig);
```

In this style, the parameters required for get_commits and get_pull_requests can be set when creating the client. This allows the trait to simply define get_commits(&self) and get_pull_requests(&self).
From a design perspective, I agree with adopting streams overall. As @orhun implicitly noted (and as @asweet-confluent described in the PR), simply switching to streams at this layer has no impact unless the downstream code also processes data in a streaming fashion. For example:
git-cliff/git-cliff-core/src/changelog.rs (lines 262 to 275 in 0b6af12):

```rust
let data = tokio::runtime::Builder::new_multi_thread()
    .enable_all()
    .build()?
    .block_on(async {
        let (commits, pull_requests) = tokio::try_join!(
            github_client.get_commits(ref_name),
            github_client.get_pull_requests(ref_name),
        )?;
        log::debug!("Number of GitHub commits: {}", commits.len());
        log::debug!("Number of GitHub pull requests: {}", pull_requests.len());
        Ok((commits, pull_requests))
    });
log::info!("{}", github::FINISHED_FETCHING_MSG);
data
```

git-cliff/git-cliff-core/src/remote/mod.rs (lines 310 to 314 in 0b6af12):

```rust
pub fn $fn(
    &mut self,
    mut commits: Vec<Box<dyn RemoteCommit>>,
    pull_requests: Vec<Box<dyn RemotePullRequest>>,
) -> Result<()> {
```
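To make this concrete, here is a rough sketch (not code from this PR; the function name, arguments, and the commits_stream signature are assumptions) of what a downstream consumer that actually drives the stream at these call sites could look like:

```rust
// Hypothetical sketch of stream-based downstream consumption. Assumes
// `commits_stream` yields `Result<Box<dyn RemoteCommit>>` items; the function
// name, arguments, and logging are illustrative, not the git-cliff API.
use futures::{pin_mut, StreamExt};

async fn update_changelog_incrementally(client: &GitHubClient) -> Result<()> {
    let commits = client.commits_stream();
    pin_mut!(commits);
    while let Some(commit) = commits.next().await {
        let commit = commit?;
        // Each commit can be folded into the changelog (and progress/logging
        // updated) as it arrives, instead of waiting for the whole history
        // to be collected into a Vec first.
        log::debug!("fetched remote commit: {}", commit.id());
    }
    Ok(())
}
```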
Refactoring downstream logic to stream-based implementations at these two points in particular would likely yield performance improvements, which I’d welcome. On the other hand, this increases both scope and implementation effort, so I suggest opening a tracking issue and dividing the refactoring into three parts:
- Redesign of the remote client API
- Stream adoption in downstream consumers of the client
- Changes to progress bars and logging
That way, this PR can remain a no-op change and be reviewed as such.
To reiterate, as a suggestion for this PR, please reconsider the public methods of each backend client. Since get_commits / get_pull_requests and commits_stream / pull_requests_stream are almost identical in logic, the current API feels somewhat redundant. What do you think?
By PAT, do you mean personal access token? I suppose that would raise the rate limits, but it's a band-aid. At least for GitHub, respecting the rate limit headers would go a long way towards addressing the problem. I do see some mentions of a "secondary rate limit", but it looks like it should be simple enough to manage for […].
Personally, I haven't actually run into the rate limiting issue because I already use a PAT. My problem is that enabling the GitHub integration causes […].
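As a side note on the rate-limit headers mentioned above, here is a minimal sketch of what honoring GitHub's primary limit could look like, assuming a reqwest-based client (the helper is hypothetical and not part of this PR):

```rust
// Hypothetical helper: read GitHub's documented x-ratelimit-* response headers
// and compute how long to wait before the next request. Purely a sketch;
// git-cliff's actual HTTP layer may differ.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn rate_limit_wait(response: &reqwest::Response) -> Option<Duration> {
    let header = |name: &str| -> Option<u64> {
        response.headers().get(name)?.to_str().ok()?.parse().ok()
    };
    // If requests remain in the current window, no wait is needed.
    if header("x-ratelimit-remaining")? > 0 {
        return None;
    }
    // Otherwise wait until the reset time (a unix timestamp in seconds).
    let reset = header("x-ratelimit-reset")?;
    let now = SystemTime::now().duration_since(UNIX_EPOCH).ok()?.as_secs();
    Some(Duration::from_secs(reset.saturating_sub(now)))
}
```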
If I'm understanding correctly, you're suggesting something like this?

```rust
pub trait RemoteClient {
    type Config; // Associated type for configuration

    fn config(&self) -> &Self::Config;

    // These call config() and do something with it
    fn get_commits() -> impl Stream;
    fn get_pull_requests() -> impl Stream;
}

pub struct GitHubConfig {
    // ...
}

pub struct GitHubClient {
    config: GitHubConfig,
}

impl RemoteClient for GitHubClient {
    fn get_commits() -> impl Stream {
        // use config to determine which and how many commits should be fetched
    }
}
```
Is there some benefit to this indirection? It seems like it's basically just moving the function arguments into a separate struct that needs to be configured through a separate function. Why not something like:

```rust
pub struct GitHubClient {
    // Optionally filter commits by author. This could be used to reduce the number of calls needed
    // to determine is_first_time
    fn get_commits(author: Option<&str>) -> impl Stream {
        // ...
    }
}

pub struct BitbucketClient {
    // Filtering by author isn't supported in the BitBucket API, so we'd probably
    // need to get the full list of commits.
    fn get_commits() -> impl Stream {
        // ...
    }
}
```
To be clear, this PR is the first entry you listed, right? Either way, I'm happy to break up the PR if needed.
I agree; I think the […]
Thanks for the reply, @asweet-confluent!
I agree with adopting the stream approach for handling remote metadata. And as you pointed out, I also agree with reviewing this PR as a redesign of the remote client API. Rather than breaking up this PR arbitrarily, I’d suggest creating a tracking issue with a checklist for the following items:
If we adopt streams throughout all downstream processing, changes at the two places quoted earlier would likely be necessary.
Ultimately, the arguments in the second link would likely just be the client instance or the stream itself. In that case, external modules are mainly interested in […]. In this PR, external modules effectively call through […].
Building on what I mentioned earlier, after reviewing the changes in this PR more closely, it seems that we might not need a Config after all. In fact, […]
Will do.
My hope is that we'll be able to take advantage of features offered by some providers to improve performance. For example, your comment earlier:

> […]
Exactly! But the worst-case scenario (a new contributor) still requires fetching the entire history. You can optimize even further by taking advantage of the […]
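(As a concrete illustration of the author-filter idea from the sketch above: GitHub's list-commits endpoint accepts an author query parameter, so a single one-commit request can answer whether someone has contributed before. The helper below is hypothetical and not part of this PR.)

```rust
// Hypothetical helper, not part of this PR: ask GitHub's "list commits"
// endpoint for at most one commit filtered by author. An empty result means
// the author has no commits in the repository, so is_first_time can be
// answered without paging through the full history.
async fn has_contributed_before(
    client: &reqwest::Client,
    owner: &str,
    repo: &str,
    author: &str,
) -> Result<bool, reqwest::Error> {
    let url = format!(
        "https://api.github.com/repos/{owner}/{repo}/commits?author={author}&per_page=1"
    );
    let commits: Vec<serde_json::Value> = client
        .get(url)
        .header("User-Agent", "git-cliff")
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(!commits.is_empty())
}
```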
Okay, I think I see the disconnect here. I honestly hadn't even considered exposing the stream at that point in the pipeline; I was more focused on clearing the path to reduce the number of API requests. I was going to let the […]
Ultimately, I believe @orhun will decide on the design direction, but I wanted to share my suggestion in case it's useful for this PR's discussion. Thank you @asweet-confluent for your thoughtful response — I've understood your points and intentions!
@orhun I cleaned it up a bit; please review at your leisure.
Force-pushed from c44ca4e to 71479fa.
orhun left a comment:
I had another look and it LGTM!
Hopefully there won't be any side effects. (If there are, get ready to be pinged :D)
Congrats on merging your first pull request! ⛰️
* feat: ✨ add support for azure devops
  Introducing integration with Azure DevOps to bridge the gap in obtaining commit information
* test: ✅ add tests for azure devops integration
  Provide the necessary fixtures to test the functionality of azure devops
* style: 🚨 run clippy and rustfmt on the solution
* docs: 📝 add azure devops documentation
  Adding md file for azure dev ops
* test: ✅ add cargo tests
  adding azure devops unit tests
* test: ✅ add tests to improve code coverage
* chore: 🚨 run linters
* test: ✅ add more tests to improve code coverage
* fix: add executable permission to Azure DevOps fixture scripts
* test: ✅ add missing azure devops values in the context files
* docs: ✏️ remove the link to the example devops toml as it does not exist yet
* fix: implement changes from #1272
* chore: remove unnecessary AI-generated tests
* docs: update the arguments

Co-authored-by: Orhun Parmaksız <orhunparmaksiz@gmail.com>
Description
Refactors the remote module:

- The RemoteEntry trait has been deleted.
- Removed the fetch and fetch_with_early_exit, get_entry, get_entries_with_page functions. Instead, provide a simple get_json that GETs the provided URL and parses the response JSON into the given type. That way, RemoteClient is just a dumb HTTP client until something implements the other functions in the trait (a rough sketch of this idea follows the list).
- Adds the commits_stream and pull_requests_stream function signatures to the RemoteClient trait, making commits and PRs first-class abstractions instead of abstracting around generic, pageable resources the way RemoteEntry did.
- Callers of RemoteClient objects don't need to know the details of how the API responses are structured - all they see is a stream of results. This resolves the shenanigans around fetch_with_early_exit, early_exit, etc.
- […] RemoteClient.
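For illustration, a get_json along the lines described above could look roughly like this (a sketch under the assumption of a reqwest-based client; the actual method in the PR lives on the trait and may differ in signature, headers, and error handling):

```rust
// Sketch only: a generic "dumb HTTP client" helper that GETs a URL and
// deserializes the JSON body into the caller's type. The real get_json in
// this PR may take &self and handle authentication and pagination headers.
async fn get_json<T: serde::de::DeserializeOwned>(
    client: &reqwest::Client,
    url: &str,
) -> Result<T, reqwest::Error> {
    client
        .get(url)
        .send()
        .await?
        .error_for_status()?
        .json::<T>()
        .await
}
```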
Motivation and Context

I started out trying to address #422. There are a few ways to do that:

- Implement rate limiting controls.
- Try to be smarter about checking if someone has contributed before. For example, if you fetch a few pages of history and find that all the authors in the changelog have already contributed, you don't need to fetch the rest of the history.
- Only fetch the complete history if the template actually references is_first_time.

In trying to figure out how to implement any of those cleanly, I found the current abstractions to be a bit of a straitjacket. It seems like the intention was to reuse code, but it makes it difficult to modify behavior based on the different features provided by each provider's API.
This should be a no-op: the provider implementations use the new stream functions, but they still collect the entire stream into a list like they did before. The refactor does add a marginal amount of code duplication, but hopefully it clears the way for further enhancements.
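Put differently, the old list-returning methods can simply drain the new streams. A minimal, generic sketch of that idea (assuming the streams yield Result items; names are illustrative, not the code in this PR):

```rust
use futures::{Stream, TryStreamExt};

// Sketch of the "no-op" wrapper: keep a list-based API by collecting a
// Result-yielding stream into a Vec. `E` stands in for whatever error type
// the stream produces.
async fn collect_all<C, E>(stream: impl Stream<Item = Result<C, E>>) -> Result<Vec<C>, E> {
    stream.try_collect().await
}
```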
How Has This Been Tested?
In addition to the existing tests, I ran a quick check with the GitHub integration and made sure that the commit and PR counts stayed the same after the refactor: […]
Types of Changes
Checklist: