Release 0.1.0 #4
Merged
7 commits
fbb70d3 work on new embedder (khoroshevskyi)
a4c5182 improve processing speed of the data (khoroshevskyi)
13d571d Added processed tracker (khoroshevskyi)
a17c6f1 cleaning and added cli (khoroshevskyi)
453a332 work on logging (khoroshevskyi)
9c1dc62 Added automatic runner, tests and other files (khoroshevskyi)
81f82d6 lint (khoroshevskyi)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| name: Lint | ||
|
|
||
| on: [pull_request] | ||
|
|
||
| jobs: | ||
| lint: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - uses: actions/setup-python@v5 | ||
| - uses: psf/black@stable |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| ## we can't run tests, but let's just install all dependencies and the package | ||
| name: Installation test | ||
|
|
||
| on: | ||
| push: | ||
| branches: [dev] | ||
| pull_request: | ||
| branches: [master, dev] | ||
|
|
||
| jobs: | ||
| pytest: | ||
| runs-on: ${{ matrix.os }} | ||
| strategy: | ||
| matrix: | ||
| python-version: ["3.10", "3.13"] | ||
| os: [ubuntu-latest] | ||
|
|
||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Python ${{ matrix.python-version }} | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
|
|
||
| - name: Install uv | ||
| run: pip install uv | ||
|
|
||
| - name: Install dev dependencies | ||
| run: if [ -f requirements/requirements-dev.txt ]; then uv pip install -r requirements/requirements-dev.txt --system; fi | ||
|
|
||
| - name: Install package | ||
| run: uv pip install . --system | ||
|
|
||
| - name: Run help | ||
| run: pepembed --help | ||
|
|
||
| # - name: Run pytest tests | ||
| # run: pytest tests -x -vv |
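
The "Run help" step above treats a clean exit from `pepembed --help` as evidence that the package installed and its entry point imports correctly. A minimal sketch of the same check in Python follows, in case it is useful to reproduce locally. The helper name `help_exits_cleanly` is hypothetical, and `python -m json.tool` is only a stand-in for the `pepembed` command so the example runs without the package installed.

```python
import subprocess
import sys
from typing import Sequence


def help_exits_cleanly(command: Sequence[str], timeout: float = 60.0) -> bool:
    """Run `<command> --help` and report whether it exited with status 0."""
    result = subprocess.run(
        [*command, "--help"],
        capture_output=True,  # keep the help text out of the console
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in command (assumption): the workflow itself calls `pepembed`.
    print(help_exits_cleanly([sys.executable, "-m", "json.tool"]))
```

With the package installed, the call would presumably be `help_exits_cleanly(["pepembed"])`. This only confirms that the CLI starts; it does not replace the commented-out pytest step.
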
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,3 @@ | ||
| - """ Package-level data """ | ||
| - from ._version import __version__ | ||
| - import logmuse | ||
| + """Package-level data""" | ||
| | ||
| - logmuse.init_logger("geofetch") | ||
| + from ._version import __version__ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| import logging | ||
| import sys | ||
| import coloredlogs | ||
|
|
||
| from .argparser import app | ||
| from .const import PKG_NAME | ||
|
|
||
| _LOGGER = logging.getLogger(name=PKG_NAME) | ||
| _LOGGER.propagate = False | ||
| coloredlogs.install( | ||
| logger=_LOGGER, | ||
| datefmt="%H:%M:%S", | ||
| fmt="[%(levelname)s] [%(asctime)s] [PEPEMBED] %(message)s", | ||
| ) | ||
|
|
||
|
|
||
| # Add console handler to output logs | ||
| # handler = logging.StreamHandler() | ||
| # handler.setLevel(logging.INFO) | ||
| # formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') | ||
| # handler.setFormatter(formatter) | ||
| # _LOGGER.addHandler(handler) | ||
|
|
||
|
|
||
| def main(): | ||
| app(prog_name=PKG_NAME) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| try: | ||
| main() | ||
|
|
||
| except KeyboardInterrupt: | ||
| print("Pipeline aborted.") | ||
| sys.exit(1) |
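
The commented-out handler in the diff above suggests a plain standard-library alternative to `coloredlogs`. As a rough sketch of what that could look like, the snippet below attaches a single `StreamHandler` with the same format string and date format used in the `coloredlogs.install` call. The function name `build_console_logger` and the logger name `pepembed_demo` are hypothetical and not part of the PR.

```python
import logging


def build_console_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to the console without propagating to the root logger."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # mirrors `_LOGGER.propagate = False` in the diff
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                fmt="[%(levelname)s] [%(asctime)s] [PEPEMBED] %(message)s",
                datefmt="%H:%M:%S",
            )
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Hypothetical logger name, used only for this demonstration.
    build_console_logger("pepembed_demo").info("logger configured")
```

The handler guard matters because `logging.getLogger` returns the same object for the same name, so configuring it twice would otherwise print every message twice.
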
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1 @@ | ||
| - __version__ = "0.0.1" | ||
| + __version__ = "0.1.0" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,132 +1,98 @@ | ||
| from ubiquerg import VersionInHelpParser | ||
| import logging | ||
| import os | ||
| from typing import Optional | ||
|
|
||
| from . import __version__ | ||
| from .const import * | ||
| from ._version import __version__ as pepembed_version | ||
|
|
||
|
|
||
| def build_argparser(): | ||
| banner = "%(prog)s - Run embedding on PEPs" | ||
| additional_description = "pephub.databio.org" | ||
|
|
||
| parser = VersionInHelpParser( | ||
| prog=PKG_NAME, | ||
| description=banner, | ||
| epilog=additional_description, | ||
| version=pepembed_version, | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--verbosity", | ||
| dest="verbosity", | ||
| type=int, | ||
| choices=range(len(LEVEL_BY_VERBOSITY)), | ||
| help="Choose level of verbosity (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--dbg", | ||
| dest="dbg", | ||
| action="store_true", | ||
| help="Enable debug mode (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "-m", | ||
| "--hf-model", | ||
| dest="hf_model", | ||
| default="sentence-transformers/all-MiniLM-L12-v2", | ||
| help="Huggingface model registry (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--keywords-file", | ||
| dest="keywords_file", | ||
| default=None, | ||
| help="File containing keywords to search for (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--postgres-host", | ||
| dest="postgres_host", | ||
| default=None, | ||
| help="Postgres host (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--postgres-port", | ||
| dest="postgres_port", | ||
| default=5432, | ||
| help="Postgres port (default: %(default)s)", | ||
| ) | ||
| import typer | ||
| from dotenv import load_dotenv | ||
|
|
||
| parser.add_argument( | ||
| "--postgres-user", | ||
| dest="postgres_user", | ||
| default=None, | ||
| help="Postgres user (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--postgres-password", | ||
| dest="postgres_password", | ||
| default=None, | ||
| help="Postgres password (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--postgres-db", | ||
| dest="postgres_db", | ||
| default=None, | ||
| help="Postgres database (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--qdrant-host", | ||
| dest="qdrant_host", | ||
| default=None, | ||
| help="Qdrant host (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--qdrant-port", | ||
| dest="qdrant_port", | ||
| default=None, | ||
| help="Qdrant port (default: %(default)s)", | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--qdrant-collection", | ||
| dest="qdrant_collection", | ||
| default=None, | ||
| help="Qdrant collection name (default: %(default)s)", | ||
| ) | ||
| parser.add_argument( | ||
| "--qdrant-api-key", | ||
| dest="qdrant_api_key", | ||
| default=None, | ||
| help="Qdrant API key (default: %(default)s)", | ||
| ) | ||
| from ._version import __version__ as pepembed_version | ||
| from .const import ( | ||
| DEFAULT_BATCH_SIZE, | ||
| DENSE_ENCODER_MODEL, | ||
| PKG_NAME, | ||
| QDRANT_DEFAULT_COLLECTION, | ||
| SPARSE_ENCODER_MODEL, | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--recreate-collection", | ||
| dest="recreate_collection", | ||
| action="store_true", | ||
| help="Recreate collection if it exists (default: %(default)s)", | ||
| ) | ||
| _LOGGER = logging.getLogger(PKG_NAME) | ||
|
|
||
| parser.add_argument( | ||
| "--batch-size", | ||
| dest="batch_size", | ||
| default=100, | ||
| help="Batch size for embedding (default: %(default)s)", | ||
| ) | ||
| app = typer.Typer( | ||
| name=PKG_NAME, | ||
| help="Run embedding on PEPs", | ||
| epilog="pephub.databio.org", | ||
| add_completion=False, | ||
| ) | ||
|
|
||
| parser.add_argument( | ||
| "--upsert-batch-size", | ||
| dest="upsert_batch_size", | ||
| default=1000, | ||
| help="Batch size for upserting embeddings into qdrant (default: %(default)s)", | ||
| ) | ||
|
|
||
| return parser | ||
| def build_argparser(): | ||
| """ | ||
| Build and return the typer app for CLI argument parsing. | ||
| This function maintains compatibility with the original argparse interface. | ||
| """ | ||
| return app | ||
|
|
||
|
|
||
| def version_callback(value: bool): | ||
| if value: | ||
| typer.echo(f"pepembed version: {pepembed_version}") | ||
| raise typer.Exit() | ||
|
|
||
|
|
||
| @app.command() | ||
| def main( | ||
| qdrant_collection: Optional[str] = typer.Option( | ||
| None, | ||
| help="Qdrant collection name", | ||
| ), | ||
| recreate_collection: bool = typer.Option( | ||
| True, | ||
| help="Recreate collection if it exists", | ||
| ), | ||
| batch_size: int = typer.Option( | ||
| DEFAULT_BATCH_SIZE, | ||
| help="Batch size for embedding", | ||
| ), | ||
| dense_model: Optional[str] = typer.Option( | ||
| None, | ||
| help="HuggingFace dense encoder model", | ||
| ), | ||
| sparse_model: Optional[str] = typer.Option( | ||
| None, | ||
| help="HuggingFace sparse encoder model", | ||
| ), | ||
| env_var: Optional[str] = typer.Option( | ||
| None, | ||
| help="Path to .env file, if not set, will not load any .env file", | ||
| ), | ||
| version: bool = typer.Option( | ||
| None, "--version", "-v", callback=version_callback, help="App version" | ||
| ), | ||
| ): | ||
| """Run embedding on PEPs""" | ||
| # Import here to avoid circular imports | ||
| from .pepembed import pepembed | ||
|
|
||
| if env_var: | ||
| load_dotenv(dotenv_path=env_var) | ||
|
|
||
| collection_name = qdrant_collection or os.environ.get( | ||
| "QDRANT_COLLECTION", QDRANT_DEFAULT_COLLECTION | ||
| ) | ||
| hf_model_dense = dense_model or os.environ.get( | ||
| "HF_MODEL_DENSE", DENSE_ENCODER_MODEL | ||
| ) | ||
| hf_model_sparse = sparse_model or os.environ.get( | ||
| "HF_MODEL_SPARSE", SPARSE_ENCODER_MODEL | ||
| ) | ||
|
|
||
| pepembed( | ||
| batch_size=batch_size, | ||
| recreate_collection=recreate_collection, | ||
| collection_name=collection_name, | ||
| hf_model_dense=hf_model_dense, | ||
| hf_model_sparse=hf_model_sparse, | ||
| ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| app() | ||
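
The new `main` command resolves each setting in a fixed order: an explicit CLI option wins, then an environment variable (optionally loaded from a `.env` file), then a package default. The sketch below isolates that precedence rule using only the standard library. The helper `resolve_setting` and the constant `FALLBACK_COLLECTION` are hypothetical; the real defaults come from the package's `const` module, whose values are not shown in this diff.

```python
import os
from typing import Mapping, Optional

# Hypothetical placeholder; the actual default is defined in pepembed's const module.
FALLBACK_COLLECTION = "example_collection"


def resolve_setting(
    cli_value: Optional[str],
    env_key: str,
    default: str,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the CLI value if truthy, else the environment value, else the default."""
    env = os.environ if environ is None else environ
    # `or` means an empty-string CLI value also falls through to the environment.
    return cli_value or env.get(env_key, default)


if __name__ == "__main__":
    fake_env = {"QDRANT_COLLECTION": "from_env"}
    print(resolve_setting("from_cli", "QDRANT_COLLECTION", FALLBACK_COLLECTION, fake_env))
    print(resolve_setting(None, "QDRANT_COLLECTION", FALLBACK_COLLECTION, fake_env))
    print(resolve_setting(None, "QDRANT_COLLECTION", FALLBACK_COLLECTION, {}))
```

One consequence of using `or`, as the diff does, is that an option passed as an empty string is treated the same as an omitted option. Passing the environment as a mapping is a design choice of this sketch that makes the rule easy to test without mutating `os.environ`.
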