async-provider-etl is a Python package that performs ETL operations on hospital CMS data using asynchronous
and parallel processing. The package downloads datasets concurrently, processes the resulting dataframes in parallel,
and stores the hospital data and associated metadata in SQLite databases.
- Asynchronous Data Extraction: Utilizes `aiohttp` for efficient, non-blocking HTTP requests to download datasets.
- Parallel Data Processing: Leverages `asyncio`, `ProcessPoolExecutor` and `ThreadPoolExecutor` for concurrent and parallel processing.
- SQLite Integration: Stores metadata and processed data in SQLite databases using `aiosqlite`, ensuring efficient, non-blocking queries to the embedded database.
- Command-Line Interface: Configurable via CLI arguments for verbose logging.
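The extract, transform and load stages listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: to stay standard-library only, `asyncio.sleep` stands in for the `aiohttp` download and blocking `sqlite3` calls run through `asyncio.to_thread` stand in for `aiosqlite`. The dataset identifiers, table layout and all function names are hypothetical.

```python
import asyncio
import os
import sqlite3
import tempfile
from concurrent.futures import Executor, ThreadPoolExecutor

# Hypothetical dataset identifiers, used only for this illustration.
DATASET_IDS = ["hospital_a", "hospital_b", "hospital_c"]


async def fetch_dataset(dataset_id: str) -> list:
    """Stand-in for an aiohttp download: yields to the event loop, returns fake rows."""
    await asyncio.sleep(0.01)  # simulates non-blocking network latency
    return [
        {"dataset": dataset_id, "Provider Name": f"{dataset_id} clinic {i}"}
        for i in range(3)
    ]


def to_snake_case_rows(rows: list) -> list:
    """Transform step; kept at module level so a ProcessPoolExecutor could pickle it."""
    return [{k.lower().replace(" ", "_"): v for k, v in row.items()} for row in rows]


def store_rows(db_path: str, rows: list) -> int:
    """Blocking sqlite3 write (stand-in for aiosqlite); returns the rows written."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS providers (dataset TEXT, provider_name TEXT)"
            )
            conn.executemany(
                "INSERT INTO providers VALUES (:dataset, :provider_name)", rows
            )
    finally:
        conn.close()
    return len(rows)


async def run_pipeline(db_path: str, executor: Executor) -> int:
    loop = asyncio.get_running_loop()
    # Extract: all downloads are in flight at the same time.
    raw = await asyncio.gather(*(fetch_dataset(d) for d in DATASET_IDS))
    # Transform: each dataset is handed to the executor's workers.
    cleaned = await asyncio.gather(
        *(loop.run_in_executor(executor, to_snake_case_rows, rows) for rows in raw)
    )
    # Load: write one dataset at a time in a worker thread, off the event loop.
    total = 0
    for rows in cleaned:
        total += await asyncio.to_thread(store_rows, db_path, rows)
    return total


def demo() -> int:
    """Run the sketch against a throwaway database file and return the row count."""
    with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(max_workers=2) as pool:
        return asyncio.run(run_pipeline(os.path.join(tmp, "etl.db"), pool))
```

`run_pipeline` accepts any `concurrent.futures.Executor`, so a `ProcessPoolExecutor` could be passed instead of the thread pool when the transform step is CPU-bound.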
If `pipx` is not already installed, you will need to install it first.
To install the .whl file using pipx, you can use the following command:
$ pipx install "dist/async_provider_etl-0.1.0-py3-none-any.whl"
Then, to trigger the ETL job, simply call the installed package:
$ async-provider-etl
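The package's actual option names are not listed here, so the following is only a sketch of how a verbose-logging switch is commonly wired up with `argparse`. The `-v`/`--verbose` flag and both function names are assumptions made for illustration, not the package's documented interface.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="async-provider-etl")
    # Hypothetical flag: the real option name may differ.
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="enable debug-level logging"
    )
    return parser


def configure_logging(argv=None) -> int:
    """Parse CLI arguments, set the root log level, and return the chosen level."""
    args = build_parser().parse_args(argv)
    level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(message)s")
    return level
```

With this sketch, passing the assumed `--verbose` flag selects `logging.DEBUG`, and omitting it leaves the level at `logging.INFO`.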