48 commits
a2c5895
Install postgres dependencies for tests
JeremyMcCormick Mar 19, 2026
0e697d4
Rename test modules using snakecase
JeremyMcCormick Feb 17, 2026
074b4b7
Write update records to JSON file when storing replica chunks
JeremyMcCormick Feb 10, 2026
1cb3495
Separate APDB test functionality into a mixin class
JeremyMcCormick Feb 10, 2026
a4d2217
Use dax_ppdbx_gcp ticket for development
JeremyMcCormick Feb 11, 2026
eb04367
WIP: Add test of GCS upload
JeremyMcCormick Feb 11, 2026
ebfad29
Upload the JSON file with APDB record updates when present
JeremyMcCormick Feb 11, 2026
6d7dca2
WIP: Add support for expanding update records (modules need to be ren…
JeremyMcCormick Feb 13, 2026
dcbf992
WIP: Updates to ppdb_bigquery and test
JeremyMcCormick Feb 13, 2026
2312aee
Create new package for handling APDB updates
JeremyMcCormick Feb 17, 2026
e307d47
Add expanded_update_record module
JeremyMcCormick Feb 17, 2026
021af3d
Rename update_handler module to update_record_expander
JeremyMcCormick Feb 17, 2026
465cf32
Model record_id as a list of integers
JeremyMcCormick Feb 17, 2026
4a9df5d
Add insertion of update records into BigQuery
JeremyMcCormick Feb 17, 2026
db06460
Add preliminary implementation of update records table dedup in BQ
JeremyMcCormick Feb 18, 2026
f3a1924
Use a hashed value of record ID for deduplication
JeremyMcCormick Feb 18, 2026
be9175f
WIP on update merge implementation
JeremyMcCormick Feb 18, 2026
20e3b2b
Add google-cloud-bigquery requirement
JeremyMcCormick Feb 19, 2026
ecca19b
Rearrange tests to guard against missing google deps
JeremyMcCormick Feb 20, 2026
74c4f74
Add update merging support for DiaSource and DiaForcedSource tables
JeremyMcCormick Feb 20, 2026
300a228
Remove requirements that we don't want installed by default in testing
JeremyMcCormick Feb 21, 2026
183bae5
ruff
JeremyMcCormick Feb 21, 2026
61821dc
Add build tools to Dockerfile
JeremyMcCormick Feb 23, 2026
5d1a3af
Move engine creation out of `make_database` method
JeremyMcCormick Feb 24, 2026
8d48f47
Move building of connect args into separate method
JeremyMcCormick Feb 24, 2026
cb9931b
Move listener config to separate method
JeremyMcCormick Feb 24, 2026
41442f8
Add support for getting db password from Google Secrets Manager
JeremyMcCormick Feb 24, 2026
dfaa656
Rearrange SQL init code
JeremyMcCormick Feb 25, 2026
bc9aeeb
Rename the `config` module to `ppdb_config`
JeremyMcCormick Feb 25, 2026
331cf4a
Move the method for getting promotable chunks to PpdbBigQuery
JeremyMcCormick Feb 25, 2026
3df612e
Move SQL files into `config/sql` dir
JeremyMcCormick Feb 25, 2026
295e1dc
Add `sql_resource` module for accessing SQL files as resources
JeremyMcCormick Feb 25, 2026
c9d375d
Add GCS URI to PpdbReplicaChunkExtended model and database
JeremyMcCormick Feb 26, 2026
da7d823
Add `UpdatesManager` for applying updates from JSON files in GCS
JeremyMcCormick Feb 26, 2026
3264f36
Add .scratch to .gitignore
JeremyMcCormick Feb 27, 2026
73869d5
Cleanup some test classes (WIP) and other minor changes
JeremyMcCormick Feb 27, 2026
b34a138
Add BigQuery classes from dax_ppdbx_gcp
JeremyMcCormick Feb 27, 2026
380a0a6
WIP: Integrate application of updates into promotion process
JeremyMcCormick Mar 18, 2026
c9f6510
Add check in tests to skip if there are no valid Google credentials
JeremyMcCormick Mar 19, 2026
4072a8d
FIXUP
JeremyMcCormick Mar 19, 2026
7484bca
Fix type alias issue reported by ruff
JeremyMcCormick Mar 19, 2026
d52dfb4
Add missing docstring
JeremyMcCormick Mar 19, 2026
f7cd791
Add `mark_chunks_promoted` method
JeremyMcCormick Mar 19, 2026
0519130
Fix mypy errors
JeremyMcCormick Mar 19, 2026
bb01e3a
WIP: Introduce class for handling SQL passwords
JeremyMcCormick Mar 20, 2026
855c826
Fix circular reference in imports
JeremyMcCormick Mar 20, 2026
cdfa69f
Remove unnecessary property functions
JeremyMcCormick Mar 20, 2026
32f0fea
Remove no longer necessary check for test execution
JeremyMcCormick Mar 20, 2026
7 changes: 6 additions & 1 deletion .github/workflows/build.yaml
@@ -36,12 +36,17 @@ jobs:
mamba install -y -q pip wheel
pip install uv

- name: Install Postgres for testing
shell: bash -l {0}
run: |
mamba install -y -q postgresql

- name: Install dependencies
shell: bash -l {0}
run: |
uv pip install -r requirements.txt
uv pip install testing.postgresql

# We have two cores so we can speed up the testing with xdist
- name: Install pytest packages
shell: bash -l {0}
run: |
3 changes: 3 additions & 0 deletions .gitignore
@@ -27,3 +27,6 @@ pytest_session.txt

# VS Code
.vscode

# Scratch directory
.scratch
13 changes: 8 additions & 5 deletions docker/Dockerfile.replication
@@ -3,11 +3,14 @@ FROM python:3.12-slim-bookworm
ENV DEBIAN_FRONTEND=noninteractive

# Update and install OS dependencies
RUN apt-get -y update && \
apt-get -y upgrade && \
apt-get -y install --no-install-recommends git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
python3-dev \
pkg-config \
git \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Install required python build dependencies
RUN pip install --upgrade --no-cache-dir pip setuptools wheel uv
7 changes: 3 additions & 4 deletions pyproject.toml
@@ -23,10 +23,12 @@ classifiers = [
keywords = ["lsst"]
dependencies = [
"astropy",
"google-cloud-bigquery",
"pyarrow",
"pydantic >=2,<3",
"pyyaml >= 5.1",
"sqlalchemy",
"lsst-dax-ppdbx-gcp",
"lsst-felis",
"lsst-sdm-schemas",
"lsst-utils",
@@ -43,9 +45,6 @@ test = [
"pytest >= 3.2",
"pytest-openfiles >= 0.5.0"
]
gcp = [
"lsst-dax-ppdbx-gcp"
]

[tool.setuptools.packages.find]
where = ["python"]
@@ -54,7 +53,7 @@ where = ["python"]
zip-safe = true

[tool.setuptools.package-data]
"lsst.dax.ppdb" = ["py.typed"]
"lsst.dax.ppdb" = ["py.typed", "config/schemas/*.yaml", "config/sql/*.sql"]

[tool.setuptools.dynamic]
version = { attr = "lsst_versions.get_lsst_version" }
2 changes: 1 addition & 1 deletion python/lsst/dax/ppdb/__init__.py
@@ -19,7 +19,7 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

from .config import *
from .ppdb_config import *
from .ppdb import *
from .replicator import *
from .version import * # Generated by sconsUtils
2 changes: 1 addition & 1 deletion python/lsst/dax/ppdb/_factory.py
@@ -26,8 +26,8 @@
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from .config import PpdbConfig
from .ppdb import Ppdb
from .ppdb_config import PpdbConfig


def config_type_for_name(type_name: str) -> type[PpdbConfig]:
1 change: 1 addition & 0 deletions python/lsst/dax/ppdb/bigquery/__init__.py
@@ -20,5 +20,6 @@
# along with this program. If not, see <https://www.gnu.org/licenses/>.

from .manifest import Manifest
from .chunk_uploader import ChunkUploader
from .ppdb_bigquery import PpdbBigQuery, PpdbBigQueryConfig
from .ppdb_replica_chunk_extended import ChunkStatus, PpdbReplicaChunkExtended
60 changes: 38 additions & 22 deletions python/lsst/dax/ppdb/bigquery/chunk_uploader.py
@@ -42,7 +42,7 @@
) from e


from ..config import PpdbConfig
from ..ppdb_config import PpdbConfig
from .manifest import Manifest
from .ppdb_bigquery import PpdbBigQuery, PpdbBigQueryConfig
from .ppdb_replica_chunk_extended import ChunkStatus, PpdbReplicaChunkExtended
@@ -237,15 +237,27 @@ def _process_chunk(self, replica_chunk: PpdbReplicaChunkExtended) -> None:
)

# Make a list of local parquet files to upload.
parquet_files = list(chunk_dir.glob("*.parquet"))
upload_file_list = list(chunk_dir.glob("*.parquet"))

# Include the update records file if the manifest indicates it should
# exist
if manifest.includes_update_records:
update_records_file = chunk_dir / "update_records.json"
if not update_records_file.exists():
raise ChunkUploadError(
chunk_id,
f"Manifest indicates update records are included but file does not exist: "
f"{update_records_file}",
)
upload_file_list.append(update_records_file)

# Check if the chunk is expected to be empty.
is_empty = manifest.is_empty_chunk()

if not parquet_files and not is_empty:
if not upload_file_list and not is_empty:
# There is a mismatch between the manifest and the actual files.
# Some processing error may have occurred when exporting.
raise ChunkUploadError(chunk_id, f"No parquet files found in {chunk_dir} for non-empty chunk")
raise ChunkUploadError(chunk_id, f"No files found to upload in {chunk_dir} for non-empty chunk")

# Check that all expected parquet files from the manifest are present.
for table_name, table_stats in manifest.table_data.items():
@@ -258,24 +270,21 @@ def _process_chunk(self, replica_chunk: PpdbReplicaChunkExtended) -> None:
)

try:
# 1) Upload parquet files, which will happen only for non-empty
# chunks.
if parquet_files:
gcs_names = {path: posixpath.join(gcs_prefix, path.name) for path in parquet_files}
# 1) Upload the files to GCS for non-empty chunks
if upload_file_list:
gcs_names = {path: posixpath.join(gcs_prefix, path.name) for path in upload_file_list}
try:
_LOG.info(
"Uploading %d parquet files to GCS under prefix: %s", len(gcs_names), gcs_prefix
)
_LOG.info("Uploading %d files to GCS under prefix: %s", len(gcs_names), gcs_prefix)
with Timer(
"upload_files_time", _MON, tags={"prefix": str(gcs_prefix), "chunk_id": str(chunk_id)}
) as timer:
self.storage.upload_files(gcs_names)
total_bytes = sum(p.stat().st_size for p in parquet_files)
total_bytes = sum(p.stat().st_size for p in upload_file_list)
timer.add_values(file_count=len(gcs_names), total_bytes=total_bytes)
except* UploadError as eg:
raise ChunkUploadError(chunk_id, f"{len(eg.exceptions)} upload(s) failed") from eg

# 2) Upload manifest, even for empty chunks.
# 2) Upload manifest, even for empty chunks
try:
self.storage.upload_from_string(
posixpath.join(gcs_prefix, replica_chunk.manifest_name),
Expand All @@ -284,22 +293,29 @@ def _process_chunk(self, replica_chunk: PpdbReplicaChunkExtended) -> None:
except UploadError as e:
raise ChunkUploadError(chunk_id, "Manifest upload failed") from e

# 3) Update DB status, but not for empty chunks.
# Next two steps are inapplicable to empty chunks.
if not is_empty:
# 3) Update status and GCS URI in the database
gcs_prefix = posixpath.join(self.bucket_name, gcs_prefix)
updated_replica_chunk = replica_chunk.with_new_status(ChunkStatus.UPLOADED).with_new_gcs_uri(
f"gs://{gcs_prefix}"
)
try:
self._bq.store_chunk(replica_chunk.with_new_status(ChunkStatus.UPLOADED), True)
self._bq.store_chunk(updated_replica_chunk, True)
_LOG.info(
"Updated replica chunk %d in database with status 'uploaded' and GCS URI: %s",
chunk_id,
gcs_prefix,
)
except Exception as e:
raise ChunkUploadError(
chunk_id, "failed to update replica chunk status in database"
) from e
raise ChunkUploadError(chunk_id, "Failed to update replica chunk in database") from e

# 4) Publish Pub/Sub staging message to trigger BigQuery load, but
# not for empty chunks. (Empty chunks cannot be staged.)
if not is_empty:
# 4) Publish Pub/Sub event to trigger staging of the chunk in
# BigQuery
try:
self._post_to_stage_chunk_topic(self.bucket_name, gcs_prefix, chunk_id)
except Exception as e:
raise ChunkUploadError(chunk_id, "failed to publish staging message") from e
raise ChunkUploadError(chunk_id, "Failed to publish staging message") from e

except ChunkUploadError as err:
try:
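Taken together, the rewritten `_process_chunk` runs four ordered steps: upload the chunk's data files (parquet plus the optional `update_records.json`), upload the manifest even for empty chunks, record the `uploaded` status and `gs://` URI in the database, and publish a Pub/Sub event to trigger BigQuery staging. A minimal sketch of that control flow, with hypothetical callables standing in for the GCS, database, and Pub/Sub clients (`upload`, `upload_manifest`, `mark_uploaded`, and `publish_stage_event` are illustrative names, not the PR's actual API):

```python
import posixpath
from pathlib import Path
from typing import Callable


def upload_chunk_sketch(
    chunk_dir: Path,
    gcs_prefix: str,
    includes_update_records: bool,
    upload: Callable[[Path, str], None],
    upload_manifest: Callable[[str], None],
    mark_uploaded: Callable[[str], None],
    publish_stage_event: Callable[[str], None],
) -> list[str]:
    """Mirror the four upload steps; return the GCS object names uploaded."""
    files = sorted(chunk_dir.glob("*.parquet"))
    if includes_update_records:
        records = chunk_dir / "update_records.json"
        if not records.exists():
            # Manifest/file mismatch, analogous to ChunkUploadError in the PR.
            raise FileNotFoundError(records)
        files.append(records)
    gcs_names = {path: posixpath.join(gcs_prefix, path.name) for path in files}
    is_empty = not files  # stand-in for manifest.is_empty_chunk()

    # 1) Upload data files (only non-empty chunks have any).
    for path, name in gcs_names.items():
        upload(path, name)
    # 2) Upload the manifest, even for empty chunks.
    upload_manifest(posixpath.join(gcs_prefix, "manifest.json"))
    if not is_empty:
        # 3) Record status and gs:// URI in the tracking database
        #    ("bucket" is a placeholder bucket name).
        mark_uploaded(f"gs://{posixpath.join('bucket', gcs_prefix)}")
        # 4) Publish a Pub/Sub event to trigger staging into BigQuery.
        publish_stage_event(gcs_prefix)
    return list(gcs_names.values())
```

Keeping the manifest upload outside the `is_empty` guard matches the PR's intent: downstream consumers can always discover a chunk's manifest, while staging is only triggered for chunks that actually carry data.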
15 changes: 11 additions & 4 deletions python/lsst/dax/ppdb/bigquery/manifest.py
@@ -79,6 +79,10 @@ class Manifest(BaseModel):
"""Name of the compression format used for artifacts (e.g., "gzip",
"zstd", "snappy", etc.)."""

includes_update_records: bool = False
"""Whether the exported data includes update records (e.g., in a separate
file) or not (`bool`)."""

@property
def filename(self) -> str:
"""Generate the filename for this manifest based on the replica chunk
@@ -118,12 +122,15 @@ def from_json_file(cls, file_path: Path) -> Manifest:

def is_empty_chunk(self) -> bool:
"""Check if the manifest represents an empty replica chunk in which
all tables have zero rows.
all tables have zero rows and no update records are included.

Returns
-------
bool
`True` if all tables have zero rows, indicating an empty chunk,
`False` otherwise.
`True` if all tables have zero rows and no update records are
included, indicating an empty chunk, `False` otherwise.
"""
return all(table.row_count == 0 for table in self.table_data.values())
return (
all(table.row_count == 0 for table in self.table_data.values())
and not self.includes_update_records
)
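The new `includes_update_records` flag changes what counts as an empty chunk: a chunk with zero rows in every table is still non-empty if an update-records file accompanies it. A minimal, Pydantic-free sketch of the revised logic (the dataclasses here are illustrative stand-ins for the real `Manifest` model, not the PR's classes):

```python
from dataclasses import dataclass, field


@dataclass
class TableStats:
    row_count: int


@dataclass
class ManifestSketch:
    table_data: dict[str, TableStats] = field(default_factory=dict)
    includes_update_records: bool = False

    def is_empty_chunk(self) -> bool:
        # Empty only if every table has zero rows AND no update-records
        # file accompanies the chunk.
        return (
            all(t.row_count == 0 for t in self.table_data.values())
            and not self.includes_update_records
        )
```

Under this definition, a zero-row chunk that carries update records still goes through the full upload-and-stage path rather than being skipped as empty.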