
Improve CSV chunk ingestion reliability, add HMSDMS RA/Dec support, and include Smith Southern Sky compilation tutorial#16

Merged
Yong2Sheng merged 2 commits into develop from feature/Smith_Southern_Standard_Stars
Mar 3, 2026

@Yong2Sheng
Owner

Summary

This PR contains three practical improvements for my standard star catalog ingestion pipeline:

  1. I centralized the logic that estimates tqdm(total=...) for pd.read_csv(..., chunksize=...) so the progress bar total stays consistent with the actual reader behavior when header, names, and skiprows are configured in different combinations.
  2. I added an option to support RA/Dec inputs provided in hmsdms format when writing the HDF5 standard star table. When enabled, the function converts coordinates to degrees before computing HEALPix ipix and the coarse bucket id.
  3. I added a tutorial notebook that demonstrates how I compile the Smith Southern Sky standard star catalog end to end.

Motivation

  • I previously computed the expected number of chunks using a fast physical line counter and a manually tracked skiprows. This worked for older catalogs where I used header=None and skiprows=1, but it became fragile when switching to catalogs with a proper header (header=0) or when combining header=0 with names=... to override column names.
  • The mismatch between my estimated totals and the actual pd.read_csv iterator caused tqdm to end early relative to total, showing a red progress bar even though no exception was raised.
  • Some catalogs provide RA/Dec as hmsdms, so I needed a first-class way to ingest those without adding ad hoc conversions outside the writer.
  • I also wanted a reproducible, documented example for compiling Smith Southern Sky so that I (and future users) can rerun the workflow with minimal guesswork.

Changes

1) Robust progress bar total estimation for chunked CSV reads

  • I added a helper that infers the number of non-data leading lines based on read_csv_kwargs (for example, accounting for the header line when header=0).
  • The helper then estimates total_chunks = ceil(n_data_lines / chunksize) and returns None when it cannot safely infer a total (for example, when skiprows is list-like or callable). In that case I pass total=None to tqdm to avoid incorrect totals.
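
A minimal sketch of how such a helper might look, using only the standard library. The names estimate_total_chunks and count_lines_fast come from this PR, but the bodies below are my illustration rather than the merged code, and infer_leading_lines is a hypothetical name. It assumes pandas' default header inference (header=0 when names is not given, header=None when it is) and does not account for blank lines or comment lines that read_csv may skip.

```python
import math
from typing import Any, Callable, Mapping, Optional


def count_lines_fast(path: str) -> int:
    """Count physical lines by scanning the file in 1 MiB binary blocks."""
    n_newlines = 0
    last_byte = b"\n"
    with open(path, "rb") as fh:
        while True:
            block = fh.read(1 << 20)
            if not block:
                break
            n_newlines += block.count(b"\n")
            last_byte = block[-1:]
    # A final line without a trailing newline still counts as a line.
    return n_newlines + (0 if last_byte == b"\n" else 1)


def _is_plain_int(value: Any) -> bool:
    return isinstance(value, int) and not isinstance(value, bool)


def infer_leading_lines(read_csv_kwargs: Mapping[str, Any]) -> Optional[int]:
    """Return the number of non-data leading lines, or None if unknown."""
    skiprows = read_csv_kwargs.get("skiprows")
    if skiprows is None:
        n_skipped = 0
    elif _is_plain_int(skiprows) and skiprows >= 0:
        n_skipped = skiprows
    else:
        # List-like or callable skiprows: no safe estimate.
        return None

    header = read_csv_kwargs.get("header", "infer")
    if isinstance(header, str) and header == "infer":
        # Assumed pandas default: a header row exists unless names is given.
        header = 0 if read_csv_kwargs.get("names") is None else None

    if header is None:
        n_header = 0
    elif _is_plain_int(header) and header >= 0:
        # Rows before the header row are discarded along with it.
        n_header = header + 1
    else:
        # Multi-row headers and other forms are not handled in this sketch.
        return None
    return n_skipped + n_header


def estimate_total_chunks(
    file: str,
    chunksize: int,
    read_csv_kwargs: Mapping[str, Any],
    count_lines_fn: Callable[[str], int] = count_lines_fast,
) -> Optional[int]:
    """Estimate ceil(n_data_lines / chunksize), or None when unsure."""
    n_leading = infer_leading_lines(read_csv_kwargs)
    if n_leading is None or chunksize <= 0:
        return None
    n_data_lines = max(count_lines_fn(file) - n_leading, 0)
    return math.ceil(n_data_lines / chunksize)
```

Returning None instead of a guess means tqdm falls back to an open-ended bar, which avoids finishing short of an overstated total.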

Example usage:

total_chunks = estimate_total_chunks(
    file=file,
    chunksize=chunksize,
    read_csv_kwargs=read_csv_kwargs,
    count_lines_fn=count_lines_fast,
)

reader = pd.read_csv(file, chunksize=chunksize, **read_csv_kwargs)

pbar_chunks = tqdm(
    reader,
    desc="Chunks",
    total=(total_chunks if total_chunks not in (None, 0) else None),
    position=1,
    leave=False,
    dynamic_ncols=True,
)

2) Add ra_dec_hmsdms support in write_std_h5

  • I added a boolean argument ra_dec_hmsdms.
  • When True, I convert the RA/Dec columns from (hourangle, deg) to degrees using astropy.coordinates.SkyCoord, then proceed with HEALPix indexing as usual.

Key logic:

if ra_dec_hmsdms:
    coord = SkyCoord(
        ra=chunk["ra"],
        dec=chunk["dec"],
        unit=(u.hourangle, u.deg),
    )
    chunk["ra"] = coord.ra.deg
    chunk["dec"] = coord.dec.deg

ra = chunk["ra"].to_numpy(dtype=float) * u.deg
dec = chunk["dec"].to_numpy(dtype=float) * u.deg
ipix = hp.lonlat_to_healpix(ra, dec).astype(np.int32)
bucket = (ipix // int(bucket_size)).astype(np.int32)
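
For readers unfamiliar with the hmsdms convention, here is a dependency-free sketch of the arithmetic behind that conversion. It is not what write_std_h5 does internally (the PR relies on SkyCoord for parsing); hms_to_deg, dms_to_deg, and bucket_id are hypothetical names used only for illustration, and the parser assumes simple inputs such as "12:30:00" or "-30d15m00s".

```python
import re
from typing import Tuple


def _split_sexagesimal(text: str) -> Tuple[float, float, float, float]:
    """Split a sexagesimal string into (sign, major, minutes, seconds)."""
    text = text.strip()
    # Keep the sign separately so that values like "-00:30:00" stay negative.
    sign = -1.0 if text.startswith("-") else 1.0
    fields = [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", text)]
    if not fields or len(fields) > 3:
        raise ValueError(f"cannot parse sexagesimal value: {text!r}")
    fields += [0.0] * (3 - len(fields))
    return sign, fields[0], fields[1], fields[2]


def hms_to_deg(ra_text: str) -> float:
    """Right ascension in hours:minutes:seconds to degrees (1 h = 15 deg)."""
    _, hours, minutes, seconds = _split_sexagesimal(ra_text)
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)


def dms_to_deg(dec_text: str) -> float:
    """Declination in degrees:arcminutes:arcseconds to decimal degrees."""
    sign, degrees, arcmin, arcsec = _split_sexagesimal(dec_text)
    return sign * (degrees + arcmin / 60.0 + arcsec / 3600.0)


def bucket_id(ipix: int, bucket_size: int) -> int:
    """Coarse bucket: consecutive runs of bucket_size pixels share one id."""
    return ipix // bucket_size
```

The factor of 15 is why the unit tuple is (u.hourangle, u.deg): reading the RA column as degrees would place every source at one fifteenth of its true right ascension and assign it to the wrong HEALPix pixel.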

3) Add tutorial notebook for Smith Southern Sky compilation

  • I added a tutorial notebook that walks through the complete workflow to compile the Smith Southern Sky standard star catalog.

Testing

  • I verified that chunk iteration completes with a consistent progress bar total for multiple CSV layouts:
    • catalogs with a real header (header=0)
    • catalogs where I override column names (header=0, names=colnames)
    • legacy catalogs where I skip a nonstandard header line (header=None, names=colnames, skiprows=1)
  • I tested write_std_h5 on a catalog providing RA/Dec as hmsdms and confirmed:
    • coordinate conversion succeeds
    • HEALPix ipix and bucket columns are computed as expected
    • HDF5 table append and index creation complete
  • I ran the new Smith Southern Sky tutorial notebook and confirmed it produces the expected HDF5 output.

Notes

  • I still need to update the previous notebook, but I am too tired right now and need to rest. I will do it in a separate bugfix branch.

…s; fix bug when counting number of lines and chunks for tqdm progress bar; make the code compatible with RA Dec in hmsdms format.
@Yong2Sheng Yong2Sheng merged commit 7d35480 into develop Mar 3, 2026
4 checks passed