Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -56,3 +56,4 @@ dist/*
# Documentation ancillary files
docs/*json
docs/*HDF5
docs/*tif
31 changes: 28 additions & 3 deletions README.md
@@ -16,7 +16,7 @@ JSON files that contain SHA 256 hash values for all variables and groups in
a netCDF4 or HDF-5 file can be generated using either `create_h5_hash_file`
or `create_nc4_hash_file`.

```
```python
from earthdata_hashdiff import create_nc4_hash_file


@@ -33,14 +33,27 @@ The functions to create the hash files have two additional optional arguments:
The default value for this kwarg is to turn off all `xarray` decoding for
CF Conventions, coordinates, times and time deltas.

A similar JSON file can be created for a GeoTIFF file:

```python
from earthdata_hashdiff import create_geotiff_hash_file

create_geotiff_hash_file('path/to/geotiff/file.tif', 'path/to/output/hash.json')
```

This function has one additional optional argument:

* `skipped_metadata_tags` - a set of strings. When specified, the
hashing functionality will not include GeoTIFF metadata tags with those names.
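
As a rough illustration of why skipping tags is useful, the sketch below hashes an array shape, its elements and a dictionary of metadata tags with the standard library only. It is not the implementation used by `earthdata-hashdiff`: the `sketch_geotiff_hash` helper, the JSON serialisation and the example tag values are assumptions made for this example.

```python
import hashlib
import json


def sketch_geotiff_hash(shape, values, tags, skipped_metadata_tags=frozenset()):
    """Return a SHA-256 hex digest of the shape, values and retained tags.

    Hypothetical helper for illustration only, not part of earthdata-hashdiff.
    """
    # Drop any tag whose name appears in the set of skipped tag names.
    retained_tags = {
        name: value
        for name, value in tags.items()
        if name not in skipped_metadata_tags
    }
    # Serialise deterministically so equal inputs always hash identically.
    payload = json.dumps(
        {'shape': list(shape), 'values': list(values), 'tags': retained_tags},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


shape = (2, 2)
values = [290.1, 291.4, 289.8, 290.0]

# Two runs that differ only in a timestamp-like tag (illustrative values).
first_tags = {'ModelPixelScaleTag': [70.0, 70.0, 0.0], 'DateTime': '2025:08:21 10:41:17'}
second_tags = {'ModelPixelScaleTag': [70.0, 70.0, 0.0], 'DateTime': '2025:08:22 09:00:00'}

hashes_differ = (
    sketch_geotiff_hash(shape, values, first_tags)
    != sketch_geotiff_hash(shape, values, second_tags)
)
hashes_match_when_skipped = (
    sketch_geotiff_hash(shape, values, first_tags, {'DateTime'})
    == sketch_geotiff_hash(shape, values, second_tags, {'DateTime'})
)
print(hashes_differ, hashes_match_when_skipped)
```

In this sketch the two digests differ until the volatile tag is excluded, which is the kind of situation `skipped_metadata_tags` is described as handling.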

### Comparisons against reference files

When a JSON file exists with hashed values, it can be used for comparisons. The
public API provides `h5_matches_reference_hash_file` and
`nc4_matches_reference_hash_file`, although both are aliases for the same
underlying functionality using `xarray`:

```
```python
from earthdata_hashdiff import nc4_matches_reference_hash_file


@@ -68,6 +81,18 @@ The comparison functions have three optional arguments:
The default value for this kwarg is to turn off all `xarray` decoding for
CF Conventions, coordinates, times and time deltas.

The same operation can also be performed for a GeoTIFF file in comparison to an
appropriate JSON reference file:

```python
from earthdata_hashdiff import geotiff_matches_reference_hash_file

assert geotiff_matches_reference_hash_file(
'path/to/geotiff/file.tif',
'path/to/json/with/hash.json',
)
```
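
Conceptually, a comparison against a reference file amounts to recomputing hashes and checking them against the stored JSON. The sketch below shows that idea with the standard library only. The `differing_keys` helper and the assumed JSON layout (a flat mapping of names to hash strings) are illustrative assumptions, not the package's API or file format.

```python
import json
import tempfile
from pathlib import Path


def differing_keys(current_hashes, reference_path):
    """Return sorted keys whose hash differs from, or is absent in, the reference.

    Hypothetical helper for illustration only, not part of earthdata-hashdiff.
    """
    reference_hashes = json.loads(Path(reference_path).read_text(encoding='utf-8'))
    all_keys = set(current_hashes) | set(reference_hashes)
    return sorted(
        key
        for key in all_keys
        if current_hashes.get(key) != reference_hashes.get(key)
    )


with tempfile.TemporaryDirectory() as temp_dir:
    # Write a small reference file with made-up names and hash strings.
    reference_path = Path(temp_dir) / 'reference.json'
    reference_path.write_text(
        json.dumps({'/Grid/precipitation': 'abc123', '/Grid/lat': 'def456'}),
        encoding='utf-8',
    )

    unchanged = differing_keys(
        {'/Grid/precipitation': 'abc123', '/Grid/lat': 'def456'}, reference_path
    )
    changed = differing_keys(
        {'/Grid/precipitation': 'zzz999', '/Grid/lat': 'def456'}, reference_path
    )

print(unchanged)
print(changed)
```

Returning the differing names, rather than a single boolean, is a design choice for this sketch: it makes a failing regression test easier to diagnose.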

## Installing

### Using pip
@@ -102,7 +127,7 @@ also contains an update to the `earthdata_hashdiff.__about__.py` file.

Prerequisites:

- Python 3.10+, ideally installed in a virtual environment, such as `pyenv`
- Python 3.11+, ideally installed in a virtual environment, such as `pyenv`
or `conda`.
- A local copy of this repository.

74 changes: 71 additions & 3 deletions docs/Using_earthdata-hashdiff.ipynb
@@ -13,7 +13,7 @@
"\n",
"## What is earthdata-hashdiff?\n",
"\n",
"`earthdata-hashdiff` is a Python package that parses Earth science data file formats (HDF-5 and netCDF4) and hashes the contents of those files. These hashes are stored in a JSON object, which can be saved to disk. This enables the easy storage of a smaller artefact for tasks such as regression testing, while omitting metadata and data attributes that may change between test executions (such as timestamps in history attributes). The package also allows for comparison between a binary file (HDF-5 or netCDF4) and a JSON file containing previously calculated hashes.\n",
"`earthdata-hashdiff` is a Python package that parses Earth science data file formats (HDF-5, netCDF4 and GeoTIFF) and hashes the contents of those files. These hashes are stored in a JSON object, which can be saved to disk. This enables the easy storage of a smaller artefact for tasks such as regression testing, while omitting metadata and data attributes that may change between test executions (such as timestamps in history attributes). The package also allows for comparison between a binary file (HDF-5, netCDF4 or GeoTIFF) and a JSON file containing previously calculated hashes.\n",
"\n",
"## earthdata-hashdiff installation:\n",
"\n",
@@ -41,7 +41,11 @@
"* [3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5](https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2025/090/3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5)\n",
"* [3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5](https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2025/090/3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5)\n",
"\n",
"The notebook will assume that these two files are present in the `docs` directory:"
"Additionally, for GeoTIFF examples, this notebook uses sample data from the ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station (ECOSTRESS) mission. To run examples with GeoTIFFs, please also download the following sample land surface temperature file:\n",
"\n",
"* [ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif](https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01/ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif)\n",
"* \n",
"The notebook will assume that these files are present in the `docs` directory:"
]
},
{
@@ -56,7 +60,9 @@
")\n",
"gpm_3imerghh_granule_two = (\n",
" '3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5'\n",
")"
")\n",
"\n",
"ecostress_granule = 'ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif'"
]
},
{
@@ -240,6 +246,43 @@
"print(json.dumps(gpm_3imerghh_granule_one_decode_hashes, indent=2))"
]
},
{
"cell_type": "markdown",
"id": "a9d1fc61-f2c4-4f8b-b784-e456753d51d8",
"metadata": {},
"source": [
"## Hashing GeoTIFFs:\n",
"\n",
"From version 1.1.0 onwards, `earthdata-hashdiff` can also calculate a hash for a GeoTIFF input. A single hash is generated for the full file, which accounts for:\n",
"\n",
"* The data array shape and elements.\n",
"* GeoTIFF-specific metadata tags.\n",
"\n",
"To remain lightweight, `earthdata-hashdiff` uses the [tifffile package]() to parse GeoTIFF files, rather than requiring GDAL to be installed in the local environment.\n",
"\n",
"The cell below shows the usage of hashing functionality for a GeoTIFF. Note that this function also has the optional `skipped_metadata_tags` argument, which is analogous to the `skipped_metadata_attributes` for netCDF4 and HDF-5 files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "655e3e8e-622f-4923-b55e-7f1237382b03",
"metadata": {},
"outputs": [],
"source": [
"from earthdata_hashdiff import create_geotiff_hash_file, get_hash_from_geotiff_file\n",
"\n",
"# Create an in-memory dictionary for the GeoTIFF hash value:\n",
"geotiff_hash_dictionary = get_hash_from_geotiff_file(ecostress_granule, set())\n",
"print(json.dumps(geotiff_hash_dictionary, indent=2))\n",
"\n",
"# Generate the same hash dictionary and write out to a JSON file:\n",
"create_geotiff_hash_file(\n",
" ecostress_granule,\n",
" f'{ecostress_granule}.json',\n",
")"
]
},
{
"cell_type": "markdown",
"id": "170873bf-39f2-4907-a9c9-78fe49dee330",
@@ -405,6 +448,31 @@
"), 'Binary file did not match previously generated hashes.'"
]
},
{
"cell_type": "markdown",
"id": "e882be95-bc98-4aeb-82cf-68563e949973",
"metadata": {},
"source": [
"## Comparisons with GeoTIFFs\n",
"\n",
"These work in the same way as the comparisons for netCDF4 and HDF-5 files. The cell below will use the previously generated JSON reference file for the ECOSTRESS granule:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48971368-e17a-4267-be2d-f8a2f7680a83",
"metadata": {},
"outputs": [],
"source": [
"from earthdata_hashdiff import geotiff_matches_reference_hash_file\n",
"\n",
"assert geotiff_matches_reference_hash_file(\n",
" ecostress_granule,\n",
" f'{ecostress_granule}.json',\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c61a5f43-2bf2-42f6-8c39-abeef381816f",
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,4 +1,4 @@
# These packages are required to run the documentation Jupyter notebook.
earthdata-hashdiff ~= 1.0.1
earthdata-hashdiff ~= 1.1.0
notebook ~= 7.4.5
requests ~= 2.32.4