Draft
Split Avro-format downloads (simple Avro) into chunks as they are uploaded. Upload contents of Zip files (very rough WIP)
LOG.debug("Copying Avro data to new file {}", output.getAbsolutePath());

dfw = new RawDataFileWriter<>(rdw);
dfw.setCodec(CodecFactory.deflateCodec(8)); // TODO: Configure compression?
Member: I can't imagine we need to. Deflate is the only codec required by the Avro spec, so everything will support it.
Member (Author): I just mean the level 8: I don't know where the sweet spot is between CPU time, network bandwidth and storage cost. Possibly it should just be the maximum, to reduce the storage cost.
Member: Makes sense. Maximum compression until we find a need not to.
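The CPU-versus-size trade-off under discussion can be sketched with `java.util.zip.Deflater`, which accepts the same 1–9 level range as Avro's `CodecFactory.deflateCodec(level)`. The sample input below is invented purely for illustration:

```java
import java.util.zip.Deflater;

// Sketch: compare deflate output sizes at two compression levels. Level 9
// trades more CPU time for smaller output, which for archival uploads
// usually means lower storage and bandwidth cost.
public class DeflateLevels {

    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Repetitive input, loosely resembling columnar occurrence data.
        byte[] data = "occurrenceID\tspecies\tdecimalLatitude\tdecimalLongitude\n"
                .repeat(10_000).getBytes();
        System.out.println("level 1: " + deflatedSize(data, 1) + " bytes");
        System.out.println("level 9: " + deflatedSize(data, 9) + " bytes");
    }
}
```

For highly repetitive data like this, level 9 is never larger than level 1; whether the extra CPU time is worth it depends on how often the downloads are written versus read.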
Force-pushed from f12a160 to b997556.
A task in the work programme is to provide GBIF-mediated data on open public and research cloud infrastructures, for easier use of very large datasets and improved data persistence.
We already have funding from Microsoft AI for Earth, so the first cloud infrastructure will be Azure.
We also manually upload a GBIF download for Map of Life to Google GCS every month (to Map of Life's bucket), and some GBIF users have used Google BigQuery, so automating upload to GCS is useful too.
Finally, uploading any GBIF download to a cloud system can be useful where it allows users to avoid using a slow internet connection.
Therefore we should:
b. provide a way for any GBIF user to upload any GBIF download to their own Azure cloud storage, given that they provide the necessary credentials.
c. provide information/metadata to allow users of these data uploads to cite the data appropriately, either as a whole or by creating a derived-dataset citation.
The initial aim is to support the SIMPLE_AVRO format (the same content as SIMPLE_CSV, but in Avro). On HDFS this is stored as a single Avro file, which can be split into chunks as it is uploaded. (I would avoid making everything run in parallel and as fast as possible; we don't necessarily want to use 100% of our network bandwidth on this.) The SIMPLE_AVRO_WITH_VERBATIM and MAP_OF_LIFE formats would work the same way.
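A minimal sketch of the chunking idea: stream the single HDFS Avro file and hand fixed-size byte ranges to an uploader, so the whole download is never held in memory. The `ChunkUploader` interface and the chunk size are assumptions for illustration (standing in for an Azure or GCS block upload); note that splitting at arbitrary byte offsets does not yield standalone Avro files, so a real implementation would split on Avro block boundaries instead:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch: stream a large download and pass fixed-size chunks to an uploader.
// ChunkUploader is a hypothetical callback, not part of any cloud SDK.
public class ChunkedUpload {

    interface ChunkUploader {
        void upload(int index, byte[] chunk) throws IOException;
    }

    static int split(InputStream in, int chunkSize, ChunkUploader uploader) throws IOException {
        byte[] buf = new byte[chunkSize];
        int index = 0;
        int filled;
        while ((filled = in.readNBytes(buf, 0, chunkSize)) > 0) {
            // Copy so the uploader owns its chunk; the final chunk may be short.
            uploader.upload(index++, Arrays.copyOf(buf, filled));
        }
        return index; // number of chunks handed to the uploader
    }
}
```

Because the chunks are uploaded one at a time as they are read, this also naturally throttles the transfer rather than saturating the network.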
I also tried uploading zipped-Avro format downloads, i.e. BIONOMIA, which is a zip of three Avro tables, each split into many chunks within the zip file; the code uploads the contents of the zip file rather than the zip file itself.
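The zip handling can be sketched with `java.util.zip`: walk the entries of the zipped download so each one can be uploaded under its entry name, rather than uploading the archive as a single object. The `entryNames` helper and the example entry name in the comment are illustrative, not from the PR:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch: list the file entries of a zipped download. In a real uploader,
// each entry's stream would be sent to cloud storage under its entry name.
public class ZipContents {

    static List<String> entryNames(InputStream zipStream) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName()); // e.g. "occurrence/occurrence-00000.avro"
                }
                zip.closeEntry();
            }
        }
        return names;
    }
}
```

Uploading entries individually also means users querying with tools like BigQuery can load the Avro chunks directly, without unzipping anything first.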
This is currently rough code, meant for exploring how the process could work.