VJalili commented Oct 19, 2025

Description of changes:

This PR adds a YAML file that describes the Bitcoin Graph dataset resource.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

VJalili marked this pull request as ready for review on October 28, 2025, 22:58

Contributor

pschmied commented Nov 5, 2025

Hi @VJalili, fantastic start on your 101 tutorial notebook! Our assumption is that, for the release version, you'll repoint the examples to work from the full corpus you're making available on AWS. Other than that, looks great.

Author

VJalili commented Nov 6, 2025

Hi @pschmied, thanks for the review!

Your assumption is correct. To reiterate, the Bitcoin graph we're making available on AWS is a single large-scale graph (>2.4B nodes and ~40B edges). A common practice for training ML models on a graph this large is to train on sampled communities. The 101 tutorial focuses on pre-sampled communities, which let the ML community quickly explore the dataset and "smoke test" its compatibility with various graph neural network architectures. The pre-sampled communities will be hosted on AWS, and for the release we will update the links in the notebook to point to AWS buckets. We'll also update the notebook to guide users toward using batches of the data (i.e., independent sub-graphs in TSV files).

The dataset is also prepared for use in graph databases (e.g., Neo4j or Amazon Neptune). We recommend that the community load the dataset into a graph database, since that gives them the option of sampling application-specific communities for their ML pipeline (we provide both methods and tutorials on this page). Because this use case involves specialized graph databases, operates on ~1TB of data, and takes days to run, we provide dedicated documentation and guidelines, and these resources will also point to the dataset on AWS (e.g., on the data release page).

Author

VJalili commented Nov 11, 2025

@pschmied I prepared a more comprehensive notebook that covers all the data hosted on this dataset's AWS bucket. Here is the link to the notebook: https://github.com/B1AAB/GraphStudio/blob/main/g101/g101.ipynb

If you find this more comprehensive and focused than the other, I can update the link in the YAML file to refer to g101.

@pschmied
Contributor

@VJalili I love it—perhaps combine the content? We really do want to make sure the community challenge question / problem remains. In general, the more data providers can demonstrate opinionated usage of a given dataset, the more help it is to would-be data users. Really appreciate your efforts here!

Author

VJalili commented Nov 11, 2025

@pschmied Glad you liked it!

> perhaps combine the content?

I like that, we can merge.

> We really do want to make sure the community challenge question / problem remains.

Are you referring to the question "Q: What is one question that you have answered using these data? Can you show us how you came to that answer?" Also, does it need to be in the same words, or can we rephrase it to better match the dataset?

> In general, the more data providers can demonstrate opinionated usage of a given dataset, the more help it is to would-be data users.

Sure! We can keep Kaggle as the alternative option.

@pschmied
Contributor

I was thinking more of the last question:

Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

You are doing a great job of illustrating things you have done / can do with the data.

And no, we're not wed to the literal template format. We generally want a basic intro notebook to have those elements, but we intentionally left room for improvement / expansion :-)

Author

VJalili commented Nov 12, 2025

I like that, it will be very helpful, thanks @pschmied

Please take a look at the updated notebook in the following PR; warmly appreciate all feedback!

B1AAB/GraphStudio#1
