Bitcoin graph dataset #2912
|
Hi @VJalili, fantastic start on your 101 tutorial notebook! Our assumption is that, for the release version, you'll repoint the examples to work from the full corpus you're making available on AWS. Other than that, looks great. |
|
Hi @pschmied, thanks for the review! Your assumption is correct. To reiterate, the Bitcoin graph we're making available on AWS is a single large-scale graph (>2.4B nodes and ~40B edges). A common practice for training ML models on a graph of this size is to train on sampled communities. The 101 tutorial focuses on pre-sampled communities, which let the ML community quickly explore the dataset and "smoke test" its compatibility with various graph neural network architectures. The pre-sampled communities will be hosted on AWS, and for the release we will update the links in the notebook to point to buckets on AWS. We'll also update the notebook to guide users toward working with batches of the data (i.e., independent sub-graphs in TSV files).
The dataset is also prepared for use in graph databases (e.g., Neo4j or Amazon Neptune). We recommend that the community load the dataset into a graph database, because that makes it possible to sample application-specific communities for an ML pipeline (we provide both methods and tutorials on this page). Since this use case involves specialized graph databases, runs on ~1TB of data, and takes days to complete, we provide dedicated documentation and guidelines, and these resources will also point to the dataset on AWS (e.g., on the data release page). |
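As a rough illustration of the batch workflow described above (independent sub-graphs stored as TSV files), here is a minimal sketch of reading one batch into an adjacency map. The column names `source`, `target`, and `value`, and the helper `load_edges`, are assumptions made for this example; they are not taken from the dataset's actual schema, so check the released files before reusing it.

```python
import csv
import io
from collections import defaultdict


def load_edges(tsv_file):
    """Read a TSV edge list and return {source: [(target, value), ...]}.

    Assumes a header row with hypothetical columns: source, target, value.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    adjacency = defaultdict(list)
    for row in reader:
        adjacency[row["source"]].append((row["target"], float(row["value"])))
    return dict(adjacency)


# Tiny in-memory stand-in for one pre-sampled community file.
sample = io.StringIO(
    "source\ttarget\tvalue\n"
    "a\tb\t0.5\n"
    "a\tc\t1.25\n"
    "b\tc\t2.0\n"
)
graph = load_edges(sample)

# Count distinct nodes (sources and targets) and total edges in the batch.
nodes = set(graph)
for neighbors in graph.values():
    nodes.update(target for target, _ in neighbors)
node_count = len(nodes)
edge_count = sum(len(neighbors) for neighbors in graph.values())
```

Because each batch is an independent sub-graph, a loop over files can call `load_edges` once per file and hand each result to a model without merging batches.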
|
@pschmied I prepared a more comprehensive notebook that covers all the data hosted in this dataset's AWS bucket. Here is the link to the notebook: https://github.com/B1AAB/GraphStudio/blob/main/g101/g101.ipynb. If you find it more comprehensive and focused than the other one, I can update the link in the YAML file to refer to g101. |
|
@VJalili I love it—perhaps combine the content? We really do want to make sure the community challenge question / problem remains. In general, the more data providers can demonstrate opinionated usage of a given dataset, the more help it is to would-be data users. Really appreciate your efforts here! |
|
@pschmied Glad you liked it!
I like that, we can merge.
Are you referring to the
Sure! We can keep Kaggle as the alternative option. |
|
I was thinking more of the last question:
You are doing a great job of illustrating things you have done / can do with the data. And no, we're not wed to the literal template format. We generally want a basic intro notebook to have those elements, but we intentionally left room for improvement / expansion :-) |
|
I like that; it will be very helpful. Thanks, @pschmied! Please take a look at the updated notebook in the following PR; I warmly appreciate all feedback! |
Description of changes:
This PR adds a YAML file that describes the Bitcoin Graph dataset resource.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.