Dataset should support binary Buffer payload #3211
Hello, and thank you for your interest in this project! Because of the Dataset implementation, you cannot store binary data directly in it — the Dataset backend (both local and on the Apify Platform) stores the individual Dataset items as serialized JSON objects. If you want to store binary data from the Crawlee crawlers, you can use the Key-Value Store class (docs). This is a Crawlee-native S3-like storage that allows you to store arbitrary (binary) data under string keys. Alternatively, if you have to store the data in the dataset, you could serialize the buffer yourself (a JSON-serialized Buffer can later be restored with Buffer.from).

I'll close this discussion as solved, but feel free to ask additional questions if you have any. Cheers!
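For illustration, a minimal sketch of the key-value store approach inside a request handler (the store name, key scheme, and URL are made up for this example):

```ts
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Playwright returns the screenshot as a Buffer.
        const screenshot = await page.screenshot({ fullPage: true });

        // Key-value stores accept arbitrary binary values under string keys,
        // so the Buffer can be stored as-is, along with a content type.
        const store = await KeyValueStore.open('screenshots');
        await store.setValue(`screenshot-${request.id}`, screenshot, {
            contentType: 'image/png',
        });
    },
});

await crawler.run(['https://example.com']);
```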
@barjin, thanks for the reply!
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/core
Feature
It would be great to have a plugin or API that allows accessing the binary data directly, or at least the option to store it in a Dataset without serialization.
I see that the current interface allows restoring data via Buffer.from, but I'm not sure about the efficiency of this approach or the acceptable data size it can handle.
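For illustration, a minimal sketch of that round-trip (the payload is made up): JSON-serializing a Buffer turns it into a `{ type: 'Buffer', data: [...] }` object, which expands every byte into a decimal number in the JSON text:

```ts
import { Dataset } from 'crawlee';

// Pushing an object containing a Buffer JSON-serializes the Buffer
// into { "type": "Buffer", "data": [ ...one number per byte... ] }.
const screenshot = Buffer.from('pretend these are PNG bytes');
const dataset = await Dataset.open();
await dataset.pushData({ url: 'https://example.com', screenshot });

// Reading the item back yields the plain serialized object;
// Buffer.from() can rebuild the Buffer from its `data` array.
const { items } = await dataset.getData();
const restored = Buffer.from(items[0].screenshot.data);
```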
Motivation
I’d like to share my use case.
I have the Crawlee logic extracted into a separate package, and the result of this package’s work is a screenshot.
Right now, I have to upload it to S3 directly inside the requestHandler, which feels like a terrible anti-pattern.
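For context, a rough sketch of what that workaround looks like, assuming @aws-sdk/client-s3 and a made-up bucket name:

```ts
import { PlaywrightCrawler } from 'crawlee';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const screenshot = await page.screenshot({ fullPage: true });

        // The upload happens inside the crawling logic itself,
        // which is the coupling this feature request wants to avoid.
        await s3.send(new PutObjectCommand({
            Bucket: 'my-screenshots-bucket', // made-up bucket name
            Key: `${encodeURIComponent(request.uniqueKey)}.png`,
            Body: screenshot,
            ContentType: 'image/png',
        }));
    },
});
```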
Ideal solution or implementation, and any additional constraints
The most straightforward solution would be to provide an interface for working with S3.
Alternative solutions or implementations
No response
Other context
No response