
Commit 255bacf

Merge pull request #31 from patrickfleith/26-text-dataset-generation-documentation (26 text dataset generation documentation)

2 parents 86e6893 + 057806b, commit 255bacf

4 files changed: 348 additions & 12 deletions

New file (54 additions & 0 deletions):

```python
from datafast.datasets import TextDataset
from datafast.schema.config import TextDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GoogleProvider


def main():
    # 1. Configure the dataset generation
    config = TextDatasetConfig(
        document_types=["tech journalism blog", "personal blog", "MSc lecture notes"],
        topics=["artificial intelligence", "cybersecurity"],
        num_samples_per_prompt=3,
        output_file="tech_posts.jsonl",
        languages={"en": "English", "fr": "French"},
        prompts=[
            (
                "Generate {num_samples} {document_type} entries in {language_name} about {topic}. "
                "The emphasis should be a perspective from {{country}}"
            )
        ],
        expansion=PromptExpansionConfig(
            placeholders={
                "country": ["United States", "Europe", "Japan", "India"]
            },
            combinatorial=True,
        )
    )

    # 2. Create LLM providers with specific models
    providers = [
        OpenAIProvider(model_id="gpt-4o-mini"),
        AnthropicProvider(model_id="claude-3-5-haiku-latest"),
    ]

    # 3. Generate the dataset
    dataset = TextDataset(config)
    dataset.generate(providers)

    # 4. Push to HF hub (optional)
    USERNAME = "your_huggingface_username"
    DATASET_NAME = "your_dataset_name"
    url = dataset.push_to_hub(
        repo_id=f"{USERNAME}/{DATASET_NAME}",
        train_size=0.8,  # for an 80/20 train/test split, otherwise omit
        seed=20250304,
        shuffle=True,
    )
    print(f"\nDataset pushed to Hugging Face Hub: {url}")


if __name__ == "__main__":
    from dotenv import load_dotenv

    load_dotenv("secrets.env")
    main()
```
New file (275 additions & 0 deletions):

# How to Create a Raw Text Dataset

We'll create a raw text dataset for use as part of a pre-training corpus, with the following characteristics:

* Multi-document type: generate different types of documents (blogs, lecture notes, etc.)
* Multi-topic: generate texts on various topics (AI, cloud computing, etc.)
* Multi-lingual: generate texts in several languages
* Multi-LLM: generate texts using multiple LLM providers to boost diversity
* Optionally, push the dataset to the Hugging Face Hub

!!! note
    In this guide we generate raw text without using personas or seed texts; we only specify document types, topics, and languages. Generating synthetic data from personas or seed texts is a common use case that is on our roadmap but not yet available.
14+
## Step 1: Import Required Modules
15+
16+
Generating a dataset with `datafast` requires 3 types of imports:
17+
18+
* Dataset
19+
* Configs
20+
* LLM Providers
21+
22+
```python
23+
from datafast.datasets import TextDataset
24+
from datafast.schema.config import TextDatasetConfig, PromptExpansionConfig
25+
from datafast.llms import OpenAIProvider, AnthropicProvider, GoogleProvider
26+
```
27+
28+
In addition, we'll use `dotenv` to load environment variables containing API keys.
29+
```python
30+
from dotenv import load_dotenv
31+
32+
# Load environment variables containing API keys
33+
load_dotenv("secrets.env")
34+
```
35+
36+
Make sure you have created a secrets.env file with your API keys. HF token is needed if you want to push the dataset to your HF hub. Other keys depend on which LLM providers you use. In our example, we use Google, OpenAI, and Anthropic.
37+
38+
```
39+
GOOGLE_API_KEY=XXXX
40+
OPENAI_API_KEY=sk-XXXX
41+
ANTHROPIC_API_KEY=XXXXX
42+
HF_TOKEN=XXXXX
43+
```
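Since a missing key otherwise only surfaces mid-generation, it can be worth verifying up front that every key you need was actually loaded. A small defensive sketch (`check_api_keys` is a hypothetical helper, not part of `datafast`):

```python
import os


def check_api_keys(required):
    """Return the names of required API keys missing from the environment."""
    return [key for key in required if not os.getenv(key)]


missing = check_api_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"])
if missing:
    print(f"Missing API keys: {missing} - add them to secrets.env")
```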
## Step 2: Configure Your Dataset

The `TextDatasetConfig` class defines all parameters for your text generation dataset.

- **`document_types`**: List of document types to generate (e.g., "tech journalism blog", "personal blog", "MSc lecture notes").

- **`topics`**: List of topics to generate content about (e.g., "technology", "artificial intelligence", "cloud computing").

- **`num_samples_per_prompt`**: Number of examples to generate in a single LLM call.

!!! note
    We recommend a small number (like 5) for text generation, as these outputs tend to be long.
    Use an even smaller number (like 2-3) if you generate very long texts of 300+ words.

- **`output_file`**: Path where the generated dataset will be saved (JSONL format).

- **`languages`**: Dictionary mapping language codes to language names (e.g., `{"en": "English", "fr": "French"}`). You can use any language code and name you want, but make sure the LLM providers you use support the languages you request.

- **`prompts`**: (Optional) Custom prompt templates.
    - **Mandatory placeholders**: When providing a custom prompt for `TextDatasetConfig`, you must always include the following placeholders, written with **single curly braces**:
        - `{num_samples}`: filled from the `num_samples_per_prompt` parameter defined above
        - `{language_name}`: filled from the `languages` parameter defined above
        - `{document_type}`: filled from the `document_types` parameter defined above
        - `{topic}`: filled from the `topics` parameter defined above
    - **Optional placeholders**: These can be used to expand the diversity of your dataset. They are optional, but help you create a more diverse dataset. They must be written with **double curly braces**.
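To see why the two brace styles matter, here is a sketch of the two-stage substitution using plain `str.format` (illustrative only; `datafast`'s internals may differ):

```python
# Single braces are filled on the first formatting pass; double braces
# escape to a literal single-brace placeholder that survives the pass.
template = (
    "Generate {num_samples} {document_type} entries in {language_name} about {topic}. "
    "The emphasis should be a perspective from {{country}}."
)

stage1 = template.format(
    num_samples=3,
    document_type="personal blog",
    language_name="English",
    topic="artificial intelligence",
)
print(stage1)  # ...ends with "a perspective from {country}."

# A later expansion pass can then fill the surviving placeholder:
print(stage1.format(country="Japan"))  # ...ends with "a perspective from Japan."
```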
Here's a basic configuration example:

```python
config = TextDatasetConfig(
    # Types of documents to generate
    document_types=["tech journalism blog", "personal blog", "MSc lecture notes"],

    # Topics to generate content about
    topics=["technology", "artificial intelligence", "cloud computing"],

    # Number of examples to generate per prompt
    num_samples_per_prompt=3,

    # Output file path
    output_file="tech_posts.jsonl",

    # Languages to generate data for
    languages={"en": "English", "fr": "French"},

    # Custom prompts (optional - otherwise defaults will be used)
    # ...
)
```
## Step 3: Prompt Expansion for Diverse Examples (Optional)

Prompt expansion is a key concept in the `datafast` library. It generates multiple variations of a base prompt to increase the diversity of the generated data.

For example, we added one optional placeholder using double curly braces:

* `{{country}}`: to generate texts that discuss cloud computing or AI from different perspectives (e.g., "United States", "Canada", "Europe")

You can configure prompt expansion like this:

```python
config = TextDatasetConfig(
    # Basic configuration as above
    # ...

    # Custom prompt with placeholders (this overrides the default prompt).
    # Watch out: you must include the mandatory placeholders defined above.
    prompts=[
        (
            "Generate {num_samples} {document_type} entries in {language_name} about {topic}. "
            "The emphasis should be a perspective from {{country}}."
        )
    ],

    # Add prompt expansion configuration
    expansion=PromptExpansionConfig(
        placeholders={
            "country": ["United States", "Europe", "Japan", "India", "China", "Australia"]
        },
        combinatorial=True,  # Generate all combinations
        num_random_samples=100  # Only used if combinatorial is False: then 100 prompts are sampled at random
    )
)
```

This expansion creates prompt variations by replacing `{{country}}` with each of the provided values, dramatically increasing the diversity of your dataset.
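Conceptually, combinatorial expansion is a Cartesian product over all optional placeholder values. A rough sketch (not the library's actual code):

```python
from itertools import product

# Mandatory single-brace fields are filled first, then every combination of
# optional placeholder values is applied to the partly filled prompt.
base = "Write about {topic} from a {{country}} perspective."
topics = ["AI", "cybersecurity"]
placeholders = {"country": ["United States", "Europe", "Japan"]}

expanded = []
for topic in topics:
    partly_filled = base.format(topic=topic)  # {{country}} survives as {country}
    for combo in product(*placeholders.values()):
        values = dict(zip(placeholders.keys(), combo))
        expanded.append(partly_filled.format(**values))

print(len(expanded))  # 2 topics x 3 countries = 6 prompt variations
```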
## Step 4: Set Up LLM Providers

Configure one or more LLM providers to generate your dataset:

```python
providers = [
    OpenAIProvider(model_id="gpt-4o-mini"),
    AnthropicProvider(model_id="claude-3-5-haiku-latest"),
    GoogleProvider(model_id="gemini-1.5-flash")
]
```

Using multiple providers helps create more diverse and robust datasets.
## Step 5: How Many Instances Will It Generate?

In combinatorial mode, the number of generated instances is the product of:

- the number of document types (3 in our example)
- the number of topics (3 in our example)
- the number of languages (2 in our example)
- the number of samples per prompt (3 in our example)
- the number of LLM providers (3 in our example)
- the number of values for each optional placeholder, if using prompt expansion (6 for `{{country}}` in our example)

With these numbers, and without prompt expansion, we'd generate 3 × 3 × 2 × 3 × 3 = 162 instances.

With prompt expansion we further multiply by the number of combinations from the optional placeholders (here 6): 3 × 3 × 2 × 3 × 3 × 6 = 972 instances.

If that seems sufficient and representative of your use case, we can proceed to generate the dataset.

In the real world this is far too small a dataset for pre-training, but it can be one tiny slice of a much larger pre-training text corpus.
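The arithmetic above, as a quick sanity check you can adapt to your own configuration:

```python
# Instance count for this guide's configuration.
num_document_types = 3
num_topics = 3
num_languages = 2
num_samples_per_prompt = 3
num_providers = 3
num_country_values = 6  # values of the optional {{country}} placeholder

without_expansion = (num_document_types * num_topics * num_languages
                     * num_samples_per_prompt * num_providers)
with_expansion = without_expansion * num_country_values
print(without_expansion, with_expansion)  # 162 972
```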
## Step 6: Generate the Dataset

Now you can create and generate your dataset:

```python
# Initialize dataset with your configuration
dataset = TextDataset(config)

# Generate examples using configured providers
dataset.generate(providers)
```

This will:

1. Initialize a dataset with your configuration
2. For each combination of document type, topic, and language:
    - Create base prompts
    - Expand prompts with the configured variations (if provided)
    - Call each LLM provider with each expanded prompt
3. Save the dataset to the specified output file
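The output is plain JSONL, so you can inspect it with the standard library alone. A small helper (hypothetical, not part of `datafast`) to read the file back:

```python
import json


def load_jsonl(path):
    """Load a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# e.g. rows = load_jsonl("tech_posts.jsonl"); print(rows[0])
```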
## Step 7: Push to Hugging Face Hub (Optional)

After generating your dataset, you can push it to the Hugging Face Hub for sharing and version control:

```python
USERNAME = "your_huggingface_username"  # <--- Your Hugging Face username
DATASET_NAME = "your_dataset_name"  # <--- Your Hugging Face dataset name
url = dataset.push_to_hub(
    repo_id=f"{USERNAME}/{DATASET_NAME}",
    train_size=0.8,  # for an 80/20 train/test split, otherwise omit
    seed=20250304,
    shuffle=True,
)
print(f"\nDataset pushed to Hugging Face Hub: {url}")
```

You don't need to specify `train_size` or `seed` if you don't want train/test splitting: if they are not provided, the entire dataset is pushed without a split.

Make sure you have set your `HF_TOKEN` in the environment variables.
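For intuition, the split behaves roughly like a seeded shuffle followed by a proportional cut. An illustrative sketch of those semantics (a hypothetical helper, not `datafast`'s actual implementation):

```python
import random


def train_test_split(rows, train_size=0.8, seed=20250304, shuffle=True):
    """Sketch of split semantics: shuffle deterministically, then cut."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_size)
    return rows[:cut], rows[cut:]


train, test = train_test_split([{"text": f"doc {i}"} for i in range(10)])
print(len(train), len(test))  # 8 2
```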
## Complete Example

Here's a complete example script that generates a text dataset across multiple document types, topics, and languages:

```python
from datafast.datasets import TextDataset
from datafast.schema.config import TextDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GoogleProvider


def main():
    # 1. Configure the dataset generation
    config = TextDatasetConfig(
        document_types=["tech journalism blog", "personal blog", "MSc lecture notes"],
        topics=["technology", "artificial intelligence", "cloud computing"],
        num_samples_per_prompt=3,
        output_file="tech_posts.jsonl",
        languages={"en": "English", "fr": "French"},
        prompts=[
            (
                "Generate {num_samples} {document_type} entries in {language_name} about {topic}. "
                "The emphasis should be a perspective from {{country}}"
            )
        ],
        expansion=PromptExpansionConfig(
            placeholders={
                "country": ["United States", "Europe", "Japan", "India", "China", "Australia"]
            },
            combinatorial=True,
        )
    )

    # 2. Create LLM providers with specific models
    providers = [
        OpenAIProvider(model_id="gpt-4o-mini"),
        AnthropicProvider(model_id="claude-3-5-haiku-latest"),
        GoogleProvider(model_id="gemini-1.5-flash"),
    ]

    # 3. Generate the dataset
    dataset = TextDataset(config)
    dataset.generate(providers)

    # 4. Push to HF hub (optional)
    # USERNAME = "your_huggingface_username"
    # DATASET_NAME = "your_dataset_name"
    # url = dataset.push_to_hub(
    #     repo_id=f"{USERNAME}/{DATASET_NAME}",
    #     train_size=0.8,  # for an 80/20 train/test split, otherwise omit
    #     seed=20250304,
    #     shuffle=True,
    # )
    # print(f"\nDataset pushed to Hugging Face Hub: {url}")


if __name__ == "__main__":
    from dotenv import load_dotenv

    load_dotenv("secrets.env")
    main()
```
## Conclusion

With `datafast`, you can easily generate diverse text datasets across multiple document types, topics, and languages, using multiple LLM providers. The generated datasets are saved in JSONL format and can be pushed to the Hugging Face Hub for sharing and version control.

The `TextDataset` class provides a simple interface for generating text data, while the `TextDatasetConfig` class lets you configure the generation process in detail. With prompt expansion, you can create even more diverse datasets by generating multiple variations of your base prompts.

🚀 There is more to come, with new features for generating raw text datasets from seed texts, and for using personas to increase diversity.

docs/index.md (16 additions & 11 deletions):

````diff
@@ -10,25 +10,18 @@ It is designed **to help you get the data you need** to:
 !!! warning
     This library is in its early stages of development and might change significantly.

-### Key Features
-
-* **Easy-to-use** and simple interface 🚀
-* **Multi-lingual** datasets generation 🌍
-* **Multiple LLMs** used to boost dataset diversity 🤖
-* **Flexible prompt**: default or custom 📝
-* **Prompt expansion** to maximize diversity 🔄
-* **Hugging Face Integration**: Push generated datasets to the Hub, soon to argilla 🤗
-
 ## Supported Dataset Types

 Currently we support the following dataset types:

 - ✅ Text Classification
 - ✅ Raw Text Generation
 - ✅ Instruction Dataset
-    - UltraChat
+    - UltraChat method
 - 📋 More coming soon!

+⭐️ Star me if this is something you like!
+
 ## Quick Start

 ### 1. Environment Setup
@@ -109,9 +102,21 @@ dataset.push_to_hub(
 )
 ```

+### Key Features
+
+* **Easy-to-use** and simple interface 🚀
+* **Multi-lingual** datasets generation 🌍
+* **Multiple LLMs** used to boost dataset diversity 🤖
+* **Flexible prompt**: default or custom 📝
+* **Prompt expansion** to maximize diversity 🔄
+* **Hugging Face Integration**: Push generated datasets to the Hub, soon to argilla 🤗
+
 ## Next Steps

-* Check out our full example guide on [How to Generate a Text Classification Dataset](guides/generating_text_classification_datasets.md)
+Check out our guides for different dataset types:
+
+* [How to Generate a Text Classification Dataset](guides/generating_text_classification_datasets.md)
+* [How to Create a Raw Text Dataset](guides/generating_text_datasets.md)
 * Visit our [GitHub repository](https://github.com/patrickfleith/datafast) for the latest updates

 ## Creator
````

mkdocs.yml (3 additions & 1 deletion):

```diff
@@ -55,7 +55,9 @@ markdown_extensions:
 nav:
   - Home: index.md
   # - Getting Started: getting_started.md
-  - Text Classification Example: guides/generating_text_classification_datasets.md
+  - How To Guides:
+    - Text Classification: guides/generating_text_classification_datasets.md
+    - Text Generation: guides/generating_text_datasets.md
   - API Reference: api.md

 # Plugins
```
