A NeMo Data Designer plugin scaffold for generating realistic patient text messages to a configurable state-based primary-care clinic using persona seed data from the Nemotron Personas dataset. The default clinic context in this repo is Florida.
This scaffold is built around four ideas:
- Office workflow coverage: an extensive message_types.yaml that spans scheduling, labs, meds, referrals, monitoring, forms, preventive care, billing, records, and triage.
- Clinical coverage: an extensive conditions.yaml with common primary-care conditions and symptom clusters.
- Emergency modeling: a sampled is_emergency boolean plus an emergency_band axis (subtle, moderate, obvious) so urgent-positive rows can include red flags that range from easy-to-miss to clearly emergent.
- Persona-aware phrasing: the main prompt uses persona, healthcare persona, profession, and Big Five traits to change what details patients volunteer and how they sound in SMS.
config/
  clinic_context.yaml
  message_types.yaml
  conditions.yaml
  emergency_axis.yaml
prompts/
  main_prompt.md
  label_candidates_prompt.md
  message_types/*.md
  conditions/*.md
src/patient_messages/
  clinic_context.py
  config.py
  generator.py
  label_columns.py
  library.py
  plugin.py
  seed_builder.py
src/patient_messages_workflow/
  builder.py
  run.py
  upload.py
scripts/
  download_personas_from_ngc.sh
  prepare_seed_manifest.py
examples/
  run_preview_from_ngc_seed.py
The prepared seed manifest includes the full routing metadata used during generation, including the internal *_id fields.
The published final synthetic dataset keeps:
- patient_message
- candidate_labels
- message_type_name
- workflow_bucket
- condition_name
- is_emergency
- emergency_band
- selected_emergency_profile_title
- persona columns from the Nemotron Personas seed data
The final dataset intentionally drops duplicate message_type_id, condition_id, and selected_emergency_profile_id columns because the paired name/title fields carry the same information more readably.
Emergency-positive rows are not just dramatic trauma cases. The prompt assets are intentionally biased toward realistic routing failures and subtle warning patterns such as:
- slowly worsening abdominal pain with weight loss or black stools
- chest discomfort written off as reflux or stress
- changing or bleeding skin lesions that the patient has ignored
- progressive fatigue, dizziness, or dyspnea the patient keeps working through
- worsening mental-health messages that sound restrained instead of dramatic
The emergency_band config defaults to:
- subtle: 55%
- moderate: 25%
- obvious: 20%
You can tune those weights in config/emergency_axis.yaml.
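As a rough illustration of what those weights mean in practice, the sketch below draws bands with Python's standard `random.choices`. The dictionary layout and the function name `sample_emergency_band` are assumptions made for this example; they are not the actual schema of config/emergency_axis.yaml or the repo's sampling code.

```python
import random
from collections import Counter

# Default weights quoted in this README. The dict layout is an assumption,
# not the real schema of config/emergency_axis.yaml.
EMERGENCY_BAND_WEIGHTS = {"subtle": 0.55, "moderate": 0.25, "obvious": 0.20}


def sample_emergency_band(rng, weights=EMERGENCY_BAND_WEIGHTS):
    """Draw one band name with probability proportional to its weight."""
    bands = list(weights)
    return rng.choices(bands, weights=[weights[b] for b in bands], k=1)[0]


# Draw many bands with a fixed seed and tally how often each one appears.
rng = random.Random(7)
band_counts = Counter(sample_emergency_band(rng) for _ in range(10_000))
```

With the default weights, roughly half of the emergency-positive draws land in the subtle band, which matches the stated bias toward easy-to-miss warning patterns.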
uv venv
source .venv/bin/activate
uv sync
cp .env.example .env

Fill in .env with the keys you actually use. In this repo:
- OPENROUTER_API_KEY is used for OpenRouter model calls.
- GEMINI_API_KEY is used for Gemini model calls.
- NGC_API_KEY is used by the persona download script if present.
- HF_TOKEN is used by the Hugging Face upload script if present.
Download the Nemotron Personas dataset
./scripts/download_personas.sh

By default the script downloads:
nvidia/nemotron-personas/nemotron-personas-dataset-en_us:0.0.2
Behavior notes:
- No NGC CLI is required; the script downloads directly from the NGC API with curl.
- The script automatically loads .env if it exists.
- If guest access is unavailable, the script uses NGC_API_KEY from .env or your shell environment.
- The default destination is repo-local data/raw/.
- For en_US, the default full-file path is data/raw/nemotron-personas-dataset-en_us/en_US.parquet.
- If that file already exists, the script skips it and does not re-download it.
- You can override the destination with PERSONAS_DEST=/some/path ./scripts/download_personas.sh.
- scripts/download_personas_from_ngc.sh remains as a compatibility wrapper around the new script.
The seed-builder combines each persona row with sampled workflow, condition, and emergency axes.
Current sampling behavior:
- load the persona table from parquet/jsonl/json/csv,
- optionally filter persona rows by state,
- randomly sample sample_size persona rows with pandas.DataFrame.sample(...),
- expand each matched persona into rows_per_persona synthetic message scenarios,
- sample message type, condition, is_emergency, and emergency_band from the YAML-configured weights.
By default, the repo reads config/clinic_context.yaml and uses clinic.state as the persona-state filter. Out of the box, that means only Florida personas are used unless you override the filter.
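The filter, sample, and expand steps can be sketched in a few lines. This is a minimal standard-library illustration, not the code in seed_builder.py: it swaps `random.Random.sample` in for `pandas.DataFrame.sample(...)`, and the function name `build_seed_rows` plus the `persona_id`, `state`, and `scenario_index` field names are hypothetical.

```python
import random


def build_seed_rows(personas, state, sample_size, rows_per_persona, seed):
    """Filter personas by state, sample a subset, and expand each into scenarios."""
    rng = random.Random(seed)
    # Step 1: optional state filter (state=None keeps the full persona table).
    matched = [p for p in personas if state is None or p.get("state") == state]
    # Step 2: sample without replacement, capped at the number of matches.
    chosen = rng.sample(matched, min(sample_size, len(matched)))
    # Step 3: emit rows_per_persona scenario rows for each sampled persona.
    rows = []
    for persona in chosen:
        for scenario_index in range(rows_per_persona):
            rows.append({**persona, "scenario_index": scenario_index})
    return rows


# Toy persona table; the field names are illustrative only.
personas = [
    {"persona_id": 1, "state": "Florida"},
    {"persona_id": 2, "state": "California"},
    {"persona_id": 3, "state": "Florida"},
    {"persona_id": 4, "state": "Florida"},
]
seed_rows = build_seed_rows(
    personas, state="Florida", sample_size=2, rows_per_persona=2, seed=7
)
```

Each resulting row would then receive a sampled message type, condition, is_emergency flag, and emergency_band before being written to the seed manifest.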
Example:
uv run python scripts/prepare_seed_manifest.py --personas-path data/raw/nemotron-personas-dataset-en_us --output-path data/seed/primary_care_persona_message_seed.parquet --sample-size 500 --rows-per-persona 2 --seed 7

Notes:
- --sample-size lets you preview on a subset of personas.
- --rows-per-persona lets you produce multiple message scenarios per persona.
- --personas-path can point to a parquet/jsonl/json/csv file or a directory that contains one.
- --persona-state overrides the default clinic-state filter. Example: --persona-state California.
- --no-persona-state-filter disables state filtering and samples from the full persona table.
- config/clinic_context.yaml controls the default clinic description and default state filter used by the repo examples.
uv run python examples/run_preview_from_ngc_seed.py

The preview script:
- loads the prepared seed manifest,
- uses the plugin column primary-care-prompt-bundle to render the final message-generation prompt,
- generates patient_message with an LLM text column,
- generates five hidden intermediate label columns with LLM text columns,
- combines those five hidden columns into the final candidate_labels list column with a custom Data Designer column.
Use the uv entrypoint when you want a real dataset build instead of a preview. This command:
- loads the persona dataset,
- applies the default or overridden persona-state filter,
- prepares the seed manifest automatically,
- configures Data Designer to use the provider/model you selected,
- generates the final dataset artifact folder.
Examples:
uv run patient-messages-run \
--provider openrouter \
--llm-id openai/gpt-5-nano \
--max-parallel-requests 4 \
--dataset-name patient-messages-florida \
--num-records 1000

uv run patient-messages-run \
--provider gemini \
--provider-endpoint https://generativelanguage.googleapis.com/v1beta/openai/ \
--provider-type openai \
--api-key-env GEMINI_API_KEY \
--llm-id gemini-3.1-flash-lite-preview \
--max-parallel-requests 4 \
--dataset-name patient-messages-florida \
--num-records 1000

Optional controls:
- --persona-state California overrides the default clinic-state filter.
- --no-persona-state-filter samples from the full persona table.
- --sample-size and --rows-per-persona control seed-manifest expansion before the final trim to --num-records.
- --temperature, --top-p, --max-tokens, --max-parallel-requests, and --timeout tune the selected model call.
- --provider-endpoint, --provider-type, and --api-key-env let you define a non-built-in provider or override a built-in provider.
- --max-parallel-requests defaults to 4. This repo keeps that default for both OpenRouter and Gemini. Google’s Gemini docs publish project-level RPM/TPM/RPD quotas, not a provider-wide safe concurrency number, and preview models can be more restricted. If you hit 429 or quota errors on Gemini, lower --max-parallel-requests first.
Resume behavior:
- Runs are stored under data/runs/<dataset-name>/<run-id>/.
- The run id is derived from the normalized CLI spec plus a digest of the repo inputs (config/, prompts/, src/patient_messages/, pyproject.toml) and the persona source file metadata.
- Re-running the same effective command reuses the existing completed output instead of regenerating it.
- If a previous run failed, the workflow reuses the saved seed_manifest.parquet and reruns generation.
- Pass --reset to delete the matching resumable run and rebuild from scratch.
Each run directory contains:
- run_spec.json
- run_state.json
- seed_manifest.parquet
- artifacts/<dataset-name>/...
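One way to picture the run-id derivation is a content hash over a canonicalized spec. The sketch below is an assumption-heavy illustration, not the workflow's actual code: the helper names (`normalize_spec`, `digest_inputs`, `derive_run_id`), the choice of SHA-256, the 12-character truncation, and hashing file contents rather than file metadata are all hypothetical.

```python
import hashlib
import json


def normalize_spec(cli_args):
    """Drop unset options and sort keys so equivalent commands compare equal."""
    return {key: cli_args[key] for key in sorted(cli_args) if cli_args[key] is not None}


def digest_inputs(files):
    """Hash a mapping of relative path -> bytes in a stable, sorted order."""
    hasher = hashlib.sha256()
    for name in sorted(files):
        hasher.update(name.encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(files[name])
        hasher.update(b"\0")
    return hasher.hexdigest()


def derive_run_id(cli_args, inputs_digest, length=12):
    """Combine the normalized spec and the inputs digest into a short run id."""
    payload = json.dumps(
        {"spec": normalize_spec(cli_args), "inputs": inputs_digest},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


# Two spellings of the same effective command, plus one changed repo input.
inputs_a = digest_inputs({"config/emergency_axis.yaml": b"subtle: 0.55\n"})
inputs_b = digest_inputs({"config/emergency_axis.yaml": b"subtle: 0.60\n"})
run_id_1 = derive_run_id({"provider": "openrouter", "num_records": 1000}, inputs_a)
run_id_2 = derive_run_id(
    {"num_records": 1000, "provider": "openrouter", "timeout": None}, inputs_a
)
run_id_3 = derive_run_id({"provider": "openrouter", "num_records": 1000}, inputs_b)
```

Under this scheme, argument order and unset options do not change the id, while any edit to a tracked input yields a new run directory. That is consistent with the reuse and rebuild behavior described above.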
The installed data-designer package already includes a Hugging Face upload client. This repo adds a thin CLI wrapper around it so you can upload a completed Data Designer artifact folder with normal command-line args.
The uploader expects a Data Designer artifact directory that contains at least:
- metadata.json
- parquet-files/
- optionally builder_config.json
Example:
uv run patient-messages-upload \
--dataset-path /path/to/data-designer-output \
--org your-hf-org \
--dataset-name patient-messages-florida \
--description "Synthetic Florida primary-care SMS dataset" \
--tag healthcare \
--tag synthetic

Notes:
- Authentication uses --token, HF_TOKEN, or cached hf auth login credentials.
- The upload script automatically loads .env, so HF_TOKEN from that file works without extra shell setup.
- Uploads are public by default; add --private when you want a private dataset repo.
- The uploaded dataset keeps the seed/persona context columns alongside patient_message and candidate_labels, including workflow metadata such as message type, condition, and emergency fields.
- The uploader auto-generates README.md for the Hugging Face dataset card from the uploaded parquet files, including split info plus composition tables for message types, conditions, workflow buckets, emergency axes, and patient demographics when those columns are present.
- The script prints the final dataset URL on success.
The plugin adds a deterministic custom column type:
primary-care-prompt-bundle
Its only job is to render the final prompt by combining:
- message type prompt
- condition prompt
- emergency metadata
- persona snapshot
That keeps the actual text generation in standard Data Designer LLM columns.
Each file under prompts/message_types/ describes:
- the office workflow,
- what the patient is usually trying to accomplish,
- details that should appear when natural,
- workflow-specific style constraints.
Each file under prompts/conditions/ describes:
- common symptoms,
- routine details to mention,
- emergency escalation guidance,
- subtle/moderate/obvious urgent patterns.
The emergency sections are meant to be conditionally injected only when is_emergency=true.
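The conditional injection can be pictured as a deterministic string-assembly step. The function below is a hypothetical sketch: `render_prompt_bundle`, its parameters, and the section titles are assumptions for illustration and do not reflect the plugin's real column implementation or the Data Designer API.

```python
def render_prompt_bundle(
    message_type_prompt,
    condition_prompt,
    emergency_guidance,
    persona_snapshot,
    is_emergency,
    emergency_band=None,
):
    """Join prompt sections, including emergency guidance only for emergency rows."""
    sections = [
        ("Message type", message_type_prompt),
        ("Condition", condition_prompt),
    ]
    if is_emergency:
        # Emergency text is injected only when the sampled flag is true.
        sections.append((f"Emergency guidance ({emergency_band})", emergency_guidance))
    sections.append(("Persona", persona_snapshot))
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


# The same inputs rendered for a routine row and an emergency-positive row.
routine_prompt = render_prompt_bundle(
    "Refill request workflow.",
    "Hypertension: routine details.",
    "Mention chest pressure in passing.",
    "Retired teacher, brief texter.",
    is_emergency=False,
)
urgent_prompt = render_prompt_bundle(
    "Refill request workflow.",
    "Hypertension: routine details.",
    "Mention chest pressure in passing.",
    "Retired teacher, brief texter.",
    is_emergency=True,
    emergency_band="subtle",
)
```

Because no model call is involved, the same seed row always produces the same prompt text, which keeps the rendering step reproducible.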
prompts/main_prompt.md unifies:
- workflow/message type,
- condition,
- emergency axis,
- clinic context,
- persona snapshot.
prompts/label_candidates_prompt.md now generates one label at a time. The preview flow creates five dropped intermediate label columns and then assembles them into the final candidate_labels list column.
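The assembly step amounts to collecting the five intermediate values into one list. The sketch below assumes hypothetical column names (`label_1` through `label_5`) and adds whitespace trimming and case-insensitive de-duplication; those clean-up rules are assumptions, not documented behavior of label_columns.py.

```python
# Hypothetical names for the five dropped intermediate label columns.
LABEL_COLUMNS = [f"label_{i}" for i in range(1, 6)]


def assemble_candidate_labels(row, label_columns=LABEL_COLUMNS):
    """Collect per-column labels into one ordered list, skipping blanks and repeats."""
    labels = []
    seen = set()
    for column in label_columns:
        value = (row.get(column) or "").strip()
        key = value.casefold()
        if value and key not in seen:
            seen.add(key)
            labels.append(value)
    return labels


# Example row with a blank label and a case-variant duplicate.
example_row = {
    "label_1": "medication refill",
    "label_2": " Medication Refill ",
    "label_3": "",
    "label_4": "possible chest pain",
    "label_5": None,
}
candidate_labels = assemble_candidate_labels(example_row)
```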
Common edits you may want to make:
- change the default clinic state or description in config/clinic_context.yaml
- change the emergency-positive rate in config/emergency_axis.yaml
- change subtle vs obvious urgent prevalence in config/emergency_axis.yaml
- reweight message types in config/message_types.yaml
- reweight conditions in config/conditions.yaml
- override persona sampling geography with --persona-state or disable it with --no-persona-state-filter
- bias admin-only or condition-heavy messages by editing the seed-builder logic in seed_builder.py
- make the clinic more or less formal by editing prompts/main_prompt.md
- Download and inspect the raw persona data.
- Build a prepared seed manifest with your target sampling mix.
- Run Data Designer preview until the tone and routing realism look right.
- Generate a larger dataset with uv run patient-messages-run.
- Keep the gold columns (message_type_name, condition_name, workflow_bucket, is_emergency, emergency_band) for supervised training and evaluation.
This scaffold targets the current Data Designer plugin shape used by the open-source data-designer package and the dd. import style. If you pin a substantially older release, update the plugin entry point or imports to match that release.