ToadResearch/PatientMessages
Patient Messages

A NeMo Data Designer plugin scaffold for generating realistic patient text messages to a configurable state-based primary-care clinic using persona seed data from the Nemotron Personas dataset. The default clinic context in this repo is Florida.

This scaffold is built around four ideas:

  1. Office workflow coverage: an extensive message_types.yaml that spans scheduling, labs, meds, referrals, monitoring, forms, preventive care, billing, records, and triage.
  2. Clinical coverage: an extensive conditions.yaml with common primary-care conditions and symptom clusters.
  3. Emergency modeling: a sampled is_emergency boolean plus an emergency_band axis (subtle, moderate, obvious) so urgent-positive rows can include red flags that range from easy-to-miss to clearly emergent.
  4. Persona-aware phrasing: the main prompt uses persona, healthcare persona, profession, and Big Five traits to change what details patients volunteer and how they sound in SMS.

Project layout

config/
  clinic_context.yaml
  message_types.yaml
  conditions.yaml
  emergency_axis.yaml

prompts/
  main_prompt.md
  label_candidates_prompt.md
  message_types/*.md
  conditions/*.md

src/patient_messages/
  clinic_context.py
  config.py
  generator.py
  label_columns.py
  library.py
  plugin.py
  seed_builder.py

src/patient_messages_workflow/
  builder.py
  run.py
  upload.py

scripts/
  download_personas.sh
  download_personas_from_ngc.sh
  prepare_seed_manifest.py

examples/
  run_preview_from_ngc_seed.py

What gets generated

The prepared seed manifest includes the full routing metadata used during generation, including the internal *_id fields.

The published final synthetic dataset keeps:

  • patient_message
  • candidate_labels
  • message_type_name
  • workflow_bucket
  • condition_name
  • is_emergency
  • emergency_band
  • selected_emergency_profile_title
  • persona columns from the Nemotron Personas seed data

The final dataset intentionally drops duplicate message_type_id, condition_id, and selected_emergency_profile_id columns because the paired name/title fields carry the same information more readably.
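The drop step can be sketched as a small pandas post-processing pass. This is an illustrative sketch, not the repo's actual implementation; `finalize_dataset` and the sample row are hypothetical, but the column names match the list above.

```python
import pandas as pd

# Internal routing ids dropped from the published dataset because the
# paired name/title columns carry the same information more readably.
INTERNAL_ID_COLUMNS = [
    "message_type_id",
    "condition_id",
    "selected_emergency_profile_id",
]

def finalize_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Return the published view of the dataset without internal id columns."""
    return df.drop(columns=[c for c in INTERNAL_ID_COLUMNS if c in df.columns])

rows = pd.DataFrame(
    {
        "patient_message": ["Hi, can I get my lab results?"],
        "message_type_id": ["mt_017"],
        "message_type_name": ["lab_result_question"],
    }
)
published = finalize_dataset(rows)
print(list(published.columns))  # the *_id column is gone
```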

Emergency modeling

Emergency-positive rows are not just dramatic trauma cases. The prompt assets are intentionally biased toward realistic routing failures and subtle warning patterns such as:

  • slowly worsening abdominal pain with weight loss or black stools
  • chest discomfort written off as reflux or stress
  • changing or bleeding skin lesions that the patient has ignored
  • progressive fatigue, dizziness, or dyspnea the patient keeps working through
  • worsening mental-health messages that sound restrained instead of dramatic

The emergency_band config defaults to:

  • subtle: 55%
  • moderate: 25%
  • obvious: 20%

You can tune those weights in config/emergency_axis.yaml.
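Conceptually, the band draw is just weighted sampling. A minimal sketch using the documented default weights (the function name and hard-coded weights are illustrative; the real values live in config/emergency_axis.yaml):

```python
import random

# Mirrors the documented defaults in config/emergency_axis.yaml.
EMERGENCY_BAND_WEIGHTS = {"subtle": 0.55, "moderate": 0.25, "obvious": 0.20}

def sample_emergency_band(rng: random.Random) -> str:
    """Draw one emergency_band value according to the configured weights."""
    bands = list(EMERGENCY_BAND_WEIGHTS)
    weights = list(EMERGENCY_BAND_WEIGHTS.values())
    return rng.choices(bands, weights=weights, k=1)[0]

rng = random.Random(7)
draws = [sample_emergency_band(rng) for _ in range(10_000)]
print(draws.count("subtle") / len(draws))  # roughly 0.55
```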

Setup with uv

uv venv
source .venv/bin/activate
uv sync
cp .env.example .env

Fill in .env with the keys you actually use. In this repo:

  • OPENROUTER_API_KEY is used for OpenRouter model calls.
  • GEMINI_API_KEY is used for Gemini model calls.
  • NGC_API_KEY is used by the persona download script if present.
  • HF_TOKEN is used by the Hugging Face upload script if present.

Download the Nemotron Personas dataset

./scripts/download_personas.sh

By default the script downloads:

nvidia/nemotron-personas/nemotron-personas-dataset-en_us:0.0.2

Behavior notes:

  • No NGC CLI is required; the script downloads directly from the NGC API with curl.
  • The script automatically loads .env if it exists.
  • If guest access is unavailable, the script uses NGC_API_KEY from .env or your shell environment.
  • The default destination is repo-local data/raw/.
  • For en_US, the default full-file path is data/raw/nemotron-personas-dataset-en_us/en_US.parquet.
  • If that file already exists, the script skips it and does not re-download it.
  • You can override the destination with PERSONAS_DEST=/some/path ./scripts/download_personas.sh.
  • scripts/download_personas_from_ngc.sh remains as a compatibility wrapper around the new script.

Prepare the seed manifest

The seed-builder combines each persona row with sampled workflow, condition, and emergency axes.

Current sampling behavior:

  1. load the persona table from parquet/jsonl/json/csv,
  2. optionally filter persona rows by state,
  3. randomly sample sample_size persona rows with pandas.DataFrame.sample(...),
  4. expand each matched persona into rows_per_persona synthetic message scenarios,
  5. sample message type, condition, is_emergency, and emergency_band from the YAML-configured weights.

By default, the repo reads config/clinic_context.yaml and uses clinic.state as the persona-state filter. Out of the box, that means only Florida personas are used unless you override the filter.

Example:

uv run python scripts/prepare_seed_manifest.py \
  --personas-path data/raw/nemotron-personas-dataset-en_us \
  --output-path data/seed/primary_care_persona_message_seed.parquet \
  --sample-size 500 \
  --rows-per-persona 2 \
  --seed 7

Notes:

  • --sample-size lets you preview on a subset of personas.
  • --rows-per-persona lets you produce multiple message scenarios per persona.
  • --personas-path can point to a parquet/jsonl/json/csv file or a directory that contains one.
  • --persona-state overrides the default clinic-state filter. Example: --persona-state California.
  • --no-persona-state-filter disables state filtering and samples from the full persona table.
  • config/clinic_context.yaml controls the default clinic description and default state filter used by the repo examples.

Run a Data Designer preview

uv run python examples/run_preview_from_ngc_seed.py

The preview script:

  1. loads the prepared seed manifest,
  2. uses the plugin column primary-care-prompt-bundle to render the final message-generation prompt,
  3. generates patient_message with an LLM text column,
  4. generates five hidden intermediate label columns with LLM text columns,
  5. combines those five hidden columns into the final candidate_labels list column with a custom Data Designer column.
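Step 5 is a plain row-level assembly. A minimal sketch of what such a combining function might look like (the column names and de-duplication behavior here are assumptions, not the repo's exact logic):

```python
# Hypothetical assembly step: merge five intermediate label columns into
# one candidate_labels list, dropping blanks and duplicates in order.
LABEL_COLUMNS = [f"candidate_label_{i}" for i in range(1, 6)]

def assemble_candidate_labels(row: dict) -> list[str]:
    labels: list[str] = []
    for column in LABEL_COLUMNS:
        value = (row.get(column) or "").strip()
        if value and value not in labels:
            labels.append(value)
    return labels

row = {
    "candidate_label_1": "medication refill",
    "candidate_label_2": "urgent symptom",
    "candidate_label_3": "medication refill",  # duplicate, dropped
    "candidate_label_4": "",
    "candidate_label_5": "billing question",
}
print(assemble_candidate_labels(row))
# ['medication refill', 'urgent symptom', 'billing question']
```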

Run the full workflow

Use the uv entrypoint when you want a real dataset build instead of a preview. This command:

  1. loads the persona dataset,
  2. applies the default or overridden persona-state filter,
  3. prepares the seed manifest automatically,
  4. configures Data Designer to use the provider/model you selected,
  5. generates the final dataset artifact folder.

Examples:

uv run patient-messages-run \
  --provider openrouter \
  --llm-id openai/gpt-5-nano \
  --max-parallel-requests 4 \
  --dataset-name patient-messages-florida \
  --num-records 1000
uv run patient-messages-run \
  --provider gemini \
  --provider-endpoint https://generativelanguage.googleapis.com/v1beta/openai/ \
  --provider-type openai \
  --api-key-env GEMINI_API_KEY \
  --llm-id gemini-3.1-flash-lite-preview \
  --max-parallel-requests 4 \
  --dataset-name patient-messages-florida \
  --num-records 1000

Optional controls:

  • --persona-state California overrides the default clinic-state filter.
  • --no-persona-state-filter samples from the full persona table.
  • --sample-size and --rows-per-persona control seed-manifest expansion before the final trim to --num-records.
  • --temperature, --top-p, --max-tokens, --max-parallel-requests, and --timeout tune the selected model call.
  • --provider-endpoint, --provider-type, and --api-key-env let you define a non-built-in provider or override a built-in provider.
  • --max-parallel-requests defaults to 4. This repo keeps that default for both OpenRouter and Gemini. Google’s Gemini docs publish project-level RPM/TPM/RPD quotas, not a provider-wide safe concurrency number, and preview models can be more restricted. If you hit 429 or quota errors on Gemini, lower --max-parallel-requests first.

Resume behavior:

  • Runs are stored under data/runs/<dataset-name>/<run-id>/.
  • The run id is derived from the normalized CLI spec plus a digest of the repo inputs (config/, prompts/, src/patient_messages/, pyproject.toml) and the persona source file metadata.
  • Re-running the same effective command reuses the existing completed output instead of regenerating it.
  • If a previous run failed, the workflow reuses the saved seed_manifest.parquet and reruns generation.
  • Pass --reset to delete the matching resumable run and rebuild from scratch.
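The run-id scheme can be pictured as content-addressed hashing. This sketch is an assumption about the shape of the mechanism, not the repo's actual code; `derive_run_id` and the sample digests are hypothetical.

```python
import hashlib
import json

def derive_run_id(cli_spec: dict, input_digests: dict[str, str]) -> str:
    """Hypothetical run-id derivation: hash the normalized CLI spec
    together with digests of repo inputs and persona source metadata."""
    payload = json.dumps(
        {"spec": cli_spec, "inputs": input_digests},
        sort_keys=True,  # normalization: key order must not change the id
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

spec = {"provider": "openrouter", "num_records": 1000}
digests = {"config/": "ab12", "prompts/": "cd34"}
run_id = derive_run_id(spec, digests)
# Same effective command -> same run id -> completed output is reused.
print(run_id)
```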

Each run directory contains:

  • run_spec.json
  • run_state.json
  • seed_manifest.parquet
  • artifacts/<dataset-name>/...

Upload to Hugging Face

The installed data-designer package already includes a Hugging Face upload client. This repo adds a thin CLI wrapper around it so you can upload a completed Data Designer artifact folder with normal command-line args.

The uploader expects a Data Designer artifact directory that contains at least:

  • metadata.json
  • parquet-files/
  • optionally builder_config.json

Example:

uv run patient-messages-upload \
  --dataset-path /path/to/data-designer-output \
  --org your-hf-org \
  --dataset-name patient-messages-florida \
  --description "Synthetic Florida primary-care SMS dataset" \
  --tag healthcare \
  --tag synthetic

Notes:

  • Authentication uses --token, HF_TOKEN, or cached hf auth login credentials.
  • The upload script automatically loads .env, so HF_TOKEN from that file works without extra shell setup.
  • Uploads are public by default; add --private when you want a private dataset repo.
  • The uploaded dataset keeps the seed/persona context columns alongside patient_message and candidate_labels, including workflow metadata such as message type, condition, and emergency fields.
  • The uploader auto-generates README.md for the Hugging Face dataset card from the uploaded parquet files, including split info plus composition tables for message types, conditions, workflow buckets, emergency axes, and patient demographics when those columns are present.
  • The script prints the final dataset URL on success.

Plugin column

The plugin adds a deterministic custom column type:

primary-care-prompt-bundle

Its only job is to render the final prompt by combining:

  • message type prompt
  • condition prompt
  • emergency metadata
  • persona snapshot

That keeps the actual text generation in standard Data Designer LLM columns.
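A deterministic column of this kind is pure string assembly. The sketch below is illustrative only; `render_prompt_bundle`, the section headers, and the emergency wording are assumptions, though the conditional injection mirrors the design described here (emergency guidance appears only on emergency-positive rows).

```python
# Hypothetical renderer in the spirit of the primary-care-prompt-bundle
# column: deterministic string assembly, no LLM call.
def render_prompt_bundle(
    message_type_prompt: str,
    condition_prompt: str,
    is_emergency: bool,
    emergency_band: str,
    persona_snapshot: str,
) -> str:
    sections = [
        "## Message type\n" + message_type_prompt,
        "## Condition\n" + condition_prompt,
        "## Persona\n" + persona_snapshot,
    ]
    # Emergency guidance is injected only for emergency-positive rows.
    if is_emergency:
        sections.append(
            f"## Emergency\nBand: {emergency_band}. Include a matching red flag."
        )
    return "\n\n".join(sections)

prompt = render_prompt_bundle(
    "Ask about a prescription refill.",
    "Type 2 diabetes, on metformin.",
    is_emergency=False,
    emergency_band="subtle",
    persona_snapshot="Retired teacher, brief texter.",
)
print("## Emergency" in prompt)  # False: non-emergency rows omit the section
```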

Prompt design notes

Message type prompts

Each file under prompts/message_types/ describes:

  • the office workflow,
  • what the patient is usually trying to accomplish,
  • details that should appear when natural,
  • workflow-specific style constraints.

Condition prompts

Each file under prompts/conditions/ describes:

  • common symptoms,
  • routine details to mention,
  • emergency escalation guidance,
  • subtle/moderate/obvious urgent patterns.

The emergency sections are meant to be conditionally injected only when is_emergency=true.

Main prompt

prompts/main_prompt.md unifies:

  • workflow/message type,
  • condition,
  • emergency axis,
  • clinic context,
  • persona snapshot.

Label prompt

prompts/label_candidates_prompt.md generates one label at a time. The preview flow creates five hidden intermediate label columns and then assembles them into the final candidate_labels list column.

Tuning ideas

Common edits you may want to make:

  • change the default clinic state or description in config/clinic_context.yaml
  • change the emergency-positive rate in config/emergency_axis.yaml
  • change subtle vs obvious urgent prevalence in config/emergency_axis.yaml
  • reweight message types in config/message_types.yaml
  • reweight conditions in config/conditions.yaml
  • override persona sampling geography with --persona-state or disable it with --no-persona-state-filter
  • bias admin-only or condition-heavy messages by editing the seed-builder logic in seed_builder.py
  • make the clinic more or less formal by editing prompts/main_prompt.md

Suggested workflow for dataset creation

  1. Download and inspect the raw persona data.
  2. Build a prepared seed manifest with your target sampling mix.
  3. Run Data Designer preview until the tone and routing realism look right.
  4. Generate a larger dataset with uv run patient-messages-run.
  5. Keep the gold columns (message_type_name, condition_name, workflow_bucket, is_emergency, emergency_band) for supervised training and evaluation.

Compatibility note

This scaffold targets the current Data Designer plugin shape used by the open-source data-designer package and the dd. import style. If you pin a substantially older release, update the plugin entry point or imports to match that release.

About

Synthetic patient messages to primary care facilities using Data Designer
