data-fabrication

Conversation Dataset Generator – WASM Evaluation Module for Platform


Data Fabrication is a WASM evaluation module for generating and validating AI training datasets on the Bittensor network. It runs inside platform validators to evaluate miner submissions that produce conversation datasets. Miners submit Python harnesses that generate synthetic conversations, and the network scores them through a multi-stage pipeline including AST structural similarity checks and LLM-based plagiarism detection.


TL;DR

# Build WASM module
cargo build --release --target wasm32-unknown-unknown -p data-fabrication-wasm

# Build CLI tool
cargo build --release -p df-cli

# Run the TUI monitor
df-cli monitor

System Architecture

flowchart LR
    Miner[Miner] -->|Submit Python Harness| RPC[Validator RPC]
    RPC --> Validators[Validator Network]
    Validators --> WASM[data-fabrication WASM]
    WASM --> Storage[(Blockchain Storage)]
    Validators --> Executor[df-executor]
    Executor -->|Dataset Results| Validators
    Validators -->|Scores + Weights| BT[Bittensor Chain]
    CLI[df-cli TUI] -->|JSON-RPC| RPC
    CLI -->|Display| Monitor[Progress / Logs / Leaderboard]

Evaluation Pipeline

sequenceDiagram
    participant M as Miner
    participant V as Validators
    participant W as WASM Module
    participant E as df-executor
    participant BT as Bittensor

    M->>V: Submit Python harness (JSON)
    V->>W: Store code, validate format
    W-->>V: Validation pass/fail
    V->>W: Run AST similarity check
    W-->>V: Similarity score
    V->>E: Execute harness, generate dataset
    E-->>V: Generated conversations
    V->>W: LLM plagiarism evaluation
    W-->>V: Plagiarism verdict
    V->>W: Compute final score
    V->>BT: Submit weights at epoch boundary

Similarity Detection Flow

flowchart TB
    Code[Python Harness] --> Parse[Parse AST]
    Parse --> Normalize[Normalize Variables]
    Normalize --> Hash[Structure Hash]
    Hash --> Compare[Compare with Others]
    Compare --> LCS[LCS Algorithm]
    LCS --> Score[Similarity Score 0-100]
    Score --> Status{Plagiarism Status}
    Status -->|>= 97%| Plagiarized[Plagiarized]
    Status -->|30-96%| NeedsLLM[NeedsLlmVerification]
    Status -->|< 30%| Clean[Clean]
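The threshold mapping in the flow above can be written as a short sketch. The function name and string verdicts here are illustrative; the module's actual status type is defined in the Rust core.

```python
def plagiarism_status(similarity: int) -> str:
    """Map a 0-100 structural similarity score to a plagiarism verdict."""
    if similarity >= 97:
        return "Plagiarized"
    if similarity >= 30:
        return "NeedsLlmVerification"
    return "Clean"
```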

Features

  • WASM Module: Compiles to wasm32-unknown-unknown, loaded by platform validators
  • AST Structural Similarity: Normalizes Python code and compares structure via LCS algorithm
  • LLM Plagiarism Detection: Retry-enabled LLM inference for semantic comparison
  • Submission Validation: Size limits, format checks, and signature verification
  • Conversation Dataset Generation: Python harnesses produce JSONL conversation datasets
  • Resource Limits: CPU time, memory, and file size constraints for sandboxed execution
  • Plagiarism Clustering: Groups similar submissions by structure hash prefix
  • CLI (df-cli): Native TUI for monitoring evaluations and network status
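The JSONL dataset format mentioned above holds one conversation per line and can be consumed with stdlib Python alone. This is a sketch: the "messages" field name is an assumption for illustration, and the authoritative schema lives in core/src/schema.rs.

```python
import json

def parse_conversations(jsonl_text: str) -> list[dict]:
    """Parse one conversation record per line, skipping blank lines."""
    conversations = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        conversations.append(json.loads(line))
    return conversations
```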

Installation

# Via Platform CLI (recommended)
platform download data-fabrication

# Or build from source
git clone https://github.com/PlatformNetwork/data-fabrication
cd data-fabrication
cargo build --release

Building

# Build WASM module (for platform validators)
cargo build --release --target wasm32-unknown-unknown -p data-fabrication-wasm

# The output .wasm file is at:
# target/wasm32-unknown-unknown/release/data_fabrication_wasm.wasm

# Build CLI (native)
cargo build --release -p df-cli

# Build executor
cargo build --release -p df-executor

# Build all workspace members
cargo build --release --workspace

Usage

# Launch interactive TUI (connects to https://chain.platform.network)
df-cli monitor

# Submit a Python harness
df-cli submit --harness ./my-harness/

# Check submission status
df-cli status --hotkey 5Abc...

# Monitor a specific miner
df-cli --hotkey 5GrwvaEF... monitor

# Custom RPC endpoint
df-cli --rpc-url http://localhost:8080 monitor

Subcommands: submit · status · monitor (default)

TUI Controls: Tab/Shift+Tab switch tabs · ↑/↓ scroll · r refresh · q quit


Architecture

data-fabrication/
├── wasm/                   # WASM evaluation module (compiled to wasm32-unknown-unknown)
│   └── src/
│       ├── lib.rs              # Challenge trait implementation
│       └── types.rs            # Submission and config types
├── core/                   # Shared types (no_std compatible)
│   └── src/
│       ├── lib.rs              # Domain types (HarnessSubmission, GeneratedDataset)
│       ├── ast_similarity.rs   # AST normalization, structure hashing, LCS comparison
│       ├── ast_validation.rs   # Python code security validation
│       ├── scoring_types.rs    # Score types (ConversationScore, DatasetScore)
│       ├── schema.rs           # JSONL parsing for conversation datasets
│       ├── consensus.rs        # Multi-validator consensus
│       ├── cache.rs            # Evaluation result caching
│       └── resource_limits.rs  # CPU, memory, file constraints
├── executor/               # Native execution engine
│   └── src/
│       ├── lib.rs              # Executor entry point
│       └── llm_inference.rs    # LLM client with retry logic
├── cli/                    # Native TUI monitoring tool
│   └── src/
│       ├── main.rs             # Entry point, event loop
│       ├── app.rs              # Application state
│       ├── ui.rs               # Ratatui UI rendering
│       └── rpc.rs              # JSON-RPC 2.0 client
├── server/                 # Native HTTP server
│   └── src/
│       └── main.rs             # HTTP evaluation server
└── src/                    # Root crate library
    └── lib.rs                  # HuggingFace dataset handler

How It Works

  1. Miners submit Python harness code via df-cli submit
  2. Platform validators load this WASM module
  3. WASM validates submission format, size limits, and signature
  4. Executor runs the Python harness in a sandboxed environment
  5. Generated conversations are parsed from JSONL output
  6. AST similarity compares submission structure against others using normalized variables and LCS algorithm
  7. LLM inference performs semantic plagiarism detection with retry logic
  8. Final score combines dataset quality and originality metrics
  9. Validators submit weights to Bittensor at epoch boundaries
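Step 8 combines dataset quality and originality into one score. A minimal sketch, assuming a simple weighted blend (the 0.7/0.3 split is an assumption for illustration, not the module's actual formula):

```python
def final_score(quality: float, originality: float, quality_weight: float = 0.7) -> float:
    """Blend dataset quality and originality into a single score.

    The weighting here is hypothetical; the real scoring logic lives in
    the module's Rust core.
    """
    return quality_weight * quality + (1.0 - quality_weight) * originality
```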

AST Similarity System

The plagiarism detection uses a two-pass approach:

Pass 1: Structural Comparison

# Original code
x = 1
y = x + 2

# After normalization (both produce identical AST)
a = 1
b = a + 2

Variables are normalized to var_0, var_1, etc., and the AST structure is hashed. Submissions with matching structure hashes are clustered for detailed comparison.
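The normalization step can be reproduced with Python's stdlib ast module. This sketch renames variables in order of first appearance and hashes the dumped tree; it is illustrative only — the production implementation is the Rust code in core/src/ast_similarity.rs.

```python
import ast
import hashlib

def structure_hash(source: str) -> str:
    """Rename variables to var_0, var_1, ... then hash the normalized AST."""
    tree = ast.parse(source)
    mapping: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if node.id not in mapping:
                mapping[node.id] = f"var_{len(mapping)}"
            node.id = mapping[node.id]  # rewrite in place
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```

With this scheme the two snippets above hash identically, so renaming variables alone cannot evade clustering.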

Pass 2: LLM Semantic Analysis

For submissions in the 30–96% similarity band, an LLM evaluates:

  • Logic flow patterns
  • Naming convention similarities
  • Comment and docstring patterns
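Because LLM calls can fail transiently, the executor wraps inference in retry logic (executor/src/llm_inference.rs). A generic sketch of that pattern, with exponential backoff — the function name and parameters are illustrative, not the executor's API:

```python
import time

def with_retry(call, attempts: int = 3, delay: float = 1.0):
    """Invoke a flaky callable, retrying with exponential backoff."""
    last_err = None
    for i in range(attempts):
        try:
            return call()
        except Exception as err:
            last_err = err
            time.sleep(delay * (2 ** i))  # back off before the next attempt
    raise last_err
```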


License

Apache-2.0

About

🔬 Q4 2025: data-fabrication is a challenge project from the Platform subnet, where developers are incentivized to create diverse, high-performance datasets. Datasets are evaluated in isolated environments, rewarded based on quality and utility, and continuously improved through encrypted, competitive collaboration.
