An interactive, Python-based CLI for chatting with Hugging Face language models, optimized for casual, Discord-style conversation using ChatML. It supports quantized and full-precision models, live token streaming with color formatting, and runtime adjustment of generation parameters.
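For reference, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers, so a single exchange sent to the model looks like this (the system prompt text here is illustrative, not the tool's actual prompt):

```
<|im_start|>system
you are a casual, discord-style chatter<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
```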
- **Multiple Model Formats**
  - Hugging Face Transformers (`AutoModelForCausalLM`)
  - GGUF (llama.cpp) backend
  - LoRA adapter loading
  - 4-bit / 8-bit quantization with bitsandbytes
- **Custom Prompt Controls**
  - Chain-of-Thought context management
  - Raw blank mode, no-system-prompt mode, or assistant-only modes
  - DeepHermes and ChatML formatting options
  - Optional code detection and filtering
- **Interactive Chat**
  - Multi-line input with `prompt_toolkit`
  - Persistent conversation history (`/back`, `/clear`)
  - Runtime parameter adjustment (`/min`, `/max`, `/temp`, `/p`, `/k`, `/r`, `/rh`)
- **Streaming Output**
  - Token-by-token display with Rich coloring
  - Emoji filtering and cleanup
  - Automatic lowercasing rules
  - EOS-aware extension: starts with a short randomized budget (40–75 tokens), then automatically extends generation in 64-token steps until `<|im_end|>` or EOS is reached, the hard cap (1024 tokens) is hit, or a manual `/stop` is triggered (see the sketch after this list)
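A minimal sketch of how such an extension loop can be built on the Transformers API. The budget, step size, hard cap, and stop tokens follow the description above; the function and variable names are illustrative, not the tool's actual internals:

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mookiezii/Discord-Hermes-3-8B"  # default model from this README
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

STEP, HARD_CAP = 64, 1024
stop_ids = {tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")}

def generate_with_extension(prompt: str) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    budget = random.randint(40, 75)          # short randomized initial budget
    generated = 0
    while generated < HARD_CAP:
        out = model.generate(ids, max_new_tokens=budget, do_sample=True)
        new_tokens = out[0, ids.shape[1]:]   # tokens produced in this step
        generated += new_tokens.shape[0]
        ids = out
        if any(t.item() in stop_ids for t in new_tokens):
            break                            # <|im_end|> or EOS reached
        budget = STEP                        # otherwise extend in 64-token steps
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```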
Install with requirements.txt:

```bash
pip install -r requirements.txt
```

Or install manually:

```bash
pip install torch transformers peft bitsandbytes prompt_toolkit rich
```

If using GGUF (llama.cpp) models:

```bash
pip install llama-cpp-python
```
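The `--gguf` path runs models through the llama.cpp backend. A minimal llama-cpp-python sketch of an equivalent ChatML-formatted streaming call (the model path is a placeholder; sampling values are illustrative):

```python
from llama_cpp import Llama

# Placeholder path; any ChatML-trained GGUF model works the same way.
llm = Llama(model_path="model.gguf", chat_format="chatml", n_ctx=4096)

stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=75,
    temperature=0.7,
    stream=True,  # yields token-by-token chunks, as in the interface's streaming output
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
```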
```
usage: interface.py [-h] [-m MODEL] [-q [QUANT]] [-fl FROZEN_LORA]
                    [-c CHECKPOINT] [-chs CHECKPOINT_SUBFOLDER]
                    [--deephermes] [--gguf] [--gguf-chat-format FORMAT]
                    [--blank] [-asc] [-as] [--just-system-prompt]
                    [--no-system-prompt] [--no-assistant-prompt]
                    [--code-check] [-au] [--custom-tokens]

optional arguments:
  -h, --help            Show this help message and exit
  -m MODEL, --model MODEL
                        Model path or Hugging Face repo ID
                        (default: mookiezii/Discord-Hermes-3-8B)
```
Feature toggles:

| Flag | Description |
|---|---|
| `-m, --model` | Model path or Hugging Face repo ID (default: `mookiezii/Discord-Hermes-3-8B`) |
| `-q, --quant` | Quantization mode: 4 or 8 (default: off). Use `-q` with no value for 4-bit, or `-q 8` for 8-bit (see the sketch after this table) |
| `-fl, --frozen-lora` | Path or Hugging Face repo ID of the base LoRA adapter to load and freeze |
| `-c, --checkpoint` | Path or Hugging Face repo ID of the LoRA adapter to load |
| `-chs, --checkpoint-subfolder` | Subfolder of the LoRA adapter path or repo ID |
| `--deephermes` | Enable DeepHermes formatting instead of ChatML |
| `--gguf` | Use the GGUF model format with the llama.cpp backend |
| `--gguf-chat-format` | Chat format for GGUF models (default: `chatml`) |
| `--blank` | Raw user input only, with no prompts or system context |
| `-asc, --assistant-system-combo` | Include both the system and assistant system prompts |
| `-as, --assistant-system` | Use the assistant system prompt instead of the standard one |
| `--just-system-prompt` | Use only the system prompt with user input |
| `--no-system-prompt` | Do not include the system prompt |
| `--no-assistant-prompt` | Do not include the assistant prompt |
| `--code-check` | Enable code detection and filtering via a classifier |
| `-au, --auto` | Run preset inputs (hello → what do you do → wow tell me more) 5 times with `/clear` in between, then exit |
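A minimal sketch of what the `-q` and `-c` paths correspond to in the Transformers/bitsandbytes/PEFT APIs. This mirrors the flag descriptions above rather than the tool's exact loading code, and the adapter repo and subfolder names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

model_id = "mookiezii/Discord-Hermes-3-8B"

# -q        -> 4-bit; -q 8 -> 8-bit (quantization is off by default)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
# For 8-bit instead: BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# -c / -chs -> load a LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    model, "your-user/your-lora-adapter", subfolder="checkpoint-1000"
)
```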
- `MIN_NEW_TOKENS` = 1
- `MAX_NEW_TOKENS` = `random.randint(40, 75)`
- `TEMPERATURE` = `random.uniform(0.5, 0.9)`
- `TOP_P` = `random.uniform(0.7, 0.9)`
- `TOP_K` = `random.randint(40, 75)`
- `MIN_P` = 0.08
- `NO_REPEAT_NGRAM_SIZE` = 3
- `REPETITION_PENALTY` = 1.2
- EOS handling = `<|im_end|>` and `tokenizer.eos_token_id` (extension continues until one is reached, or the hard cap of 1024 tokens)

These map directly onto `generate()` sampling arguments, as sketched below.
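A sketch of how these defaults translate into a `model.generate()` call, assuming `model` and `tokenizer` are loaded as in the sketches above and a transformers release recent enough to support `min_p`; the prompt is illustrative:

```python
import random

# `model` and `tokenizer` as loaded above; the prompt is a placeholder.
input_ids = tokenizer("hello", return_tensors="pt").input_ids.to(model.device)

output = model.generate(
    input_ids,
    do_sample=True,
    min_new_tokens=1,
    max_new_tokens=random.randint(40, 75),   # short randomized initial budget
    temperature=random.uniform(0.5, 0.9),
    top_p=random.uniform(0.7, 0.9),
    top_k=random.randint(40, 75),
    min_p=0.08,                              # requires a transformers release with min_p support
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    eos_token_id=[tokenizer.eos_token_id,
                  tokenizer.convert_tokens_to_ids("<|im_end|>")],
)
```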
| Command | Description |
|---|---|
| `/clear`, `/reset`, `/c` | Clear conversation history |
| `/back`, `/b` | Undo the last user+assistant exchange and preview recent history |
| `/h VAL` | Enable Chain-of-Thought with the last VAL exchanges (default: all available) |
| `/d` | Disable Chain-of-Thought |
| `/min VAL` | Set `min_new_tokens` to VAL |
| `/max VAL` | Set `max_new_tokens` to VAL |
| `/temp VAL`, `/t VAL` | Set temperature to VAL |
| `/p VAL` | Set `top_p` to VAL |
| `/k VAL` | Set `top_k` to VAL |
| `/params`, `/settings` | Show current generation parameters |
| `/r` | Randomize parameters (short-range defaults) |
| `/rh` | Randomize parameters with high variance (wider temperature/`top_p`/`top_k` ranges) |
| `/stop` | Toggle extension on/off (controls continuation beyond the initial budget) |
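A minimal sketch of how such a command loop can be wired up with prompt_toolkit; the dispatch logic and parameter names are illustrative, not the tool's actual implementation:

```python
from prompt_toolkit import PromptSession

session = PromptSession()
params = {"temperature": 0.7, "top_p": 0.9, "top_k": 50,
          "min_new_tokens": 1, "max_new_tokens": 64}
history: list[dict] = []

while True:
    text = session.prompt("> ").strip()
    if text in ("/clear", "/reset", "/c"):
        history.clear()
    elif text.startswith(("/temp ", "/t ")):
        params["temperature"] = float(text.split()[1])
    elif text.startswith("/p "):
        params["top_p"] = float(text.split()[1])
    elif text.startswith("/k "):
        params["top_k"] = int(text.split()[1])
    elif text in ("/params", "/settings"):
        print(params)
    else:
        history.append({"role": "user", "content": text})
        # ... format history with ChatML, call generate(), stream the reply ...
```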
MIT License
