AI Desktop Mentor is a Python-based desktop automation tool that emulates human-like interaction with a computer. It uses YOLOv8 for UI element detection, Vosk for offline speech recognition, and DistilBERT for natural language processing (NLP) to perform tasks such as opening applications, navigating websites, logging in, and processing screenshots. A Tkinter GUI provides access to voice commands, task scripting, and automated workflows, which makes the tool suitable for business automation and personal productivity.
- Automation: Open apps (e.g., Chrome, Notepad), type text, navigate URLs, and log in to websites.
- Screenshots: Capture manually (`Ctrl+Shift+S`) or automatically (every 15 minutes).
- AI Navigation: YOLOv8 detects UI elements (e.g., login fields); OCR reads screen text.
- Task Scripting: Execute sequences defined in `tasks.json`.
- Voice Control: Use offline voice commands via Vosk.
- NLP Understanding: Parse natural language with DistilBERT.
- Context Awareness: Detect CAPTCHAs/pop-ups with OCR.
- Cross-Platform: Works on Windows, macOS, and Linux.
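
The automatic capture mentioned in the list above (one screenshot every 15 minutes) could be driven by a small background timer. The sketch below is one possible approach, not the project's actual implementation: `screenshot_path`, `PeriodicCapture`, and the `capture` callback are hypothetical names, and the timestamped file name format is an assumption.

```python
import threading
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


def screenshot_path(out_dir: str, prefix: str = "auto",
                    now: Optional[datetime] = None) -> Path:
    """Build a timestamped file name such as auto_20240102_030405.png (assumed format)."""
    now = now or datetime.now()
    return Path(out_dir) / f"{prefix}_{now:%Y%m%d_%H%M%S}.png"


class PeriodicCapture:
    """Call `capture(path)` every `interval_s` seconds until stop() is called."""

    def __init__(self, capture: Callable[[Path], None],
                 interval_s: float = 15 * 60,
                 out_dir: str = "outputs/screenshots") -> None:
        self.capture = capture
        self.interval_s = interval_s
        self.out_dir = out_dir
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait() returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self.interval_s):
            self.capture(screenshot_path(self.out_dir))

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

In the real tool the callback would presumably wrap `pyautogui.screenshot`, for example `PeriodicCapture(lambda p: pyautogui.screenshot(str(p)))`. Passing the callback in keeps the timer logic testable without a display.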
AIDesktopMentor/
├── automation/
│ └── automation_tool.py
├── config/
│ └── tasks.json
├── docs/
│ ├── README.md
│ └── requirements.txt
├── models/
│ ├── yolo_ui_model.pt
│ └── vosk-model-small-en-us/
├── outputs/
│ └── screenshots/
├── dataset/
│ ├── images/
│ │ ├── train/
│ │ └── val/
│ ├── labels/
│ │ ├── train/
│ │ └── val/
│ └── data.yaml
graph TD
A[User Input] --> B[GUI Tkinter]
A --> C[Voice Listener Vosk]
C --> D[NLP Parser DistilBERT]
D --> E[Command Processor]
B --> E
E --> F[Automation Engine PyAutoGUI]
E --> G[UI Detection YOLOv8]
E --> H[Screenshot Module]
E --> I[OCR Pytesseract]
E --> J[Check Popups]
F --> K[OS Interaction]
H --> L[Save to screenshots/]
I --> M[Context Feedback]
K --> N[Screen Output]
M --> N
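
In the diagram above, both the GUI and the voice path feed a single command processor. One common way to join two input sources safely is a thread-safe queue with one consumer. The following is a minimal sketch under that assumption; `run_processor`, `STOP`, the `source` field, and the handler functions are hypothetical and are not taken from `automation_tool.py`.

```python
import queue
import threading
from typing import Callable, Dict, List

STOP = object()  # sentinel that tells the processor thread to exit


def run_processor(inbox: "queue.Queue", handlers: Dict[str, Callable[[dict], str]],
                  log: List[str]) -> None:
    """Consume commands one at a time and dispatch them by their "action" field."""
    while True:
        cmd = inbox.get()
        if cmd is STOP:
            break
        handler = handlers.get(cmd["action"])
        log.append(handler(cmd) if handler else f"unknown action: {cmd['action']}")


# Stub handlers stand in for the automation engine and the screenshot module.
handlers = {
    "open": lambda c: f"open {c['app']}",
    "screenshot": lambda c: "screenshot saved",
}

inbox: "queue.Queue" = queue.Queue()
log: List[str] = []
worker = threading.Thread(target=run_processor, args=(inbox, handlers, log))
worker.start()

inbox.put({"action": "open", "app": "chrome", "source": "gui"})   # e.g. a button click
inbox.put({"action": "screenshot", "source": "voice"})            # e.g. a parsed voice command
inbox.put(STOP)
worker.join()
```

Because a single thread consumes the queue, commands run in the order they arrive, regardless of which input produced them.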
graph TD
A[Business User] --> B[Define Task]
B -->|Manual| C[GUI Interaction]
B -->|Automated| D[Configure tasks.json]
B -->|Voice| E[Voice Command]
C --> F[Execute Task]
D --> F
E --> F
F -->|Open App| G[Access System]
F -->|Login| H[Authenticate]
F -->|Navigate| I[Access Resource]
F -->|Screenshot| J[Generate Report]
H -->|YOLO Detection| I
I --> K[Perform Business Function]
J --> L[Save Output]
K --> M[Business Outcome]
L --> M
- Python 3.8+
- Tesseract OCR
  - Windows: Install
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`
- Vosk Model
  - Download `vosk-model-small-en-us`
  - Extract into `models/vosk-model-small-en-us/`
- YOLO Model
  - Use `yolov8n.pt` or a custom-trained model saved as `yolo_ui_model.pt`
- Python dependencies: `pip install -r requirements.txt`
- Microphone access and permissions (macOS/Linux screen recording/input).
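
Once the items above are installed and the repository is set up, a short script can confirm that they are in place before the tool is launched. This is a minimal sketch: `check_prerequisites` is a hypothetical helper, and it assumes that the `tesseract` binary is on `PATH` and that the models live at the locations shown in the project tree.

```python
import shutil
import sys
from pathlib import Path
from typing import Dict, Tuple


def check_prerequisites(root: str = ".",
                        min_python: Tuple[int, int] = (3, 8)) -> Dict[str, bool]:
    """Return a name -> ok mapping for each prerequisite."""
    base = Path(root)
    return {
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not found on PATH
        "tesseract": shutil.which("tesseract") is not None,
        "vosk_model": (base / "models" / "vosk-model-small-en-us").is_dir(),
        "yolo_model": (base / "models" / "yolo_ui_model.pt").is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```

Run it from the repository root. The check does not cover microphone or screen-recording permissions, which still have to be granted through the operating system.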

Clone the repository, create the folder structure, and install the dependencies:

    # Clone the repo
    git clone https://github.com/moses000/AIDesktopMentor.git
    cd AIDesktopMentor

    # Set up folder structure
    mkdir -p outputs/screenshots dataset/images/train dataset/images/val dataset/labels/train dataset/labels/val

    # Install dependencies
    pip install -r requirements.txt

Don't forget to install Tesseract OCR, the Vosk model, and the YOLO model.
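To start from the pretrained model, load it with Ultralytics:

    from ultralytics import YOLO
    model = YOLO("yolov8n.pt")  # downloads yolov8n.pt if it is not already present

Then move the weights into `models/`:

    mv yolov8n.pt models/yolo_ui_model.pt

The pretrained weights are trained on COCO object classes rather than UI elements, so accuracy for UI tasks may be limited.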
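
Whichever weights are used, a detection has to be turned into a screen coordinate before the automation engine can click it. The sketch below shows that step in isolation. It assumes detections arrive as `(x1, y1, x2, y2, confidence, class_id)` tuples in screenshot pixels and that class IDs follow the three-class order used in this README's `data.yaml`; the tuple layout, the `click_targets` name, and the 0.5 threshold are illustrative assumptions, not the project's code.

```python
from typing import Dict, Iterable, Tuple

# Class order assumed to match data.yaml: index == class ID.
CLASS_NAMES = ["username_field", "password_field", "login_button"]

Detection = Tuple[float, float, float, float, float, int]


def click_targets(detections: Iterable[Detection],
                  min_conf: float = 0.5) -> Dict[str, Tuple[int, int]]:
    """Keep the most confident box per class and return its center point."""
    best: Dict[str, Detection] = {}
    for det in detections:
        conf, cls = det[4], det[5]
        # Skip weak detections and class IDs the model was not trained on.
        if conf < min_conf or not 0 <= cls < len(CLASS_NAMES):
            continue
        name = CLASS_NAMES[cls]
        if name not in best or conf > best[name][4]:
            best[name] = det
    return {
        name: (int((d[0] + d[2]) / 2), int((d[1] + d[3]) / 2))
        for name, d in best.items()
    }
```

Each returned point could then be passed to a click call such as `pyautogui.click(x, y)`. A class that is missing from the result means nothing passed the confidence threshold, which is a useful signal that the model or the class IDs need checking.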
- Capture screenshots

      import pyautogui, time

      # Save 100 screenshots, two seconds apart
      for i in range(100):
          pyautogui.screenshot(f"dataset/images/train/login_{i}.png")
          time.sleep(2)

- Label with LabelImg

      pip install labelImg
      labelImg dataset/images/train dataset/labels/train

- Create `data.yaml`

      train: dataset/images/train/
      val: dataset/images/val/
      nc: 3
      names: ['username_field', 'password_field', 'login_button']

- Train

      from ultralytics import YOLO
      model = YOLO("yolov8n.pt")
      model.train(data="dataset/data.yaml", epochs=50, imgsz=640, batch=16)

- Save model

      # Ultralytics writes detection runs to runs/detect/train/ by default
      # (train2, train3, ... on later runs); adjust the path to match your run.
      cp runs/detect/train/weights/best.pt models/yolo_ui_model.pt

Run the tool:

    python automation/automation_tool.py

- Open Notepad & type
- Execute `tasks.json`
- Take screenshot
- OCR read screen
- Login via GUI
- Enable voice commands
- "open Chrome"
- "go to example.com"
- "log in to example.com"
- "type hello world"
- "take screenshot"
- "read text"
- "execute tasks"
- "stop listening"
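
The phrases listed above each map to one action plus, at most, one argument. The project parses them with DistilBERT; the sketch below swaps in plain regular expressions as a lightweight stand-in, to show the kind of structured command the parser is expected to produce. `parse_command` and the action names `type`, `ocr`, `execute_tasks`, and `stop_listening` are assumptions; `open`, `navigate`, `login`, and `screenshot` follow the `tasks.json` example in this README.

```python
import re
from typing import Optional

# (pattern, action) pairs, tried in order; matching ignores case.
_PATTERNS = [
    (re.compile(r"^open (?P<app>.+)$", re.IGNORECASE), "open"),
    (re.compile(r"^go to (?P<url>\S+)$", re.IGNORECASE), "navigate"),
    (re.compile(r"^log ?in to (?P<url>\S+)$", re.IGNORECASE), "login"),
    (re.compile(r"^type (?P<text>.+)$", re.IGNORECASE), "type"),
    (re.compile(r"^take (?:a )?screenshot$", re.IGNORECASE), "screenshot"),
    (re.compile(r"^read text$", re.IGNORECASE), "ocr"),
    (re.compile(r"^execute tasks$", re.IGNORECASE), "execute_tasks"),
    (re.compile(r"^stop listening$", re.IGNORECASE), "stop_listening"),
]


def parse_command(phrase: str) -> Optional[dict]:
    """Turn a recognized phrase into a command dict, or None if nothing matches."""
    text = " ".join(phrase.split())  # collapse repeated whitespace
    for pattern, action in _PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        cmd = {"action": action, **match.groupdict()}
        if "app" in cmd:
            cmd["app"] = cmd["app"].lower()
        # Spoken URLs usually lack a scheme; https is assumed here.
        if "url" in cmd and not cmd["url"].startswith(("http://", "https://")):
            cmd["url"] = "https://" + cmd["url"]
        return cmd
    return None
```

For example, `parse_command("go to example.com")` yields a `navigate` command with the URL `https://example.com`. A rule-based parser like this only accepts the exact phrasings listed, which is the limitation an NLP model is meant to remove.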

Example `tasks.json`:

    [
      {
        "action": "open",
        "app": "chrome"
      },
      {
        "action": "navigate",
        "url": "https://example.com"
      },
      {
        "action": "login",
        "url": "https://example.com"
      },
      {
        "action": "screenshot",
        "prefix": "login_task"
      }
    ]

- Permissions: macOS/Linux may need screen/microphone/input access.
- YOLO: Required for login automation.
- Vosk: Ensure correct folder structure in `models/`.
- Performance Tip: Keep automation interval ≥ 5s to avoid resource strain.
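
A task file in the format shown above can be checked before it runs, and the 5-second pacing from the performance tip can be applied between steps. The sketch below illustrates both under stated assumptions: `load_tasks`, `run_tasks`, and the `REQUIRED_FIELDS` table are hypothetical, and the required fields are inferred from the example file rather than from the tool's source.

```python
import json
import time
from typing import Callable, Dict, List

# Fields each action needs, inferred from the example; "prefix" is treated as optional.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "open": ["app"],
    "navigate": ["url"],
    "login": ["url"],
    "screenshot": [],
}


def load_tasks(text: str) -> List[dict]:
    """Parse the JSON text and reject unknown actions or missing fields."""
    tasks = json.loads(text)
    if not isinstance(tasks, list):
        raise ValueError("tasks.json must contain a JSON array")
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            raise ValueError(f"task {i}: expected an object")
        action = task.get("action")
        if action not in REQUIRED_FIELDS:
            raise ValueError(f"task {i}: unknown action {action!r}")
        missing = [f for f in REQUIRED_FIELDS[action] if f not in task]
        if missing:
            raise ValueError(f"task {i}: missing field(s) {missing}")
    return tasks


def run_tasks(tasks: List[dict], handlers: Dict[str, Callable[[dict], None]],
              interval_s: float = 5.0,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Run tasks in order, pausing `interval_s` seconds between consecutive steps."""
    for i, task in enumerate(tasks):
        if i > 0:
            sleep(interval_s)
        handlers[task["action"]](task)
```

Validating the whole file up front means a typo in the last step is reported before the first step has opened anything. The `sleep` parameter exists so the pacing can be replaced in tests.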
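
Package the tool as a single executable with PyInstaller:

    pip install pyinstaller
    pyinstaller --onefile automation/automation_tool.py

- YOLO Errors: Check `yolo_ui_model.pt` & class IDs
- Vosk Errors: Confirm model directory/mic permissions
- GUI Not Working: Verify Python/Tkinter setup (running `python -m tkinter` should open a small test window)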
- Expand YOLO UI detection classes
- Add reinforcement learning for adaptive workflows
- CAPTCHA solvers
- Larger NLP models (e.g., BERT)
- GUI task builder for `tasks.json`
MIT License
Open issues or submit pull requests on GitHub.
For support, create an issue or email im.imoleayomoses@gmail.com
- Ultralytics for YOLOv8
- Vosk for speech recognition
- Hugging Face for transformers