GradeLens is an end-to-end AI auto-grading system designed to help instructors evaluate student answers quickly, accurately, and consistently. It uses a powerful Retrieval-Augmented Generation (RAG) pipeline, LLMs, and a scalable FastAPI backend to grade student responses based on reference materials, rubrics, and course content.
The platform supports complete academic workflows—course creation, student management, assessment setup, note uploads, exam submissions, grading, feedback generation, and more—automated using AI and backed by robust SQL data models.
- AI-Driven Auto-Grading using RAG, embeddings, and LLM reasoning
- PDF Ingestion + Chunking for course materials
- Rubric-Based Evaluation with customizable parameters
- Parallel Grading Engine that evaluates multiple answers simultaneously
- Complete API Suite for courses, students, assessments, submissions, and grading
- Secure Authentication + User Management
- Full CRUD Support for all major entities (courses, notes, exams, assessments, submissions)
- Python – core backend logic
- FastAPI – REST API framework
- RAG (Retrieval-Augmented Generation) – grading intelligence layer
- LLMs – experimented with OpenAI GPT models, LLaMA, and finally Claude (selected for best accuracy)
- PostgreSQL – database
- SQL + Migrations – Alembic for schema versioning
- Vector Search – document embeddings for retrieval
- Parallel Processing – multi-RAG inference for faster grading
- Auth + Token Management – JWT/Auth workflows
GradeLens aims to automate grading by comparing student answers with instructor-provided materials and rubrics. The system ingests PDFs, converts them into embedding-based vector stores, retrieves relevant content, and evaluates student responses using LLMs combined with rubric rules.
This ensures:
- fast grading
- unbiased scoring
- detailed feedback
- scalable performance with parallel grading
This document explains the full engineering evolution of the GradeLens AI Grader, from early prototypes to the final scalable architecture. Five different approaches were explored, evaluated, and refined. Only the fifth approach proved stable, consistent, and efficient enough for real-world use.
At the end, this document also includes the Pi-Scorer model evaluation step that helped determine the best-performing LLM settings.
Inject strictness as plain text into the grader prompt alongside:
- Question
- Student Response
- Rubrics
- Retrieved Context
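Illustratively, this approach amounted to building one combined prompt, roughly like the following minimal sketch (the function name and field layout are assumptions, not the production code):

```python
def build_grading_prompt(question: str, student_response: str,
                         rubrics: str, context: str, strictness: str) -> str:
    """Approach 1: strictness is injected as one plain-text instruction."""
    return (
        f"Grade the student's answer with {strictness} strictness.\n\n"
        f"Question:\n{question}\n\n"
        f"Student Response:\n{student_response}\n\n"
        f"Rubrics:\n{rubrics}\n\n"
        f"Retrieved Context:\n{context}\n\n"
        "Return JSON with 'score' and 'feedback'."
    )
```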
- Strictness was not understood; the LLM gave inconsistent scoring.
- The lenient setting sometimes produced lower marks than the moderate setting.
- Strictness text was too weak to control behavior.
Not reliable or consistent.
Include detailed strictness rules, the strictness value, and all grading inputs in the prompt.
Slight improvement over Approach 1.
- Still unstable between strictness levels.
- LLM did not reliably map rules into scoring behavior.
Better than Approach 1, but still unpredictable.
Split the system into:
- A Rubrics Agent that generates strictness-based rubrics.
- A Grader Agent that grades using those generated rubrics.
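Conceptually, the two agents were chained like this (a rough sketch; `rubrics_agent` and `grader_agent` are hypothetical stand-ins for the two LLM calls):

```python
# Hypothetical stand-ins for the two LLM-backed agents.
def rubrics_agent(question: str, context: str, strictness: str) -> str: ...
def grader_agent(question: str, answer: str, rubrics: str, context: str) -> dict: ...

def grade_with_generated_rubrics(question: str, answer: str,
                                 context: str, strictness: str) -> dict:
    # Agent 1: generate strictness-specific rubrics for this question.
    rubrics = rubrics_agent(question, context, strictness)
    # Agent 2: grade the answer against the freshly generated rubrics.
    return grader_agent(question, answer, rubrics, context)
```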
- Strict and lenient grading were more aligned.
- Moderate grading was reasonably stable.
- Very resource-heavy.
- Fairness issue: new rubrics were generated for each student.
- No guarantee the generated rubric quality was optimal.
- Reusing generated rubrics still did not solve reliability problems.
Promising concept, but impractical for production.
Create three separate agents:
- Lenient Grader
- Moderate Grader
- Strict Grader
Select agent based on input strictness.
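A minimal sketch of that dispatch (the agent class and its strictness rules are illustrative, not the actual prompts used):

```python
class StrictnessGrader:
    """Placeholder for one of the three dedicated grader agents."""
    def __init__(self, strictness_rules: str):
        self.strictness_rules = strictness_rules

    def grade(self, question: str, answer: str, rubrics: str, context: str) -> dict:
        # In the real system this builds an agent-specific prompt and calls the LLM.
        return {"score": None, "feedback": "<LLM output>"}

# One agent per strictness level, selected purely by the input strictness value.
GRADER_AGENTS = {
    "lenient": StrictnessGrader("Award partial credit generously."),
    "moderate": StrictnessGrader("Follow the rubric as written."),
    "strict": StrictnessGrader("Deduct for every missing rubric point."),
}

def grade(strictness: str, **inputs) -> dict:
    return GRADER_AGENTS[strictness.lower()].grade(**inputs)
```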
More predictable than the previous approaches.
- Repeated recomputation of context and prompts.
- Expensive for multi-question exams.
- Some variation persisted between strictness levels.
Reasonable isolation, but not scalable.
Pre-build a dedicated grader for each exam question, save it, and reuse it for all student submissions.
- Professor creates the exam.
- Grader Initializer creates one grader per question, containing:
  - Retrieved context chunks
  - Rubrics
  - Question text
  - Strictness logic
- All graders are saved as a pickle file.
- Parallel Executor loads the graders and grades each student answer concurrently.
- The output is consistent JSON with score and feedback.
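A simplified sketch of the Grader Initializer step (class and function names are illustrative; the real graders also carry retrieval and prompt state):

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class QuestionGrader:
    """Frozen grading unit built once per exam question and reused for every student."""
    question: str
    rubrics: str
    strictness: str
    context_chunks: list = field(default_factory=list)

    def grade(self, student_answer: str) -> dict:
        # In production this step calls the LLM; the dict mirrors the JSON output above.
        return {"score": 0.0, "feedback": "..."}

def initialize_graders(exam_questions: list, path: str = "graders.pkl") -> list:
    """Run retrieval and rubric assembly once per question, then persist the graders."""
    graders = [QuestionGrader(**q) for q in exam_questions]
    with open(path, "wb") as f:
        pickle.dump(graders, f)
    return graders
```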
- No recomputation of context or rubrics.
- Consistent strictness behavior.
- Same grader used for all students (fairness).
- Very fast for multi-question exams.
- Supports parallelization.
- Lower token usage after initialization.
- Reproducible and stable.
This is the final and production-ready architecture for GradeLens.
To avoid guesswork, multiple model configurations were evaluated using Pi Labs’ Pi-Scorer, which scores the model across:
- Balanced Evaluation
- Constructive Feedback
- Content Recognition
- Logical Alignment
- Rubric Coverage
- Schema Compliance
- Prompt Fulfillment
- Professional Tone
- Note Grounding and Referencing
- Consistency
- Relevance Enforcement
Each trial used identical:
- Questions
- Rubrics
- Student responses
- Retrieved context
Only model parameters changed.
After several experiments, the best-performing configuration was:
- Temperature 0.7 provides balanced flexibility without hallucinations.
- Top_p 1.0 keeps the full sampling distribution rather than truncating it toward overly deterministic behavior.
- max_tokens 10000 leaves enough headroom for the long, structured output that the large grader prompt requests.
- Produced the most consistent strictness behavior.
- Achieved the highest overall Pi-Scorer performance.
This configuration was the strongest across repeated runs.
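For reference, these parameters map onto the grading call roughly as follows (a sketch using the Anthropic Python SDK; the model name and prompt variable are placeholders, not the exact production values):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

grading_prompt = "..."  # assembled grader prompt: question, rubrics, context, strictness

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; not necessarily the exact model used
    temperature=0.7,                   # balanced flexibility without hallucinations
    top_p=1.0,                         # keep the full sampling distribution
    max_tokens=10000,                  # room for the long structured grading output
    messages=[{"role": "user", "content": grading_prompt}],
)
print(response.content[0].text)
```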
| Approach | Result |
|---|---|
| Approach 1 | Failed; strictness ignored |
| Approach 2 | Still inconsistent |
| Approach 3 | Too heavy; fairness issues |
| Approach 4 | Better but not scalable |
| Approach 5 | Final architecture; consistent and efficient |
Pi-Scorer was essential in selecting the optimal LLM configuration to ensure consistent and accurate grading behavior. You can view the Pi scores that were used to find the best model here: https://github.com/vemuladevendran/GradeLens-Backend/blob/main/RAG%20DEVELOPMENT/data/outputs/pi_scores/scoring_forstudent_answer_correct_a.csv
Below is a detailed, documented progression of the project from initial setup to final implementation.
- Set up basic FastAPI structure
- Built the RAG pipeline prototype
- Added PDF ingestion
- Implemented text splitting + chunking
- Stored chunks in PostgreSQL / vector DB
- Implemented retrieval logic for matching student answers with reference content
- Verified correctness using small test documents
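A minimal sketch of the ingestion and chunking steps above (the library choice, `pypdf`, and the chunk parameters are illustrative assumptions):

```python
from pypdf import PdfReader

def load_pdf_text(path: str) -> str:
    """Extract raw text from an uploaded course-note PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split notes into overlapping chunks; each chunk is later embedded and stored."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```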
- Added processing steps:
  - improved chunking
  - better text cleaning
  - metadata tagging of chunks
- Started testing grading prompts
- Experimented with multiple LLMs:
  - OpenAI GPT models
  - LLaMA
  - Claude (Anthropic)
- Tested different:
  - temperatures
  - max tokens
  - formatting styles
  - grading templates
- Conclusion: Claude performed best in both accuracy and response consistency
- Added rubric evaluation logic
- Created API and schema for:
  - questions
  - answer keys
  - rubric templates
- Implemented strictness levels (multiple attempts)
- After 2–3 weeks of trials, strictness logic was unstable → temporarily removed
- Standardized grading prompt format
- Improved RAG context retrieval
- Set up Alembic migration scripts
- Created tables for:
  - users
  - courses
  - notes
  - assessments
  - exams
  - student submissions
  - grades
- Added CRUD APIs:
  - user creation + authentication
  - course creation, update, delete
  - uploading and editing notes
  - creating assessments and exams
  - student registration
  - submission endpoints
- Connected all backend APIs to RAG pipeline
- Implemented grading flow:
  - student submits exam
  - backend retrieves correct content
  - RAG evaluates each answer
  - feedback + score returned
  - grades stored in DB
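Sketch of how this flow could be wired as a FastAPI endpoint (the route path, schema, and helper functions are hypothetical, shown only to illustrate the sequence):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SubmissionIn(BaseModel):
    student_id: int
    answers: list[str]

# Hypothetical helpers standing in for the RAG retrieval, LLM grading, and DB layers.
def retrieve_context(exam_id: int, answer: str) -> str: ...
def grade_with_llm(answer: str, context: str) -> dict: ...
def store_grades(student_id: int, exam_id: int, results: list) -> None: ...

@app.post("/exams/{exam_id}/submissions")
async def submit_and_grade(exam_id: int, submission: SubmissionIn):
    """Student submits answers; each is retrieved-against, graded, and stored."""
    results = [grade_with_llm(a, retrieve_context(exam_id, a)) for a in submission.answers]
    store_grades(submission.student_id, exam_id, results)
    return {"exam_id": exam_id, "grades": results}
```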
- Added endpoints for:
  - editing submissions
  - viewing grades
  - updating course/assessment/exam details
  - deleting notes
- Fixed multiple bugs related to:
  - chunking issues
  - retrieval mismatches
  - token limits
  - database joins
- Designed a parallel RAG system (see the sketch after this list)
- If a student submits 4 questions → the system runs 4 RAG grading processes simultaneously
- Result:
  - huge reduction in grading time
  - more scalability for large exams
- Added performance logging
- Final end-to-end testing completed
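A hedged sketch of the parallel step, reusing the pre-built `QuestionGrader` objects from the Approach 5 sketch (threads are a reasonable fit here on the assumption that each grading call is I/O-bound on the LLM API):

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def grade_submission_parallel(student_answers: list, path: str = "graders.pkl") -> list:
    """Grade every answer in one submission concurrently with the saved graders."""
    with open(path, "rb") as f:
        graders = pickle.load(f)
    # One worker per question: a 4-question exam spawns 4 concurrent grading calls.
    with ThreadPoolExecutor(max_workers=len(graders)) as pool:
        futures = [pool.submit(g.grade, ans) for g, ans in zip(graders, student_answers)]
        return [f.result() for f in futures]
```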
- End-to-end grading pipeline
- Complete REST API suite
- Course/notes/exam workflow
- Parallel grading
- Cloud LLM integration
- Migrations + stable DB schema
- Re-implement strictness level (more refined version)
- Additional rubric customization
- Plagiarism detection module
- Multi-model fallback system
- Instructor uploads PDF notes
- System chunks + embeds the content
- Students answer exam questions
- For each answer, RAG retrieves relevant context
- LLM grades based on rubric + reference materials
- Parallel processing speeds up grading
- Grades + feedback stored in DB
- Devendran Vemula – Backend, Frontend
- Srinivasan Poonkundran – RAG development, Backend APIs Integration
- Tejasree Nimmagadda – Document Preparation, Data Analysis




