Evaluating Trust and Inclusivity: A Machine-Driven Benchmark for Large Language Model Chatbots in LGBTQ+ Suicide Prevention
Research Paper Repository - Submitted to AI & Society (Springer Nature)
Interactive Benchmark Tool: http://crmforrealty.com/
We are developing the gAyl BENCHMARK TOOL web application that extends this research for broader accessibility:
- Ethical Analysis: Interactive LGBTQ+ inclusivity assessment
- Inclusivity Metrics: Real-time diversity evaluation
- Text Complexity: Dynamic readability analysis
- Sentiment Analysis: Advanced emotional tone evaluation
https://github.com/ZhaoJackson/AI_Response_Evaluation_Benchmark
This companion repository contains the automated web application benchmark that implements our evaluation pipeline with:
- Flask Web Interface: Interactive evaluation platform
- REST API Endpoints: Programmatic access to evaluation functions
- Automatic Database: CSV-based data collection and tracking
- Real-time Statistics: Live evaluation metrics and history
- Reinforcement Learning Integration: Automated data collection for model improvement
The web application serves as a practical implementation of our research pipeline, enabling:
- Live Chatbot Testing: Real-time evaluation of AI responses
- Data Collection: Automated database building for future research
- Community Access: Broader accessibility to evaluation tools
- Pipeline Improvement: Continuous enhancement based on usage data
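As a minimal sketch of the CSV-based data collection described above (the file name and column names here are illustrative assumptions, not the companion repository's actual schema):

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical log file and columns; the web application's real schema may differ.
DB_PATH = "evaluation_log.csv"
FIELDS = ["timestamp", "chatbot", "metric", "score"]

def log_evaluation(chatbot: str, metric: str, score: float, path: str = DB_PATH) -> None:
    """Append one evaluation record, writing a header row on first use."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "chatbot": chatbot,
            "metric": metric,
            "score": round(score, 4),
        })

log_evaluation("DeepSeek", "ethical_alignment", 0.89)
```

An append-only CSV like this is deliberately simple: each web request can add one row without locking a database server, and the accumulated file supports the longitudinal analysis mentioned above.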
For collaboration and web application access: Contact Zichen Zhao (zz3119@columbia.edu)
This repository contains the complete implementation and evaluation system for our research paper examining AI chatbot effectiveness in LGBTQ+ mental health and suicide prevention contexts. The study compares AI-generated responses to expert-crafted human references across six comprehensive metrics: lexical overlap, semantic similarity, ethical alignment, emotional tone, cultural inclusivity, and communication accessibility. Our goal is to ensure AI chatbots provide supportive, unbiased, and ethically sound assistance for vulnerable LGBTQ+ populations in crisis situations.
With mental health chatbots increasingly being used in healthcare, it is vital that they respond with sensitivity, particularly toward vulnerable populations like LGBTQ+ individuals. This project evaluates AI responses in critical mental health scenarios to identify areas where AI responsiveness and empathy can improve. This evaluation highlights gaps in chatbot response quality to foster advancements in AI support for LGBTQ+ mental health.
```
Text-Reference-AIChatbot/
├── main.py                          # Main execution script - START HERE
├── requirements.txt                 # Python dependencies
├── LICENSE                          # Academic research license
├── README.md                        # This overview document
├── .gitignore                       # Git ignore configuration
│
└── src/
    ├── commonconst.py               # System constants (214 parameters)
    │
    ├── data/                        # Input data and processing
    │   ├── data_processing.py       # DOCX → CSV conversion
    │   ├── Test Reference Text.docx # Human expert responses
    │   └── Test Chatbot text.docx   # 11 AI chatbot responses
    │
    ├── utils/                       # Core evaluation system
    │   ├── evaluation_algo.py       # 6 evaluation algorithms
    │   ├── weights.py               # Weight justification (703 lines)
    │   └── user_guide.py            # Complete implementation guide
    │
    └── outputs/                     # Generated results
        ├── processed_*.csv          # Structured data files
        ├── evaluation_scores.csv    # Final evaluation results
        └── Plots/                   # 6 visualization charts
            ├── ethical_alignment_score.png
            ├── inclusivity_score.png
            ├── sentiment_distribution_score.png
            └── [3 more charts]
```
- Clone and setup:

  ```bash
  git clone https://github.com/ZhaoJackson/Text-Reference-AIChatbot.git
  cd Text-Reference-AIChatbot
  python -m venv venv && source venv/bin/activate  # create virtual environment
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  python -c "import nltk; nltk.download('punkt'); nltk.download('cmudict')"
  ```

- Run evaluation:

  ```bash
  python main.py  # complete pipeline execution (~2-3 minutes)
  ```

- View results:

  ```bash
  head -5 src/outputs/evaluation_scores.csv  # check evaluation scores
  ls src/outputs/Plots/*.png                 # view generated charts
  ```
- `src/data/data_processing.py` extracts text from DOCX files
  - Creates structured CSV files for systematic evaluation
  - Aggregates multiple response fragments into complete responses
- `src/utils/evaluation_algo.py` runs 6 evaluation algorithms
  - Each algorithm uses parameters from `src/commonconst.py`
  - Generates a comprehensive scoring matrix for all chatbots
- `src/outputs/output_processing.py` creates comparative charts
  - Final scores saved in `evaluation_scores.csv`
  - Visual analysis available in the `Plots/` directory
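The three-stage flow just described (process, evaluate, export) can be pictured with a small driver sketch. The function bodies below are stubs invented for illustration; the real entry points live in the modules named above and will differ:

```python
# Hypothetical driver mirroring the pipeline's three stages;
# all function bodies are stubs, not the repository's actual code.

def process_data() -> list[dict]:
    """Stage 1: DOCX -> structured rows (stub)."""
    return [{"chatbot": "ExampleBot", "response": "You are not alone."}]

def evaluate(rows: list[dict]) -> list[dict]:
    """Stage 2: attach metric scores to each row (stub)."""
    return [{**row, "ethical_alignment": 0.75} for row in rows]

def export(scores: list[dict]) -> str:
    """Stage 3: persist results (stub returning a summary)."""
    return f"wrote {len(scores)} rows to evaluation_scores.csv"

def run_pipeline() -> str:
    return export(evaluate(process_data()))

print(run_pipeline())
```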
- Complete User Guide: `src/utils/user_guide.py` (1,300+ lines of implementation guidance)
- Weight Justifications: `src/utils/weights.py` (clinical rationale for all parameters)
- Algorithm Details: `src/utils/evaluation_algo.py` (detailed implementation with comments)
`data_processing.py`: Extracts structured data from `.docx` files and converts chatbot/human responses into clean CSV format for analysis.
Our comprehensive evaluation system assesses each chatbot response across six professional competency dimensions:
| Metric | Range | Function | Clinical Purpose |
|---|---|---|---|
| ROUGE Score | 0–1 | `calculate_average_rouge()` | Lexical overlap with expert responses; ensures coverage of critical topics |
| METEOR Score | 0–1 | `calculate_meteor()` | Semantic similarity with synonym awareness; evaluates empathetic language variation |
| Ethical Alignment | 0–1 | `evaluate_ethical_alignment()` | Rule-based professional competency assessment across 6 components (LGBTQ+ 25%, Crisis 20%, Social Work 20%, etc.) |
| Sentiment Distribution | 0–1 | `evaluate_sentiment_distribution()` | Emotional tone alignment using DistilRoBERTa with therapeutic weighting |
| Inclusivity Score | ≥0 | `evaluate_inclusivity_score()` | LGBTQ+-affirming language with hierarchical scoring (core terms: 4 pts, secondary: 2.5 pts) |
| Complexity Score | ~20–80 | `evaluate_complexity_score()` | Crisis-modified Flesch-Kincaid for accessibility during emotional distress |
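To make the hierarchical-scoring idea behind the inclusivity metric concrete, here is a simplified sketch. The term lists, the length normalization, and the example sentence are assumptions for illustration only; the study's actual vocabulary and parameters live in `src/commonconst.py`:

```python
# Simplified sketch of hierarchical inclusivity scoring.
# CORE_TERMS and SECONDARY_TERMS are illustrative, not the study's real lists.
CORE_TERMS = {"lgbtq", "transgender", "nonbinary"}       # 4 points each
SECONDARY_TERMS = {"identity", "affirming", "pronouns"}  # 2.5 points each

def inclusivity_score(text: str) -> float:
    """Score affirming vocabulary, weighting core terms above secondary ones."""
    words = set(text.lower().split())
    raw = 4.0 * len(words & CORE_TERMS) + 2.5 * len(words & SECONDARY_TERMS)
    # Normalize by length (an assumption here) so long responses are not
    # rewarded simply for padding; result is >= 0 with no fixed upper bound.
    return raw / max(len(text.split()), 1)

score = inclusivity_score("Your identity is valid and we use affirming pronouns")
```

A two-tier vocabulary like this explains why observed inclusivity scores can be exactly 0.00: a response that never uses any listed term earns no points at all.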
- Ethical Alignment: 0.61-0.89 (meaningful professional differentiation)
- Inclusivity: 0.00-0.42 (variable LGBTQ+ competency)
- Sentiment Distribution: 0.04-1.00 (diverse emotional alignment)
- ROUGE/METEOR: 0.19-0.36 (moderate similarity ranges)
- Complexity: 49-61 (appropriate crisis accessibility)
- Visualizations: 6 comparative bar charts generated automatically
- ChatGPT-4
- Claude (Anthropic)
- Gemini (Google)
- LLaMA-3 (Meta)
- DeepSeek
- Mistral
- Perplexity AI
- HuggingChat
- JackAI
- Gender Journey Chatbot Rubies
These platforms were selected for their relevance in AI ethics, mental health, and LGBTQ+ inclusivity, ensuring both high-tech LLMs and community-centric tools are evaluated under equal standards.
| Rank | Chatbot | Ethical Alignment | Key Strengths |
|---|---|---|---|
| 1 | DeepSeek | 0.89 | Exceptional LGBTQ+ competency, comprehensive crisis assessment |
| 2 | Mistral AI | 0.88 | Strong professional practice, good crisis focus |
| 3 | HuggingChat | 0.85 | Solid overall competency, appropriate questioning |
| 11 | Claude | 0.61 | Limited LGBTQ+ focus, basic crisis assessment only |
- Ethical Alignment: 0.61–0.89 → meaningful professional differentiation achieved
- Inclusivity: 0.00–0.42 → significant gaps in LGBTQ+ affirming language
- Sentiment Distribution: 0.04–1.00 → diverse emotional intelligence capabilities
- ROUGE/METEOR: 0.19–0.36 → moderate lexical/semantic similarity to expert responses
- Complexity: 49–61 → appropriate accessibility for crisis communication
- Professional Competency Varies Significantly: a 0.28 spread in ethical alignment scores (0.61–0.89)
- LGBTQ+ Competency Gaps: Most chatbots lack specialized identity-affirming language
- Crisis Assessment Quality: Strong variation in suicide risk assessment capabilities
- Accessibility Consistency: All chatbots maintain appropriate readability for crisis contexts
| Metric | Insight |
|---|---|
| ROUGE / METEOR | High = better alignment with human phrasing. |
| Ethical Alignment | High = more safety-conscious, affirming language. |
| Inclusivity | High = uses LGBTQ+-affirming terms, avoids harm. |
| Sentiment | High = tone matches supportive reference. |
| Complexity | Mid-range ideal; too low = vague, too high = overly complex. |
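The per-metric insights in the table above can be combined into a single comparison number via a weighted sum. The weights below are invented purely for illustration; the study's actual weighting rationale is documented in `src/utils/weights.py`:

```python
# Illustrative composite score; these weights are assumptions, not the paper's.
WEIGHTS = {
    "ethical_alignment": 0.30,
    "inclusivity": 0.25,
    "sentiment": 0.20,
    "rouge_meteor": 0.15,
    "complexity": 0.10,  # mid-range is ideal, so pre-normalize to 0-1 first
}

def composite(scores: dict[str, float]) -> float:
    """Weighted sum over known metrics; inputs assumed pre-normalized to [0, 1]."""
    return sum(WEIGHTS[m] * scores.get(m, 0.0) for m in WEIGHTS)

example = composite({
    "ethical_alignment": 0.89,
    "inclusivity": 0.42,
    "sentiment": 0.80,
    "rouge_meteor": 0.36,
    "complexity": 0.75,
})
```

Note the caveat baked into the complexity weight: because mid-range readability is best, the raw 20-80 Flesch-Kincaid-style score must be mapped to a 0-1 "closeness to ideal" value before it can enter a sum like this.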
- Web Application: AI_Response_Evaluation_Benchmark provides an automated evaluation platform
- Reinforcement Learning: Automated data collection for continuous pipeline improvement
- Real-time Evaluation: Interactive assessment capabilities through the Flask web interface
- Database Integration: Automatic CSV tracking for longitudinal analysis
- Enhanced LGBTQ+ Competency: Specialized training recommendations based on evaluation gaps
- Clinical Integration: Direct implementation in therapeutic settings via web platform
- Multilingual Support: Spanish and other language evaluation capabilities
- Community Collaboration: Open research partnerships through automated benchmark tool
- API Integration: RESTful endpoints for programmatic access to evaluation functions
- Automated Data Collection: Continuous database building for model improvement
- Statistical Analysis: Real-time metrics and evaluation history tracking
- Scalable Architecture: Web-based platform for broader research community access
Zichen Zhao
Email: zz3119@columbia.edu
Research: AI Ethics in Mental Health and AI Technology Studies
Sam Abdella - gAyl BENCHMARK TOOL
Email: sn3136@columbia.edu
Web: http://crmforrealty.com/
Prof. Elwin Wu (elwin.wu@columbia.edu)
Prof. Charles Lea (chl2159@columbia.edu)
License: MIT Academic Research License (see LICENSE)
Usage: Free for academic research, citation required, commercial use restricted
Citation:
```bibtex
@misc{zhao2025chatbot,
  title={Evaluating Trust and Inclusivity: A Machine-Driven Benchmark for Large Language Model Chatbots in LGBTQ+ Suicide Prevention},
  author={Zhao, Zichen},
  year={2025},
  url={https://github.com/ZhaoJackson/Text-Reference-AIChatbot},
  note={Submitted to AI \& Society (Springer Nature). Web application: http://crmforrealty.com/. Implementation: https://github.com/ZhaoJackson/AI_Response_Evaluation_Benchmark}
}
```

Paper Repository: https://github.com/ZhaoJackson/Text-Reference-AIChatbot (this repository)
Web Application: https://github.com/ZhaoJackson/AI_Response_Evaluation_Benchmark
Live Demo: http://crmforrealty.com/
Paper Status: Under review at AI & Society (Springer Nature)