
Evaluating Trust and Inclusivity: A Machine-Driven Benchmark for Large Language Model Chatbots in LGBTQ+ Suicide Prevention

📄 Research Paper Repository - Submitted to AI & Society (Springer Nature)


๐ŸŒ Web Application and Pipeline Development

🔗 Interactive Benchmark Tool: http://crmforrealty.com/

We are developing the gAyl BENCHMARK TOOL, a web application that extends this research for broader accessibility:

  • โš–๏ธ Ethical Analysis: Interactive LGBTQ+ inclusivity assessment
  • ๐ŸŒˆ Inclusivity Metrics: Real-time diversity evaluation
  • ๐Ÿ“Š Text Complexity: Dynamic readability analysis
  • ๐Ÿ’ญ Sentiment Analysis: Advanced emotional tone evaluation

🚀 Web Application Repository:

https://github.com/ZhaoJackson/AI_Response_Evaluation_Benchmark

This companion repository contains the automated web application benchmark that implements our evaluation pipeline with:

  • ๐ŸŒ Flask Web Interface: Interactive evaluation platform
  • ๐Ÿ”Œ REST API Endpoints: Programmatic access to evaluation functions
  • ๐Ÿ’พ Automatic Database: CSV-based data collection and tracking
  • ๐Ÿ“Š Real-time Statistics: Live evaluation metrics and history
  • ๐Ÿ”„ Reinforcement Learning Integration: Automated data collection for model improvement
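
As an illustration of the programmatic access listed above, a client call might look like the sketch below. The endpoint path, payload fields, and response keys are hypothetical assumptions, not the confirmed API contract; consult the AI_Response_Evaluation_Benchmark repository for the actual interface.

    # Hypothetical client for the evaluation API. The endpoint path, payload
    # schema, and response keys are illustrative assumptions only.
    import requests

    payload = {
        "chatbot_response": "You're not alone; many LGBTQ+ people have felt this way.",
        "reference_response": "I hear how much pain you're in, and your identity is valid.",
    }
    resp = requests.post("http://crmforrealty.com/api/evaluate", json=payload, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP errors
    print(resp.json())       # e.g. per-metric scores (ethical alignment, inclusivity, ...)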

The web application serves as a practical implementation of our research pipeline, enabling:

  • Live Chatbot Testing: Real-time evaluation of AI responses
  • Data Collection: Automated database building for future research
  • Community Access: Broader accessibility to evaluation tools
  • Pipeline Improvement: Continuous enhancement based on usage data

For collaboration and web application access: Contact Zichen Zhao (zz3119@columbia.edu)


📋 Paper Overview

This repository contains the complete implementation and evaluation system for our research paper examining AI chatbot effectiveness in LGBTQ+ mental health and suicide prevention contexts. The study compares AI-generated responses to expert-crafted human references across six comprehensive metrics: lexical overlap, semantic similarity, ethical alignment, emotional tone, cultural inclusivity, and communication accessibility. Our goal is to ensure AI chatbots provide supportive, unbiased, and ethically sound assistance for vulnerable LGBTQ+ populations in crisis situations.

Motivation

As mental health chatbots are increasingly deployed in healthcare, it is vital that they respond with sensitivity, particularly toward vulnerable populations such as LGBTQ+ individuals. This project evaluates AI responses in critical mental health scenarios to identify where responsiveness and empathy can improve, highlighting gaps in chatbot response quality to drive advances in AI support for LGBTQ+ mental health.

Evaluation Pipeline Flowchart

[Figure: evaluation pipeline flowchart]

๐Ÿ—๏ธ Repository Structure and Workflow Guide

Complete File Structure:

Text-Reference-AIChatbot/
├── main.py                          # 🚀 Main execution script - START HERE
├── requirements.txt                 # 📦 Python dependencies
├── LICENSE                          # 📜 Academic research license
├── README.md                        # 📖 This overview document
├── .gitignore                       # 🔒 Git ignore configuration
│
├── src/
│   ├── commonconst.py              # ⚙️ System constants (214 parameters)
│   │
│   ├── data/                       # 📁 Input data and processing
│   │   ├── data_processing.py      # 🔄 DOCX → CSV conversion
│   │   ├── Test Reference Text.docx # 👤 Human expert responses
│   │   └── Test Chatbot text.docx  # 🤖 11 AI chatbot responses
│   │
│   ├── utils/                      # 🧮 Core evaluation system
│   │   ├── evaluation_algo.py      # 📊 6 evaluation algorithms
│   │   ├── weights.py              # ⚖️ Weight justification (703 lines)
│   │   └── user_guide.py           # 📚 Complete implementation guide
│   │
│   └── outputs/                    # 📈 Generated results
│       ├── processed_*.csv         # 🔄 Structured data files
│       ├── evaluation_scores.csv   # 🎯 Final evaluation results
│       └── Plots/                  # 📊 6 visualization charts
│           ├── ethical_alignment_score.png
│           ├── inclusivity_score.png
│           ├── sentiment_distribution_score.png
│           └── [3 more charts]

🚀 Quick Start Workflow (5 Minutes):

  1. 📥 Clone and Setup:

    git clone https://github.com/ZhaoJackson/Text-Reference-AIChatbot.git
    cd Text-Reference-AIChatbot
    python -m venv venv && source venv/bin/activate  # Create virtual environment
  2. 📦 Install Dependencies:

    pip install -r requirements.txt
    python -c "import nltk; nltk.download('punkt'); nltk.download('cmudict')"
  3. ๐Ÿƒ Run Evaluation:

    python main.py  # Complete pipeline execution (~2-3 minutes)
  4. 📊 View Results:

    # Check evaluation scores
    head -5 src/outputs/evaluation_scores.csv
    
    # View generated charts
    ls src/outputs/Plots/*.png

📚 Understanding the Workflow:

Phase 1: Data Processing

  • src/data/data_processing.py extracts text from DOCX files (see the sketch after this list)
  • Creates structured CSV files for systematic evaluation
  • Aggregates multiple response fragments into complete responses
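
A minimal sketch of that extraction step, assuming the python-docx package; the output filename and one-column CSV layout are illustrative, and src/data/data_processing.py remains the authoritative implementation:

    # Sketch: pull non-empty paragraphs from a DOCX file and write them to CSV.
    # The output filename and column name below are assumptions for illustration.
    import csv
    from docx import Document

    doc = Document("src/data/Test Chatbot text.docx")
    paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]

    with open("src/outputs/processed_chatbot_text.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["response_fragment"])
        writer.writerows([p] for p in paragraphs)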

Phase 2: Evaluation Pipeline

  • src/utils/evaluation_algo.py runs 6 evaluation algorithms (driver loop sketched after this list)
  • Each algorithm uses parameters from src/commonconst.py
  • Generates comprehensive scoring matrix for all chatbots
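
Conceptually, Phase 2 reduces to a driver loop like the one below. The six function names come from evaluation_algo.py (see the metric table under Methodology), but their exact signatures and the column names are assumptions made for this sketch:

    # Hypothetical driver: score each chatbot response against the expert
    # reference and assemble the matrix behind evaluation_scores.csv.
    # The imported functions' signatures are assumed, not confirmed.
    import pandas as pd
    from src.utils.evaluation_algo import (
        calculate_average_rouge, calculate_meteor, evaluate_ethical_alignment,
        evaluate_sentiment_distribution, evaluate_inclusivity_score,
        evaluate_complexity_score,
    )

    def score_all(chatbot_responses: dict[str, str], reference: str) -> pd.DataFrame:
        rows = []
        for bot, response in chatbot_responses.items():
            rows.append({
                "Chatbot": bot,
                "ROUGE": calculate_average_rouge(response, reference),
                "METEOR": calculate_meteor(response, reference),
                "Ethical Alignment": evaluate_ethical_alignment(response),
                "Sentiment Distribution": evaluate_sentiment_distribution(response, reference),
                "Inclusivity": evaluate_inclusivity_score(response),
                "Complexity": evaluate_complexity_score(response),
            })
        return pd.DataFrame(rows)

    # score_all(responses, reference).to_csv("src/outputs/evaluation_scores.csv", index=False)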

Phase 3: Results and Visualization

  • src/outputs/output_processing.py creates comparative charts (sketched after this list)
  • Final scores saved in evaluation_scores.csv
  • Visual analysis available in Plots/ directory
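
The charting step can be as simple as the sketch below, assuming matplotlib and the score-matrix schema from the Phase 2 sketch; output_processing.py is the authoritative implementation:

    # Sketch: one comparative bar chart per metric, saved into Plots/.
    # Column names follow the hypothetical Phase 2 schema above.
    import pandas as pd
    import matplotlib.pyplot as plt

    scores = pd.read_csv("src/outputs/evaluation_scores.csv")
    plt.figure(figsize=(10, 5))
    plt.bar(scores["Chatbot"], scores["Ethical Alignment"])
    plt.ylabel("Ethical alignment score (0-1)")
    plt.title("Ethical Alignment by Chatbot")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig("src/outputs/Plots/ethical_alignment_score.png", dpi=200)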

๐Ÿ” Deep Dive Resources:

  • 📖 Complete User Guide: src/utils/user_guide.py (1,300+ lines of implementation guidance)
  • ⚖️ Weight Justifications: src/utils/weights.py (Clinical rationale for all parameters)
  • 🧮 Algorithm Details: src/utils/evaluation_algo.py (Detailed implementation with comments)

Methodology

1. Data Preprocessing

  • data_processing.py: Extracts structured data from .docx files and converts chatbot/human responses into clean CSV format for analysis.

2. Six-Metric Evaluation System (in evaluation_algo.py)

Our comprehensive evaluation system assesses each chatbot response across six professional competency dimensions:

| Metric | Range | Function | Clinical Purpose |
| --- | --- | --- | --- |
| ROUGE Score | 0–1 | calculate_average_rouge() | Lexical overlap with expert responses; ensures coverage of critical topics |
| METEOR Score | 0–1 | calculate_meteor() | Semantic similarity with synonym awareness; evaluates empathetic language variation |
| Ethical Alignment | 0–1 | evaluate_ethical_alignment() | Rule-based professional competency assessment across 6 components (LGBTQ+ 25%, Crisis 20%, Social Work 20%, etc.) |
| Sentiment Distribution | 0–1 | evaluate_sentiment_distribution() | Emotional tone alignment using DistilRoBERTa with therapeutic weighting |
| Inclusivity Score | ≥ 0 | evaluate_inclusivity_score() | LGBTQ+-affirming language with hierarchical scoring (Core: 4 pts, Secondary: 2.5 pts) |
| Complexity Score | ~20–80 | evaluate_complexity_score() | Crisis-modified Flesch-Kincaid for accessibility during emotional distress |
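
To make the table concrete, here is a hedged sketch of how five of the six metrics can be computed with off-the-shelf libraries. The library choices (rouge-score, nltk, transformers, textstat), the specific DistilRoBERTa checkpoint, and the inclusivity term lists are assumptions for illustration; only the function names, score ranges, and the 4/2.5-point hierarchy come from this repository, and evaluation_algo.py remains the authoritative implementation.

    # Sketch of five metrics using common libraries; configuration details
    # in evaluation_algo.py may differ from what is shown here.
    import nltk
    import textstat
    from nltk.translate.meteor_score import meteor_score
    from rouge_score import rouge_scorer
    from transformers import pipeline

    nltk.download("wordnet", quiet=True)  # METEOR's synonym matching needs WordNet

    reference = "I hear how much pain you are carrying, and your feelings are valid."
    candidate = "It sounds like you are in real pain, and what you feel matters."

    # ROUGE (0-1): lexical n-gram overlap with the expert reference.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_f1 = [s.fmeasure for s in scorer.score(reference, candidate).values()]
    print("avg ROUGE F1:", sum(rouge_f1) / len(rouge_f1))

    # METEOR (0-1): unigram alignment with stemming and WordNet synonyms.
    # Whitespace tokenization keeps the sketch dependency-light.
    print("METEOR:", meteor_score([reference.split()], candidate.split()))

    # Sentiment: emotion distribution from a DistilRoBERTa classifier.
    # This public checkpoint is an assumption; the paper's model may differ.
    emotion = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base",
                       top_k=None)
    print(emotion(candidate))

    # Complexity (~20-80): plain Flesch Reading Ease as a stand-in for the
    # crisis-modified variant used in the paper.
    print("Flesch Reading Ease:", textstat.flesch_reading_ease(candidate))

    # Inclusivity (>= 0): hierarchical keyword scoring -- core terms 4 pts,
    # secondary terms 2.5 pts (per the table); term lists are illustrative.
    CORE = {"lgbtq", "queer", "transgender"}
    SECONDARY = {"ally", "chosen family", "affirming"}
    def inclusivity_score(text: str) -> float:
        t = text.lower()
        return 4.0 * sum(w in t for w in CORE) + 2.5 * sum(w in t for w in SECONDARY)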

🎯 Expected Results After Running python main.py:

  • Ethical Alignment: 0.61-0.89 (meaningful professional differentiation)
  • Inclusivity: 0.00-0.42 (variable LGBTQ+ competency)
  • Sentiment Distribution: 0.04-1.00 (diverse emotional alignment)
  • ROUGE/METEOR: 0.19-0.36 (moderate similarity ranges)
  • Complexity: 49-61 (appropriate crisis accessibility)
  • Visualizations: 6 comparative bar charts generated automatically

Chatbots Evaluated

General-Purpose LLMs:

  • ChatGPT-4
  • Claude (Anthropic)
  • Gemini (Google)
  • LLaMA-3 (Meta)
  • DeepSeek
  • Mistral
  • Perplexity AI
  • HuggingChat

LGBTQ+-Specific Chatbots:

  • JackAI
  • Gender Journey Chatbot Rubies

These platforms were selected for their relevance to AI ethics, mental health, and LGBTQ+ inclusivity, ensuring that both high-tech LLMs and community-centric tools are evaluated under equal standards.


📈 Key Research Findings

Top Performers by Professional Competency:

| Rank | Chatbot | Ethical Alignment | Key Strengths |
| --- | --- | --- | --- |
| 1 | DeepSeek | 0.89 | Exceptional LGBTQ+ competency, comprehensive crisis assessment |
| 2 | Mistral AI | 0.88 | Strong professional practice, good crisis focus |
| 3 | HuggingChat | 0.85 | Solid overall competency, appropriate questioning |
| 11 | Claude | 0.61 | Limited LGBTQ+ focus, basic crisis assessment only |

Metric Range Analysis:

  • Ethical Alignment: 0.61–0.89 → Meaningful professional differentiation achieved
  • Inclusivity: 0.00–0.42 → Significant gaps in LGBTQ+ affirming language
  • Sentiment Distribution: 0.04–1.00 → Diverse emotional intelligence capabilities
  • ROUGE/METEOR: 0.19–0.36 → Moderate lexical/semantic similarity to expert responses
  • Complexity: 49–61 → Appropriate accessibility for crisis communication

Critical Observations:

  • Professional Competency Varies Significantly: a 0.28 spread in ethical alignment scores (0.61–0.89)
  • LGBTQ+ Competency Gaps: Most chatbots lack specialized identity-affirming language
  • Crisis Assessment Quality: Strong variation in suicide risk assessment capabilities
  • Accessibility Consistency: All chatbots maintain appropriate readability for crisis contexts

Results Interpretation

| Metric | Insight |
| --- | --- |
| ROUGE / METEOR | High = better alignment with human phrasing. |
| Ethical Alignment | High = more safety-conscious, affirming language. |
| Inclusivity | High = uses LGBTQ+-affirming terms, avoids harm. |
| Sentiment | High = tone matches supportive reference. |
| Complexity | Mid-range ideal; too low = vague, too high = overly complex. |

🚀 Future Research and Development

Active Development:

  • ๐ŸŒ Web Application: AI_Response_Evaluation_Benchmark provides automated evaluation platform
  • ๐Ÿ”„ Reinforcement Learning: Automated data collection for continuous pipeline improvement
  • ๐Ÿ“Š Real-time Evaluation: Interactive assessment capabilities through Flask web interface
  • ๐Ÿ’พ Database Integration: Automatic CSV tracking for longitudinal analysis

Research Pipeline Enhancement:

  • Enhanced LGBTQ+ Competency: Specialized training recommendations based on evaluation gaps
  • Clinical Integration: Direct implementation in therapeutic settings via web platform
  • Multilingual Support: Spanish and other language evaluation capabilities
  • Community Collaboration: Open research partnerships through automated benchmark tool

Technical Innovation:

  • API Integration: RESTful endpoints for programmatic access to evaluation functions
  • Automated Data Collection: Continuous database building for model improvement
  • Statistical Analysis: Real-time metrics and evaluation history tracking
  • Scalable Architecture: Web-based platform for broader research community access

📞 Contact and Collaboration

Lead Researcher:

Zichen Zhao
📧 zz3119@columbia.edu
🔬 AI Ethics in Mental Health and AI Technology Studies

Web Application Development:

Sam Abdella - gAyl BENCHMARK TOOL
📧 sn3136@columbia.edu
🌐 http://crmforrealty.com/

Faculty Supervision:

Prof. Elwin Wu (elwin.wu@columbia.edu)
Prof. Charles Lea (chl2159@columbia.edu)


📜 License and Citation

License: MIT Academic Research License (see LICENSE)
Usage: Free for academic research, citation required, commercial use restricted

Citation:

@misc{zhao2025chatbot,
  title={Evaluating Trust and Inclusivity: A Machine-Driven Benchmark for Large Language Model Chatbots in LGBTQ+ Suicide Prevention},
  author={Zhao, Zichen},
  year={2025},
  url={https://github.com/ZhaoJackson/Text-Reference-AIChatbot},
  note={Submitted to AI \& Society (Springer Nature). Web application: http://crmforrealty.com/. Implementation: https://github.com/ZhaoJackson/AI_Response_Evaluation_Benchmark}
}

🔗 Related Repositories and Resources

📄 Paper Repository: https://github.com/ZhaoJackson/Text-Reference-AIChatbot (This repository)
🌐 Web Application: https://github.com/ZhaoJackson/AI_Response_Evaluation_Benchmark
🔗 Live Demo: http://crmforrealty.com/
📄 Paper Status: Under review at AI & Society (Springer Nature)
