Detailed information at DeepWiki
A multilingual fact-checking system powered by CrewAI, designed to verify statements using NLP techniques.
- Multilingual Support: Fact-check and translate content across multiple languages
- Semantic Search: Embedding-based search for relevant information
- Confidence Scoring: Confidence calculation for fact verification
- Flexible Fact-Checking: Supports both Wikipedia and internet-based fact-checking
```
NLP_Fact_checker/
├── agents/                           # AI agent definitions
│   ├── fact_verifier_agent.py        # Handles core logic for fact validation
│   ├── input_analyser_agent.py       # Processes and interprets input queries
│   ├── internet_searcher_agent.py    # Performs web-based information retrieval
│   ├── meta_searcher_agent.py        # Searches an article's contents based on the article name
│   ├── searcher_agent.py             # General-purpose information search
│   ├── summarizer_agent.py           # Generates concise content summaries
│   └── translator_agent.py           # Handles multilingual translation
│
├── corpus/                           # Embeddings and document storage
│   ├── embeddings/                   # Pre-computed vector embeddings
│   └── documents/                    # Source documents and reference materials
│
├── crews/                            # Agent collaboration configurations
│   ├── fact_checker_crew.py          # Coordinates Wikipedia-based fact verification
│   ├── generic_translation_crew.py   # Handles generic text translation
│   ├── input_analyzer_crew.py        # Processes and analyzes input queries
│   ├── internet_fact_checker_crew.py # Coordinates internet-based fact-checking
│   ├── meta_search_crew.py           # Manages metadata and source searching
│   └── translation_crew.py           # Handles translations with structured output
│
├── flows/                            # Workflow management
│   ├── fact_checker_flow.py          # Wikipedia-based fact-checking workflow
│   ├── internet_fact_checker_flow.py # Internet-based fact-checking workflow
│   └── get_summarized_source_flow.py # Source content summarization
│
├── tasks/                            # Specific task implementations
│   ├── fact_verification_task.py     # Core fact-checking task
│   ├── input_analysis_task.py        # Input query processing
│   ├── internet_search_task.py       # Web-based information retrieval
│   ├── metadata_search_task.py       # Metadata and source information search
│   ├── search_task.py                # General search functionality
│   ├── summarize_task.py             # Content summarization
│   └── translation_task.py           # Language translation implementation
│
├── tools/                            # Search and utility tools
│   ├── search_manager.py             # Singleton vector store management
│   └── search_tools.py               # RAG and metadata search capabilities
│
├── utils/                            # Utility modules
│   └── embeddings.py                 # Singleton embeddings management
│
├── web/                              # Web interface components
│   ├── components/                   # Reusable UI components
│   ├── static/                       # Static assets (CSS, images)
│   └── templates/                    # HTML templates
│
├── .env                              # Environment configuration
├── requirements.txt                  # Project dependencies
└── main.py                           # Application entry point
```
- agents/: Contains specialized AI agents responsible for specific tasks
  - Each agent is designed with a single-responsibility principle
  - Modular design allows easy extension and modification
- corpus/: Stores knowledge base and computational resources
  - embeddings/: Pre-computed vector representations for fast semantic search
  - documents/: Reference materials and source documents
- crews/: Defines collaborative workflows for complex tasks
  - Orchestrates multiple agents to achieve comprehensive goals
  - Implements different fact-checking and translation strategies
- flows/: Manages end-to-end workflows
  - Defines the sequence of operations for different fact-checking scenarios
  - Handles state management and inter-agent communication
  - Enables the integration of AI-driven and traditional programming approaches
- tasks/: Granular task implementations
  - Breaks down complex operations into manageable, focused tasks
  - Supports modular and reusable task design
- tools/: Provides utility functions and search capabilities
  - Implements singleton patterns for resource management
  - Offers advanced search and embedding techniques
- utils/: Contains core utility modules
  - Provides singleton embedding management
  - Ensures consistent resource initialization
- web/: Web interface components
  - Supports a potential web-based frontend
  - Separates UI concerns from core logic
Our specialized AI agents handle different aspects of fact-checking:
- fact_verifier_agent.py: Validates and cross-references facts
- input_analyser_agent.py: Processes and interprets input queries
- internet_searcher_agent.py: Performs web-based information retrieval
- meta_searcher_agent.py: Searches metadata and source information
- searcher_agent.py: General-purpose information search
- summarizer_agent.py: Generates concise summaries
- translator_agent.py: Handles language translation
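For orientation, here is a minimal sketch of how one of these agents could be defined with CrewAI's Agent class; the role, goal, and backstory strings are illustrative and are not the repository's actual prompts.

```python
# Illustrative only: a CrewAI agent definition in the style of fact_verifier_agent.py
from crewai import Agent

fact_verifier = Agent(
    role="Fact Verifier",
    goal="Decide whether a statement is supported by the retrieved evidence",
    backstory="A meticulous analyst who only accepts claims backed by sources.",
    verbose=True,
)
```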
Collaborative agent teams that coordinate complex tasks:
- fact_checker_crew.py: Manages the fact verification workflow
- generic_translation_crew.py: Handles generic text translation
- input_analyzer_crew.py: Processes and analyzes input queries
- internet_fact_checker_crew.py: Coordinates internet-based fact-checking
- meta_search_crew.py: Manages metadata and source searching
- translation_crew.py: Specialized translation coordination
Specific task implementations for different workflow stages:
- fact_verification_task.py: Core fact-checking logic
- input_analysis_task.py: Input query processing
- internet_search_task.py: Web-based information retrieval
- metadata_search_task.py: Metadata and source information search
- search_task.py: General search functionality
- summarize_task.py: Content summarization
- translation_task.py: Language translation implementation
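To show how tasks and crews fit together, below is a minimal, self-contained sketch of a task bound to an agent and a crew coordinating them, assuming the standard CrewAI API; every name and string is illustrative rather than the repository's actual definition.

```python
from crewai import Agent, Task, Crew, Process

verifier = Agent(
    role="Fact Verifier",
    goal="Judge whether the statement is supported by the gathered evidence",
    backstory="A careful analyst who cites evidence for every verdict.",
)

verification_task = Task(
    description="Verify the following statement: {statement}",
    expected_output="A verdict (True / False / Unverifiable) with supporting evidence.",
    agent=verifier,
)

crew = Crew(
    agents=[verifier],
    tasks=[verification_task],
    process=Process.sequential,   # run tasks one after another
)

# Requires OPENAI_API_KEY to be set; {statement} is filled from the inputs dict
result = crew.kickoff(inputs={"statement": "The Eiffel Tower is located in Paris"})
```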
Utility functions and search management:
- search_manager.py: Manages vector store and embedding resources
- search_tools.py: Provides RAG and metadata search capabilities
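As a rough illustration of the singleton resource management mentioned above, the sketch below loads the embedding model and FAISS index only once per process; the class, path, and attribute names are assumptions, not the actual contents of search_manager.py.

```python
import faiss
from sentence_transformers import SentenceTransformer

class SearchManager:
    """Illustrative singleton: expensive resources are created exactly once."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model = SentenceTransformer("all-MiniLM-L6-v2")
            cls._instance.index = faiss.read_index("corpus/embeddings/index.faiss")
        return cls._instance

# Every caller receives the same instance, so the index is loaded only once
assert SearchManager() is SearchManager()
```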
Workflow management for different fact-checking scenarios:
- fact_checker_flow.py: Wikipedia-based fact-checking workflow
- get_summarized_source_flow.py: Source content summarization
- internet_fact_checker_flow.py: Internet-based fact-checking workflow
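For context, here is a minimal sketch of what such a workflow might look like with CrewAI's Flow API; the step names and returned values are illustrative and do not reproduce the repository's actual flows.

```python
from crewai.flow.flow import Flow, listen, start

class FactCheckerFlow(Flow):
    @start()
    def analyze_input(self):
        # First step: produce the statement to be checked (illustrative)
        return "The Eiffel Tower is located in Paris"

    @listen(analyze_input)
    def verify(self, statement):
        # Later steps would hand the statement to a fact-checking crew
        return {"statement": statement, "verdict": "pending"}

result = FactCheckerFlow().kickoff()
```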
The corpus/ directory is a critical component of the fact-checking system, responsible for managing knowledge sources and embeddings:
```
corpus/
├── zipped/                  # Original compressed source files
├── unzipped/                # Extracted source documents
├── embeddings/              # Generated vector embeddings
├── create_embeddings.py     # Script to process and vectorize documents
└── unzip_files.py           # Utility to extract compressed files
```
To create your own corpus, follow the steps below.
The corpus currently supports Wikipedia XML dumps as source documents. These dumps can be downloaded from the official Wikimedia dumps site (https://dumps.wikimedia.org).
The corpus must meet the following requirements:
- The corpus must be a Wikipedia XML dump, i.e. files that contain article text, such as enwiki-20241001-pages-articles1.xml-p1p41242.bz2
- Currently, only English-language sources are supported as the corpus
```
# Place any number of .bz2 files containing the XML dumps in corpus/zipped/
python corpus/unzip_files.py         # Extract compressed files
python corpus/create_embeddings.py   # Generate embeddings
```

The embedding script will:
- Extract text from source documents
- Clean and normalize text
- Split into manageable chunks
- Generate vector embeddings
- Store in FAISS vector database
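For a concrete sense of this pipeline, the sketch below chunks a document, embeds the chunks, and stores them in a FAISS index, assuming sentence-transformers and FAISS; the chunk size, file paths, and function names are illustrative and may differ from what create_embeddings.py actually does.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500) -> list[str]:
    """Split cleaned article text into fixed-size character chunks (illustrative)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative path: one extracted and cleaned article
article_text = open("corpus/unzipped/example_article.txt", encoding="utf-8").read()
chunks = chunk(article_text)

# Normalized embeddings make the inner product equal to cosine similarity
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))
faiss.write_index(index, "corpus/embeddings/index.faiss")
```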
The .env file contains essential configuration for the application:
```
# OpenAI API Configuration
OPENAI_API_KEY="your_openai_api_key"    # Required for AI-powered fact-checking
OPENAI_MODEL_NAME="gpt-4o-mini"         # Specify the OpenAI model to use

# Serper API Configuration (for web searches)
SERPER_API_KEY="your_serper_api_key"    # Optional: API key for web search functionality
```
- OpenAI API Key:
  - Mandatory for AI-powered fact-checking
  - Obtain from the OpenAI Platform
  - Ensure the key has appropriate permissions for text generation and analysis
- OpenAI Model Selection:
  - Currently uses gpt-4o-mini, which provides a balance of performance and cost-effectiveness
  - Can be changed to other compatible OpenAI models as needed
- Serper API Key:
  - Used for web-based searches
  - Can be obtained from Serper.dev if web search is required
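For reference, here is a minimal sketch of how the application might read these settings at start-up, assuming python-dotenv; the variable names simply mirror the .env example above.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME", "gpt-4o-mini")
SERPER_API_KEY = os.getenv("SERPER_API_KEY")  # optional: only needed for web search

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
```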
- Uses the all-MiniLM-L6-v2 multilingual embedding model
- FAISS vector store for efficient semantic search
- Calculates semantic similarity between query and retrieved fragments
- Uses cosine similarity to measure relevance
- Confidence is the maximum similarity score between query and fragments
- Ranges from 0.0 (no match) to 1.0 (perfect match)
- Provides a simple, interpretable confidence metric
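The confidence metric described above can be sketched in a few lines, assuming sentence-transformers with all-MiniLM-L6-v2; the function name and example strings are illustrative, not the repository's actual code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def confidence(query: str, fragments: list[str]) -> float:
    """Maximum cosine similarity between the query and any retrieved fragment."""
    # normalize_embeddings=True makes the dot product equal to cosine similarity
    q = model.encode([query], normalize_embeddings=True)
    f = model.encode(fragments, normalize_embeddings=True)
    return float(np.max(f @ q.T))

score = confidence(
    "The Eiffel Tower is in Paris",
    ["The Eiffel Tower is located on the Champ de Mars in Paris, France."],
)
print(round(score, 2))  # close to 1.0 for a strong match
```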
- Automatic language detection
- Translation of queries and results
- Supports fact-checking in multiple languages
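One possible way to implement the automatic language detection step is sketched below using the langdetect package; the repository may rely on a different detector or on the LLM itself.

```python
from langdetect import detect

def detect_language(text: str) -> str:
    """Return an ISO 639-1 language code such as 'en', 'es', or 'fr'."""
    return detect(text)

print(detect_language("La Tour Eiffel se trouve à Paris."))  # -> 'fr'
```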
- On Windows, the project requires Microsoft Visual C++ 14.0 from the C++ Build Tools. If the compiler is installed after Python, Python must be reinstalled.
- Python 3.10 (recommended) or 3.11; versions 3.12 and later are not supported
```
git clone https://github.com/TheSOV/NLP_Fact_checker.git
cd NLP_Fact_checker
pip install -r requirements.txt
python main.py
```

License: MIT