This repository contains four scripts designed for efficient data processing, retrieval, and summarization:
- `yt_scraper.py`: Processes YouTube videos by searching, downloading transcripts, summarizing content, and storing metadata.
- `web_scraper.py`: Extracts content and links from a specified domain, generates summaries, and tracks progress.
- `app.py`: Implements a Retrieval-Augmented Generation (RAG) chatbot using Milvus for context retrieval and a custom API for generating responses.
- `load_db.py`: Processes text files, generates embeddings, and inserts them into a Milvus vector database for efficient retrieval.
## Table of Contents

- YouTube Scraper (`yt_scraper.py`)
- Web Scraper (`web_scraper.py`)
- RAG Chatbot (`app.py`)
- Milvus Data Loader (`load_db.py`)
- How to Run
- Future Enhancements
## YouTube Scraper (`yt_scraper.py`)

**Dependencies:** `sys`, `time`, `threading`, `random`, `os`, `json`, `dotenv`, `requests`, `youtube_transcript_api`, `youtube_search`, `googleapiclient.discovery`

**Configuration** (a loading sketch follows this list):

- `YOUTUBE_API_KEY`: Set in the `.env` file for YouTube API authentication.
- `OUTPUT_DIR`: Directory for storing text data.
- `SUMMARY_DIR`: Directory for saving summaries.
- `CACHE_FILE`: File for tracking processed video IDs.
- `QUEUE_FILE`: File for managing queued tasks.
- `CONTEXT_WINDOW`: Maximum context size for summarization chunks.
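A minimal sketch of how these settings might be loaded with `python-dotenv`; the variable names match the list above, while the default values (and the choice to read everything from the environment) are only illustrative:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from the .env file into the environment

YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "data/text_data")
SUMMARY_DIR = os.getenv("SUMMARY_DIR", "data/summaries")
CACHE_FILE = os.getenv("CACHE_FILE", "data/processed_videos.json")
QUEUE_FILE = os.getenv("QUEUE_FILE", "data/queue.json")
CONTEXT_WINDOW = int(os.getenv("CONTEXT_WINDOW", "8000"))  # illustrative default
```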
**Directory Setup and Structure:**

```python
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(SUMMARY_DIR, exist_ok=True)
```

```
data/
├── text_data/               # Stores text files with video details and transcripts.
├── summaries/               # Stores generated summaries.
├── processed_videos.json    # Tracks processed video IDs.
├── queue.json               # Tracks the current processing queue.
```
**Key Functions** (a transcript and chunking sketch follows this list):

- `print_console_stats()`: Displays live statistics in the console.
- `load_cache()` / `save_cache()`: Manage the processed-video cache.
- `load_queue()` / `save_queue()`: Manage the processing queue.
- `log_error()`: Logs errors to a rotating list.
- `download_transcript(video_id)`: Fetches English transcripts.
- `get_video_details(video_id)`: Retrieves video metadata.
- `save_as_text(video_id, video_details, transcript, output_dir)`: Saves metadata and transcripts to disk.
- `search_and_download_videos(query, output_dir, max_videos, cached_video_ids, queue)`: Searches for videos and downloads their data.
- `process_queue(queue)`: Processes queued files, splits text into chunks, summarizes them, and saves the summaries.
- `split_into_chunks(text, chunk_size, overlap=500)`: Splits text into overlapping chunks.
- `summarize(text)`: Summarizes content using a custom API.
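For orientation, here is a rough sketch of what `download_transcript()` and `split_into_chunks()` could look like. It assumes the classic `youtube_transcript_api` interface and is not the exact implementation:

```python
from youtube_transcript_api import YouTubeTranscriptApi

def download_transcript(video_id):
    """Fetch the English transcript segments and join them into one string."""
    # Older youtube_transcript_api releases expose this classmethod;
    # newer releases use an instance-based API instead.
    segments = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
    return " ".join(segment["text"] for segment in segments)

def split_into_chunks(text, chunk_size, overlap=500):
    """Split text into chunks of roughly chunk_size characters with overlap."""
    step = max(chunk_size - overlap, 1)  # guard against non-positive steps
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```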
**Workflow** (a minimal orchestration sketch follows this list):

- Reads topics from `input.txt`.
- Initializes the caches and the processing queue.
- Launches background threads:
  - Queue processor: processes tasks in the queue.
  - Statistics display: updates the console stats.
- Iteratively:
  - Searches for videos matching each topic.
  - Downloads metadata and transcripts.
  - Queues files for summarization.
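Putting those pieces together, the main loop might look roughly like this; the `max_videos` value and the sleep interval are illustrative, not taken from the script:

```python
import threading
import time

def main():
    cache = load_cache()   # previously processed video IDs
    queue = load_queue()   # files still waiting to be summarized

    # Background workers: one drains the queue, one refreshes the console stats.
    threading.Thread(target=process_queue, args=(queue,), daemon=True).start()
    threading.Thread(target=print_console_stats, daemon=True).start()

    with open("input.txt") as f:
        topics = [line.strip() for line in f if line.strip()]

    while True:
        for topic in topics:
            search_and_download_videos(topic, OUTPUT_DIR, max_videos=5,
                                       cached_video_ids=cache, queue=queue)
        time.sleep(60)  # pause before the next pass over the topics
```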
## Web Scraper (`web_scraper.py`)

**Dependencies:** `threading`, `requests`, `bs4`, `os`, `time`, `json`, `sys`, `re`, `dotenv`

**Configuration:**

- `BASE_URL`: Root URL for scraping.
- `UNWANTED_PATH_SEGMENTS`: Path segments to exclude.
- `OUTPUT_DIR`: Directory for raw text.
- `SUMMARY_DIR`: Directory for summaries.
- `CACHE_FILE`: File for visited URLs.
- `QUEUE_FILE`: File for task management.
- `CONTEXT_WINDOW`: Character limit for content chunks.
**Directory Structure:**

```
data/website/
├── text_data/               # Stores scraped content.
├── summaries/               # Stores generated summaries.
├── processed_videos.json    # Tracks visited URLs.
├── queue.json               # Tracks the current processing queue.
```
**Key Functions** (a link-handling sketch follows this list):

- `is_same_domain(link, base_domain)`: Checks whether a link belongs to the same domain.
- `normalize_url(url)`: Removes fragments from a URL.
- `filter_links_by_segments(links, base_domain, unwanted_segments)`: Filters out unwanted links.
- `load_cache()` / `save_cache()`: Manage the visited-URL cache.
- `load_queue()` / `save_queue()`: Manage the processing queue.
- `log_error()`: Logs errors.
- `get_all_links_and_text(url, base_domain)`: Fetches page content and links.
- `scrape_domain(base_url, output_dir, queue, max_pages=100, unwanted_segments=None, cache=None)`: Recursively scrapes pages.
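A plausible sketch of the link-handling helpers using `requests` and `BeautifulSoup`; the real implementation may differ in details such as headers, timeouts, and error handling:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag

def normalize_url(url):
    """Drop the #fragment so the same page is not visited twice."""
    return urldefrag(url)[0]

def is_same_domain(link, base_domain):
    """Keep only links whose host matches the domain being scraped."""
    return urlparse(link).netloc in ("", base_domain)

def get_all_links_and_text(url, base_domain):
    """Fetch a page, return its visible text and same-domain links."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    links = {
        normalize_url(urljoin(url, a["href"]))
        for a in soup.find_all("a", href=True)
    }
    return text, [link for link in links if is_same_domain(link, base_domain)]
```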
**Workflow** (a crawl-loop sketch follows this list):

- Loads configuration from `.env`.
- Initializes the caches and the queue.
- Launches background threads for queue processing and statistics.
- Calls `scrape_domain()` to:
  - Extract links and content.
  - Save content and queue summarization tasks.
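A rough outline of how `scrape_domain()` could tie the helpers above together as a breadth-first crawl. It reuses the functions from the previous sketch, and `save_page_text()` is a hypothetical stand-in for whatever the script uses to write content to `OUTPUT_DIR`:

```python
import requests
from urllib.parse import urlparse

def scrape_domain(base_url, output_dir, queue, max_pages=100,
                  unwanted_segments=None, cache=None):
    """Breadth-first crawl of base_url, bounded by max_pages."""
    cache = cache if cache is not None else set()
    base_domain = urlparse(base_url).netloc
    frontier = [normalize_url(base_url)]   # helpers from the sketch above

    while frontier and len(cache) < max_pages:
        url = frontier.pop(0)
        if url in cache:
            continue
        cache.add(url)
        try:
            text, links = get_all_links_and_text(url, base_domain)
        except requests.RequestException as exc:
            log_error(f"failed to fetch {url}: {exc}")
            continue
        save_page_text(url, text, output_dir)   # hypothetical persistence helper
        queue.append(url)                       # hand the page off for summarization
        frontier.extend(filter_links_by_segments(links, base_domain,
                                                 unwanted_segments or []))
```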
## RAG Chatbot (`app.py`)

**Dependencies:** `pymilvus`, `requests`, `tkinter`, `json`

**Components:**

- `Retriever`: Fetches context from Milvus.
- `Generator`: Generates responses using a custom API.
- `ChatbotUI`: Interactive chatbot GUI.
**Milvus Configuration** (a connection sketch follows this list):

- Host: `127.0.0.1`
- Port: `19530`
- Collection: `embedded_texts`
- Dimension: `4096`
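Connecting to that instance with `pymilvus` is straightforward; a minimal sketch, assuming the collection already exists:

```python
from pymilvus import connections, Collection

# Connect to the local Milvus instance described above.
connections.connect(alias="default", host="127.0.0.1", port="19530")

collection = Collection("embedded_texts")  # stores 4096-dimensional embeddings
collection.load()                          # load into memory before searching
```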
**API Endpoints** (a request sketch follows this list):

- Embedding API: `http://127.0.0.1:11434/api/embed`
- Generation API: `http://127.0.0.1:11434/api/generate`
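These endpoints follow an Ollama-style HTTP API; a hedged sketch of how the chatbot might call them with `requests` (the model name is a placeholder, not taken from the script):

```python
import requests

EMBED_URL = "http://127.0.0.1:11434/api/embed"
GENERATE_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3"  # hypothetical model name; use whatever the server actually hosts

def embed(text):
    # Ollama-style /api/embed: returns {"embeddings": [[...]]}
    resp = requests.post(EMBED_URL, json={"model": MODEL, "input": text})
    resp.raise_for_status()
    return resp.json()["embeddings"][0]

def generate(prompt):
    # Ollama-style /api/generate with streaming disabled.
    resp = requests.post(GENERATE_URL,
                         json={"model": MODEL, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"]
```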
**Workflow** (a retrieve-then-generate sketch follows this list):

- User Input: The user enters a query.
- Context Retrieval:
  - The query is embedded and sent to Milvus.
  - Relevant texts are retrieved.
- Response Generation:
  - A prompt is constructed from the retrieved context.
  - The prompt is sent to the Generation API.
- Display Response: The answer is shown in the GUI.
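Combined, the retrieve-then-generate step could look roughly like the following. The field names `embedding` and `text` are assumptions about the collection schema, and `embed()`/`generate()` are the helpers sketched above:

```python
def answer(query, collection, top_k=3):
    """Embed the query, pull context from Milvus, and generate a reply."""
    query_vec = embed(query)
    hits = collection.search(
        data=[query_vec],
        anns_field="embedding",            # assumed vector field name
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["text"],            # assumed field holding the chunk text
    )
    context = "\n\n".join(hit.entity.get("text") for hit in hits[0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```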
## Milvus Data Loader (`load_db.py`)

**Dependencies:** `pymilvus`, `requests`, `os`, `tqdm`

**Components:**

- `MilvusHandler`: Manages Milvus connections and data insertion.
- `TextEmbeddingProcessor`: Generates embeddings and splits text.
- `DataLoader`: Handles file operations.
- `EmbeddingPipeline`: Orchestrates data loading and insertion.
**Milvus Configuration** (a schema sketch follows this list):

- Host: `127.0.0.1`
- Port: `19530`
- Collection: `embedded_texts`
- Dimension: `4096`
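A sketch of how `MilvusHandler` might define the collection. The field names and `VARCHAR` length are illustrative; only the collection name and the vector dimension come from the configuration above:

```python
from pymilvus import (CollectionSchema, FieldSchema, DataType,
                      Collection, connections, utility)

connections.connect(host="127.0.0.1", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=4096),
]
schema = CollectionSchema(fields, description="Embedded text chunks")

if utility.has_collection("embedded_texts"):
    collection = Collection("embedded_texts")
else:
    collection = Collection("embedded_texts", schema)
```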
**API Endpoint:**

- Embedding API: `http://127.0.0.1:11434/api/embed`
**Usage:**

- Place text files in `./data`.

**Workflow** (an insertion and indexing sketch follows this list):

- Load the text files.
- Split the text into chunks.
- Generate embeddings via the Embedding API.
- Insert the data into Milvus.
- Create an index for efficient retrieval.
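The insertion and indexing steps might boil down to something like this; the column order, index type, and parameters are assumptions rather than the script's actual choices:

```python
def insert_chunks(collection, chunks, embeddings):
    """Insert text chunks with their embeddings, then index the vector field."""
    collection.insert([chunks, embeddings])  # column order matches the schema above
    collection.flush()
    collection.create_index(
        field_name="embedding",
        index_params={"index_type": "IVF_FLAT",
                      "metric_type": "L2",
                      "params": {"nlist": 128}},
    )
```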
## How to Run

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start the required services (e.g., Milvus).
- Configure the `.env` files.
- Run the desired script:

  ```bash
  python yt_scraper.py
  python web_scraper.py
  python app.py
  python load_db.py
  ```
## Future Enhancements

- Error Recovery: Retry failed API calls.
- Scalability: Batch processing for large datasets.
- Multi-Collection Support: Handle multiple datasets dynamically.
- Improved Indexing: Support advanced index types for Milvus.