Skip to content

samir650/Book-Paraphrasing-AR-EN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Paraphrasing Arabic and English PDF Books

Overview

This Flask-based web application provides an API endpoint to paraphrase text extracted from PDF files. It supports both Arabic and English languages. The application automatically detects the language of the uploaded PDF, processes the text accordingly, and returns the paraphrased content.

Features

  • Language Detection: Automatically detects if the PDF content is in Arabic or English.
  • Text Extraction: Extracts text from PDF files using pdfplumber.
  • Text Cleaning: Cleans the extracted text by removing URLs, numbers, special characters, and extra spaces.
  • Semantic Chunking: Divides text into semantically coherent chunks for better paraphrasing.
  • Paraphrasing:
    • English: Uses t5-base model for paraphrasing.
    • Arabic: Uses google/mt5-base model for paraphrasing.
  • Arabic Text Handling: Corrects Arabic text direction and reshapes it for proper display.
  • RESTful API: Provides a /paraphrase endpoint to upload PDFs and receive paraphrased text.

Table of Contents

Requirements

  • Python 3.6 or higher
  • CUDA-enabled GPU (recommended for performance)
  • pip package manager

Installation

  1. Clone the Repository

    git clone https://github.com/yourusername/multilingual-pdf-paraphraser.git
    cd multilingual-pdf-paraphraser
  2. Create a Virtual Environment

    It's recommended to use a virtual environment to manage dependencies.

    python -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate
  3. Install Dependencies

    pip install -r requirements.txt
  4. Download NLTK Data

    import nltk
    nltk.download('punkt')
  5. Download Stanza Models

    import stanza
    stanza.download('ar')

Usage

Running the Application

Start the Flask application:

python app.py

The application will run on http://127.0.0.1:5000/ by default.

API Endpoint

POST /paraphrase

  • Description: Accepts a PDF file and returns the paraphrased text.
  • Content-Type: multipart/form-data
  • Form Data:
    • file: The PDF file to be paraphrased.

Example Request using curl:

curl -X POST -F 'file=@/path/to/your/file.pdf' http://127.0.0.1:5000/paraphrase

Example Response:

{
  "summary": "Your paraphrased text here."
}

Error Handling:

  • 400 Bad Request: Missing file or invalid input.
  • 500 Internal Server Error: Error during processing.

Code Structure

  • app.py: Main application file containing the Flask app and all functions.
  • Functions:
    • extract_text_from_pdf(pdf_path): Extracts text from a PDF.
    • fix_arabic_text(text): Fixes the direction and reshapes Arabic text.
    • clean_text(text): Cleans English text.
    • clean_arabic_text(text): Cleans Arabic text.
    • divide_by_semantics_with_length(text, ...): Chunks English text semantically.
    • chunk_arabic_text(text, ...): Chunks Arabic text.
    • paraphrase_chunks_en(chunks, ...): Paraphrases English text chunks.
    • paraphrase_chunks_ar(chunks, ...): Paraphrases Arabic text chunks.
    • generate_txt(summary_text, ...): Generates a text file with the paraphrased content.
    • paraphrase_english(book_text, ...): Pipeline for English paraphrasing.
    • paraphrase_arabic(pdf_path, ...): Pipeline for Arabic paraphrasing.
    • detect_language_and_paraphrase(pdf_path, ...): Detects language and triggers the appropriate pipeline.

Models Used

  • English Paraphrasing:
    • Model: t5-base
    • Library: transformers
  • Arabic Paraphrasing:
    • Model: google/mt5-base
    • Library: transformers
  • Semantic Similarity:
    • Model: sentence-transformers/all-MiniLM-L6-v2
    • Library: sentence-transformers

Notes

  • Performance: Using a GPU is highly recommended for handling large documents and improving paraphrasing speed.
  • Limitations: Paraphrasing quality may vary depending on the complexity of the input text.
  • Customization: Parameters like chunk size and paraphrasing models can be adjusted for better results.

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

About

Paraphrasing Arabic and English Books Pipelines

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors