This Flask-based web application provides an API endpoint to paraphrase text extracted from PDF files. It supports both Arabic and English languages. The application automatically detects the language of the uploaded PDF, processes the text accordingly, and returns the paraphrased content.
- Language Detection: Automatically detects if the PDF content is in Arabic or English.
- Text Extraction: Extracts text from PDF files using `pdfplumber`.
- Text Cleaning: Cleans the extracted text by removing URLs, numbers, special characters, and extra spaces.
- Semantic Chunking: Divides text into semantically coherent chunks for better paraphrasing.
- Paraphrasing:
  - English: Uses the `t5-base` model for paraphrasing.
  - Arabic: Uses the `google/mt5-base` model for paraphrasing.
- Arabic Text Handling: Corrects Arabic text direction and reshapes it for proper display.
- RESTful API: Provides a `/paraphrase` endpoint to upload PDFs and receive paraphrased text.
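
The Text Cleaning feature can be pictured with a small sketch. This is not the code from `app.py`: the function name `clean_text_sketch`, the regular expressions, and the choice to keep basic punctuation are all assumptions made for illustration. It removes URLs, numbers, and special characters, then collapses extra spaces, while keeping both Latin and Arabic letters.

```python
import re

# Hypothetical patterns; the real app.py may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep Latin letters, the basic Arabic Unicode block, whitespace, and simple punctuation.
SPECIAL_RE = re.compile(r"[^A-Za-z\u0600-\u06FF\s.,!?]")


def clean_text_sketch(text: str) -> str:
    """Remove URLs, digits, and special characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)      # drop URLs first so their digits and slashes go with them
    text = re.sub(r"\d+", " ", text)  # drop numbers (Latin and Arabic-Indic digits)
    text = SPECIAL_RE.sub(" ", text)  # drop remaining special characters
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_text_sketch("Visit https://example.com now!  Chapter 12: Intro #1")` yields `"Visit now! Chapter Intro"`.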
- Python 3.6 or higher
- CUDA-enabled GPU (recommended for performance)
- pip package manager
- Clone the Repository

      git clone https://github.com/yourusername/multilingual-pdf-paraphraser.git
      cd multilingual-pdf-paraphraser

- Create a Virtual Environment

  It's recommended to use a virtual environment to manage dependencies.

      python -m venv venv
      source venv/bin/activate  # On Windows use: venv\Scripts\activate

- Install Dependencies

      pip install -r requirements.txt

- Download NLTK Data

      import nltk
      nltk.download('punkt')

- Download Stanza Models

      import stanza
      stanza.download('ar')
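
After the installation steps, a quick check like the one below can confirm that the main libraries are importable before the app is started. It is only a sketch: the helper name `missing_packages` is hypothetical, and the package list is inferred from the libraries this README mentions, not read from `requirements.txt`, so it may be incomplete.

```python
from importlib.util import find_spec
from typing import List

# Import names inferred from this README (an assumption, not the real requirements.txt).
EXPECTED = ["flask", "pdfplumber", "nltk", "stanza", "transformers", "sentence_transformers"]


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(EXPECTED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All expected packages are importable.")
```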
Start the Flask application:

    python app.py

The application will run on http://127.0.0.1:5000/ by default.
- Description: Accepts a PDF file and returns the paraphrased text.
- Content-Type: `multipart/form-data`
- Form Data:
  - `file`: The PDF file to be paraphrased.
Example Request using curl:
    curl -X POST -F 'file=@/path/to/your/file.pdf' http://127.0.0.1:5000/paraphrase

Example Response:

    {
      "summary": "Your paraphrased text here."
    }

Error Handling:
- 400 Bad Request: Missing file or invalid input.
- 500 Internal Server Error: Error during processing.
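
The same request can be made from Python. The sketch below uses only the standard library so it runs without extra dependencies; the helper names `build_multipart` and `paraphrase_pdf` are hypothetical, and it assumes the server is running at the default address and returns the `summary` field shown above.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def paraphrase_pdf(pdf_path: str, url: str = "http://127.0.0.1:5000/paraphrase") -> str:
    """Upload a PDF to the /paraphrase endpoint and return the paraphrased text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for the 400 and 500 responses listed above.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["summary"]
```

Calling `paraphrase_pdf("/path/to/your/file.pdf")` returns the paraphrased text on success; wrapping the call in `try`/`except urllib.error.HTTPError` gives access to the error status code.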
- app.py: Main application file containing the Flask app and all functions.
- Functions:
  - `extract_text_from_pdf(pdf_path)`: Extracts text from a PDF.
  - `fix_arabic_text(text)`: Fixes the direction and reshapes Arabic text.
  - `clean_text(text)`: Cleans English text.
  - `clean_arabic_text(text)`: Cleans Arabic text.
  - `divide_by_semantics_with_length(text, ...)`: Chunks English text semantically.
  - `chunk_arabic_text(text, ...)`: Chunks Arabic text.
  - `paraphrase_chunks_en(chunks, ...)`: Paraphrases English text chunks.
  - `paraphrase_chunks_ar(chunks, ...)`: Paraphrases Arabic text chunks.
  - `generate_txt(summary_text, ...)`: Generates a text file with the paraphrased content.
  - `paraphrase_english(book_text, ...)`: Pipeline for English paraphrasing.
  - `paraphrase_arabic(pdf_path, ...)`: Pipeline for Arabic paraphrasing.
  - `detect_language_and_paraphrase(pdf_path, ...)`: Detects language and triggers the appropriate pipeline.
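
This README does not say how `detect_language_and_paraphrase` decides between Arabic and English. One simple approach, shown here purely as an assumption, is to compare the number of Arabic-block characters with the number of Latin letters in the extracted text. The function name `detect_language` is hypothetical.

```python
import string


def detect_language(text: str) -> str:
    """Return "ar" if Arabic letters outnumber Latin letters, otherwise "en"."""
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")  # basic Arabic Unicode block
    latin = sum(1 for ch in text if ch in string.ascii_letters)
    # Ties (including empty text) fall back to English in this sketch.
    return "ar" if arabic > latin else "en"
```

The returned code could then select the matching pipeline, for example `paraphrase_arabic` for `"ar"` and `paraphrase_english` for `"en"`.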
- English Paraphrasing:
  - Model: `t5-base`
  - Library: `transformers`
- Arabic Paraphrasing:
  - Model: `google/mt5-base`
  - Library: `transformers`
- Semantic Similarity:
  - Model: `sentence-transformers/all-MiniLM-L6-v2`
  - Library: `sentence-transformers`
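
To illustrate how a semantic similarity model can drive chunking, here is a minimal sketch. It is not the implementation of `divide_by_semantics_with_length`: the function names, the threshold, and the word limit are assumptions, and a toy bag-of-words embedding stands in for `all-MiniLM-L6-v2` so the example runs without downloading a model. A sentence joins the current chunk only if it is similar enough to the previous sentence and the chunk stays under the word limit.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(sentence: str) -> Vector:
    """Toy embedding (word counts); a sentence-embedding model would replace this."""
    return dict(Counter(re.findall(r"\w+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk_by_similarity(
    sentences: List[str],
    embed: Callable[[str], Vector] = bag_of_words,
    threshold: float = 0.2,  # hypothetical similarity cutoff
    max_words: int = 100,    # hypothetical chunk size limit
) -> List[str]:
    """Group consecutive sentences into chunks by similarity and length."""
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current:
            similar = cosine(embed(current[-1]), embed(sentence)) >= threshold
            words = sum(len(s.split()) for s in current) + len(sentence.split())
            if not (similar and words <= max_words):
                chunks.append(" ".join(current))  # close the chunk on a topic shift or overflow
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the toy embedding, `chunk_by_similarity(["Cats chase mice.", "Mice fear cats.", "Stocks fell sharply today."])` returns two chunks, splitting before the unrelated third sentence.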
- Performance: Using a GPU is highly recommended for handling large documents and improving paraphrasing speed.
- Limitations: Paraphrasing quality may vary depending on the complexity of the input text.
- Customization: Parameters like chunk size and paraphrasing models can be adjusted for better results.
Contributions are welcome! Please open an issue or submit a pull request on GitHub.