Overview

This project builds a book recommendation system using data collected from multiple sources. The goal is to demonstrate an end-to-end data analytics workflow, including web scraping, API integration, data cleaning, exploratory analysis, and deployment.
The final system allows users to explore books and receive recommendations through an interactive web application.
A comprehensive book recommendation app that combines literal and semantic search with TF-IDF and SBERT recommendation engines.
🔗 Streamlit App
Photo credit: https://www.pexels.com/de-de/foto/gestapelte-bucher-1333742/
🎥 Project Presentation
- Search A (Literal): Direct keyword matching across all book fields
- Search B (Semantic): SBERT-based semantic understanding
- Combined System: Literal results first, then semantic results
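The combined ordering described above (literal hits first, then semantic-only hits) can be sketched as follows. `semantic_rank` is a hypothetical stand-in for the SBERT ranking step, and the function name is illustrative:

```python
def combined_search(query, titles, semantic_rank):
    """Return indices of matching books: exact keyword hits first,
    followed by semantic matches not already found literally."""
    q = query.lower()
    # Literal pass: direct substring match on the book text
    literal = [i for i, text in enumerate(titles) if q in text.lower()]
    # Semantic pass: SBERT-ranked indices, minus the literal hits
    semantic = [i for i in semantic_rank(query) if i not in literal]
    return literal + semantic
```

Deduplicating against the literal results keeps each book from appearing twice while preserving the "literal first" ordering.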
- Recommendation A (TF-IDF): Weighted feature similarity (Author 3x, Title 2x, Subjects 2x, Language 1x)
- Recommendation B (SBERT): Semantic similarity understanding
- Combined System: TF-IDF results first, then SBERT recommendations
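One way the weighted TF-IDF similarity might work is to repeat each field in the document according to its weight before vectorizing, so heavier fields contribute more terms. This is a minimal sketch using the weights listed above; the column names and function names are assumptions, not the project's actual API:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Field weights from the README: Author 3x, Title 2x, Subjects 2x, Language 1x
WEIGHTS = {"author": 3, "title": 2, "subjects": 2, "language": 1}

def weighted_text(row):
    # Repeat each field's text by its weight so TF-IDF counts it more heavily
    return " ".join(" ".join([str(row[f])] * w) for f, w in WEIGHTS.items())

def tfidf_recommend(books: pd.DataFrame, book_idx: int, top_n: int = 5):
    corpus = books.apply(weighted_text, axis=1)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    # Cosine similarity between the selected book and every other book
    sims = cosine_similarity(matrix[book_idx], matrix).ravel()
    order = sims.argsort()[::-1]
    return [i for i in order if i != book_idx][:top_n]
```

Repeating tokens is a simple way to encode field weights; an alternative is vectorizing each field separately and taking a weighted sum of the per-field similarities.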
- Main Page: Search interface with filters
- Search Results: Combined search results with book selection
- Book Details + Recommendations: Selected book with personalized recommendations
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the app:

  ```bash
  streamlit run app.py
  ```

The app expects the book dataset at one of these locations:

- `../../data/clean/books_merged_clean.csv`
- `../data/clean/books_merged_clean.csv`
- `data/clean/books_merged_clean.csv`
- `books_merged_clean.csv`
On first run, the app will generate SBERT embeddings and save them as book_embeddings.npy for faster subsequent startups.
- Search: Type any query (title, author, topic, keyword)
- Filter: Use language and year filters in sidebar
- Explore: Click on books to see detailed recommendations
- Navigate: Use sidebar buttons to switch between views
- Frontend: Streamlit web interface
- Search: Combined literal + semantic search
- Recommendations: TF-IDF + SBERT hybrid approach
- Caching: Efficient model and data loading
- Error Handling: Robust error messages and fallbacks
| Dataset | Source | Purpose |
|---|---|---|
| openlibrary | https://openlibrary.org/subjects/awards | Core data on books and their awards |
- Data collection (scraping + API)
- Data cleaning & deduplication
- Exploratory analysis
- Content-based recommendation logic
- Deployment with Streamlit
- Add user-rating or popularity data
- Implement similarity using text descriptions (NLP)
- Improve genre standardisation
- Expand dataset beyond 1000 books
- Deploy using a cloud hosting platform