This is a screenplay parser that extracts dialogues between characters. However, it only extracts a dialogue if the second character has a parenthetical. The scripts are crawled from http://www.imsdb.com/.
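To make the extraction rule above concrete, here is a minimal sketch of one way to read it. It is not the code from this repository: the names `CHARACTER_RE`, `parse_speeches` and `exchanges_with_parenthetical` are hypothetical, and it assumes a plain-text layout where a speech is an uppercase character cue, an optional parenthetical line, then dialogue lines, with blank lines between speeches.

```python
import re

# Assumption: a character cue is a line made only of capitals, spaces,
# dots, apostrophes and hyphens. Real scripts are messier than this.
CHARACTER_RE = re.compile(r"^[A-Z][A-Z .'\-]+$")


def parse_speeches(script_text):
    """Split plain script text into speeches (hypothetical helper)."""
    speeches = []
    current = None
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the speech in progress. Cues with no
            # dialogue (e.g. scene headings) are dropped here.
            if current is not None and current["lines"]:
                speeches.append(current)
            current = None
            continue
        if current is None:
            if CHARACTER_RE.match(line):
                current = {"character": line, "parenthetical": None, "lines": []}
            continue
        if line.startswith("(") and line.endswith(")") and not current["lines"]:
            # Parenthetical directly under the character cue.
            current["parenthetical"] = line[1:-1]
        else:
            current["lines"].append(line)
    if current is not None and current["lines"]:
        speeches.append(current)
    return speeches


def exchanges_with_parenthetical(speeches):
    """Keep consecutive speech pairs whose second speaker has a parenthetical."""
    pairs = []
    for first, second in zip(speeches, speeches[1:]):
        if first["character"] != second["character"] and second["parenthetical"]:
            pairs.append((first, second))
    return pairs


SAMPLE = """
INT. KITCHEN - DAY

ALICE
Did you eat the last cookie?

BOB
(nervously)
No. Maybe. Yes.

ALICE
I knew it.
"""

if __name__ == "__main__":
    for first, second in exchanges_with_parenthetical(parse_speeches(SAMPLE)):
        print(f"{first['character']} -> {second['character']} ({second['parenthetical']})")
```

On the made-up sample, only the ALICE/BOB exchange is kept, because BOB's reply carries the parenthetical `(nervously)`; the later BOB/ALICE exchange is skipped.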
- Create a new environment.
- Clone the repository.
- Install the dependencies:
    pip install -r requirements.txt
- Run Scrapy: go to the brickset-scraper folder and run this in your terminal:
    scrapy runspider scraper.py --output=data/names_links.json
  This will generate data/names_links.json.
- Run the JSON parser:
    python json_parser.py data/names_links.json
  This will read names_links.json and create all_name_script.txt. This new text file has a movie name and a link to its script for each movie in the JSON file. Note that each script takes 1-2 seconds.
- Run the HTML parser:
    python html_list_parser.py
  This will read all_name_script.txt and generate all_dialogues.txt. This file has all the relevant dialogues from the movie scripts.
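The three commands above have to run in order, since each one reads the file the previous one wrote. A minimal sketch of chaining them from Python follows. The commands are copied from the steps; `PIPELINE` and `run_pipeline` are hypothetical names that are not part of this repository, and the sketch assumes all three commands are run from the same folder.

```python
import subprocess
import sys

# Commands copied from the steps above, in the order they must run.
PIPELINE = [
    ["scrapy", "runspider", "scraper.py", "--output=data/names_links.json"],
    [sys.executable, "json_parser.py", "data/names_links.json"],
    [sys.executable, "html_list_parser.py"],
]


def run_pipeline(workdir=".", runner=subprocess.run):
    """Run every step inside workdir, stopping at the first failure.

    Assumption: all three scripts live in the same folder (workdir).
    The runner parameter exists only so the function can be tried out
    without actually launching the commands.
    """
    for command in PIPELINE:
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if a step exits non-zero.
        runner(command, cwd=workdir, check=True)
```

Calling `run_pipeline("brickset-scraper")` from the repository root would then run the crawl and both parsers in sequence, provided the folder layout matches the assumption above.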
You need to have
- BeautifulSoup
- Scrapy
- Python 3 or above
- Jupyter Notebook
Kamil Veli Toraman: kvtoraman
There is no licence for now. You can use it as you please. This code uses a rule-based algorithm to parse movie scripts. If you know a better way, please let me know :)
- This is the result of a 2-month internship at the Data Science Lab, KAIST.