Work in progress...
The Anime Web Scraper is a script that downloads images of previews of episodes and other contents from official anime websites.
The program is structured by the Download classes, which contains logic to scrape the websites. The name of the Download classes ends with Download.
There are three inheritance levels for the structure of the program:
MainDownload- The season classes (e.g.
Winter2020AnimeDownload), andExternalDownload. - The anime classes (e.g.
TateNoYuushaDownload) inherited from the season classes, and the external download classes (e.g.AniverseMagazineDownload) inherited fromExternalDownload.
All Download classes must be instantiated in order to use the methods in it, unless the method is a class method. In most cases, the classes must be instantiated.
All Download classes inherits from the MainDownload classes which contains attributes and methods used by most Download classes. The class is found in the main_download.py file.
| Name | Type | Description |
|---|---|---|
| folder_name | str | Name of the folder where files are to be saved |
| enabled | bool | If true, the anime classes will show up in search results. Default value is True. |
| image_list | list(dict) | List containing dictionary with image information (url, name). Not to be called directly. |
| keywords | list(str) | Keywords used as filters to search for the anime classes. |
| season | str | The season of the anime when it first started airing. Format: Year-Season, where Season: 1 - Winter, 2 - Spring, 3 - Summer, 4 - Fall. E.g. Winter 2020 is 2020-1) |
| season_name | str | Name of the season. E.g. Winter 2020 |
| title | str | Anime title. Used by the anime classes. |
Main method to run the scraper logic in the class. All anime classes and external download classes must use this method.
Downloads image based on the image URL to the specific filepath without extension. This method also creates a log in the anime class's log folder, as well as download_log.tsv in the out folder.
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL of the image |
| filepath_without_extension | str | Yes | Filepath of the image to be saved to without extension |
| headers | dict | No | Headers used for HTTP GET request. Default value: None |
| to_jpg | bool | No | Converts image downloaded to jpg if the image is of type image\webp. Default value: False |
| is_mocanews | bool | No | Specify this to true if the image is from MocaNews website. Default value: False |
| min_width | int | No | Downloads the image if its width is >= min_width. Default value: None |
Checks if the website has updated by comparing the response size saved. Prints a message if updated and creates a log and a copy of the webpage in the log folder of the class.
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL of the website |
| cache_name | str | No | Name of the log containing the response size. Default value: story |
| headers | dict | No | Headers used for HTTP GET request. Default value: None |
| charset | str | No | Charset used by the website for decoding purposes. Default value: None |
| diff | int | No | Returns true if the difference between the size of the responses is greater than the value. Default value: 0 |
Gets the path of the folder of the anime classes and external download classes where the files are saved. The filepath are based on the folder_name of the class and its parent classes.
For example, TateNoYuushaDownload class inherits from Winter2019AnimeDownload class, which also inherits from MainDownload class:
| Class | folder_name |
|---|---|
| TateNoYuushaDownload | tate-no-yuusha |
| Winter2019AnimeDownload | 2019-1 |
| MainDownload | download |
Executing TateNoYuushaDownload.get_full_path() will return download/2019-1/tate-no-yuusha.
Parse HTTP response of the url provided and returns a BeautifulSoup object.
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL of the website |
| headers | dict | No | Headers used for HTTP GET request. Default value: None |
| decode | bool | No | If true, decode the HTTP response. Default value: False |
Parse HTTP response into JSON object
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL of the website containing JSON |
| headers | dict | No | Headers used for HTTP GET request. Default value: None |