Commit d629bbb

Merge branch 'f/api'
1 parent e14d890 commit d629bbb

11 files changed

+471
-289
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -164,3 +164,6 @@ cython_debug/
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
 .qodo
+
+# custom
+test.py

README.md

Lines changed: 43 additions & 13 deletions
@@ -38,12 +38,50 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 <br>
 <br>

-## Arguments
+## import
+
+You can import pywaybackup into your own scripts and run it. Args are the same as cli.
+
+Additional args:
+- `silent` (default True): If True, suppresses all output to the console.
+- `debug` (default False): If True, disables writing errors to the error log file.
+
+```python
+from pywaybackup import PyWayBackup
+
+backup = PyWayBackup(
+    url="https://example.com",
+    all=True,
+    start="20200101",
+    end="20201231",
+    silent=False,
+    debug=True,
+    log=True,
+    keep=True
+)
+
+backup.run()
+backup_paths = backup.paths(rel=True)
+print(backup_paths)
+```
+output:
+```bash
+{
+    'snapshots': 'output/example.com',
+    'cdxfile': 'output/waybackup_example.cdx',
+    'dbfile': 'output/waybackup_example.com.db',
+    'csvfile': 'output/waybackup_https.example.com.csv',
+    'log': 'output/waybackup_example.com.log',
+    'debug': 'output/waybackup_error.log'
+}
+```
+
+## cli

 - `-h`, `--help`: Show the help message and exit.
 - `-v`, `--version`: Show information about the tool and exit.

-### Required
+#### Required

 - **`-u`**, **`--url`**:<br>
 The URL of the web page to download. This argument is required.
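
The README example added in this hunk prints the dictionary returned by `backup.paths(rel=True)`. As a minimal sketch of how a calling script might consume such a dictionary, the snippet below checks which of the reported locations exist on disk. `existing_paths` is a hypothetical helper, not part of pywaybackup, and the sample dictionary is copied from the README output rather than produced by a real run.

```python
from pathlib import Path

# Sample copied from the README output; a real run would obtain it from
# backup.paths(rel=True).
backup_paths = {
    "snapshots": "output/example.com",
    "cdxfile": "output/waybackup_example.cdx",
    "dbfile": "output/waybackup_example.com.db",
    "csvfile": "output/waybackup_https.example.com.csv",
    "log": "output/waybackup_example.com.log",
    "debug": "output/waybackup_error.log",
}


def existing_paths(paths: dict, base: Path = Path(".")) -> dict:
    """Map each key to whether its path exists under base.

    Entries with an empty or None value are skipped. That a disabled log
    would be reported this way is an assumption, not confirmed by the commit.
    """
    return {key: (base / value).exists() for key, value in paths.items() if value}


for key, found in existing_paths(backup_paths).items():
    print(f"{key}: {'found' if found else 'missing'}")
```

Keeping the check separate from the download step means it can also be run against the paths of an earlier job.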
@@ -68,8 +106,8 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.

 - **Range Selection:**<br>
-Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
-(year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
+Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range`, the `start` and `end` will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
+(year 2019, year+month+day 20190101, year+month+day+hour 2019010112)

 - **`-r`**, **`--range`**:<br>
 Specify the range in years for which to search and download snapshots.
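
The Range Selection text in this hunk describes timestamps as a left-anchored prefix of `YYYYMMDDhhmmss`. A rough sketch of that shape check is shown below. `is_partial_timestamp` is a hypothetical helper written for illustration, and the tool's own validation may differ. It only checks the digit layout, not whether the month or day is a real calendar value. Because the CLI declares `--start` and `--end` with `type=int`, a caller would pass `str(value)`.

```python
import re

# Hypothetical helper for illustration; pywaybackup's own validation may differ.
# Accepts a 4-digit year followed by zero to five 2-digit fields
# (month, day, hour, minute, second), i.e. a left-prefix of YYYYMMDDhhmmss.
_PARTIAL_TIMESTAMP = re.compile(r"[0-9]{4}(?:[0-9]{2}){0,5}")


def is_partial_timestamp(value: str) -> bool:
    """Return True if value is a year or a longer left-prefix of YYYYMMDDhhmmss."""
    return _PARTIAL_TIMESTAMP.fullmatch(value) is not None


for candidate in ("2019", "20190101", "2019010112", "20190", "2019-01"):
    print(candidate, is_partial_timestamp(candidate))
```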
@@ -105,9 +143,6 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`--verbose`**:<br>
 Increase output verbosity.

-<!-- - **`--verbosity`** `<level>`:<br>
-Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
-
 - **`--log`** <!-- `<path>` -->:<br>
 Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.

@@ -126,9 +161,6 @@ Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
 - **`--delay`** `<seconds>`:<br>
 Specifies delay between download requests in seconds. Default is no delay (0).

-<!-- - **`--convert-links`**:<br>
-If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
-
 #### Job Handling:

 - **`--reset`**:
@@ -226,9 +258,7 @@ your/path/waybackup_snapshots/

 ### CSV

-Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.
-
-For download queries:
+The CSV contains a snapshot per row:

 ```
 [

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ dependencies = [
 ]

 [project.scripts]
-waybackup = "pywaybackup.main:main"
+waybackup = "pywaybackup.main:cli"

 [project.urls]
 homepage = "https://github.com/bitdruid/python-wayback-machine-downloader"

pywaybackup/Arguments.py

Lines changed: 38 additions & 138 deletions
@@ -1,157 +1,57 @@
-
 import sys
-import os
 import argparse

 from argparse import RawTextHelpFormatter

 from importlib.metadata import version

-from pywaybackup.helper import url_split, sanitize_filename

 class Arguments:
-
     def __init__(self):
         parser = argparse.ArgumentParser(
             description=f"<<< python-wayback-machine-downloader v{version('pywaybackup')} >>>\nby @bitdruid -> https://github.com/bitdruid",
             formatter_class=RawTextHelpFormatter,
         )

-        required = parser.add_argument_group('required (one exclusive)')
-        required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
+        required = parser.add_argument_group("required (one exclusive)")
+        required.add_argument("-u", "--url", type=str, metavar="", help="url (with subdir/subdomain) to download")
         exclusive_required = required.add_mutually_exclusive_group(required=True)
-        exclusive_required.add_argument('-a', '--all', action='store_true', help='download snapshots of all timestamps')
-        exclusive_required.add_argument('-l', '--last', action='store_true', help='download the last version of each file snapshot')
-        exclusive_required.add_argument('-f', '--first', action='store_true', help='download the first version of each file snapshot')
-        exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')
-
-        optional = parser.add_argument_group('optional query parameters')
-        optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url')
-        optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
-        optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
-        optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
-        optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
-        optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (js,css,...)')
-        optional.add_argument('--statuscode', type=str, metavar="", help='statuscodes to download comma separated (200,404,...)')
-
-        behavior = parser.add_argument_group('manipulate behavior')
-        behavior.add_argument('-o', '--output', type=str, metavar="", help='output for all files - defaults to current directory')
-        behavior.add_argument('-m', '--metadata', type=str, metavar="", help='change directory for db/cdx/csv/log files')
-        behavior.add_argument('-v', '--verbose', action='store_true', help='overwritten by progress - gives detailed output')
-        behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
-        behavior.add_argument('--progress', action='store_true', help='show a progress bar')
-        behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
-        behavior.add_argument('--retry', type=int, default=0, metavar="", help='retry failed downloads (opt tries as int, else infinite)')
-        behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
-        # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
-        behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
+        exclusive_required.add_argument("-a", "--all", action="store_true", help="download snapshots of all timestamps")
+        exclusive_required.add_argument("-l", "--last", action="store_true", help="download the last version of each file snapshot")
+        exclusive_required.add_argument("-f", "--first", action="store_true", help="download the first version of each file snapshot")
+        exclusive_required.add_argument("-s", "--save", action="store_true", help="save a page to the wayback machine")
+
+        optional = parser.add_argument_group("optional query parameters")
+        optional.add_argument("-e", "--explicit", action="store_true", help="search only for the explicit given url")
+        optional.add_argument("-r", "--range", type=int, metavar="", help="range in years to search")
+        optional.add_argument("--start", type=int, metavar="", help="start timestamp format: YYYYMMDDhhmmss")
+        optional.add_argument("--end", type=int, metavar="", help="end timestamp format: YYYYMMDDhhmmss")
+        optional.add_argument("--limit", type=int, nargs="?", const=True, metavar="int", help="limit the number of snapshots to download")
+        optional.add_argument("--filetype", type=str, metavar="", help="filetypes to download comma separated (js,css,...)")
+        optional.add_argument("--statuscode", type=str, metavar="", help="statuscodes to download comma separated (200,404,...)")
+
+        behavior = parser.add_argument_group("manipulate behavior")
+        behavior.add_argument("-o", "--output", type=str, metavar="", help="output for all files - defaults to current directory")
+        behavior.add_argument("-m", "--metadata", type=str, metavar="", help="change directory for db/cdx/csv/log files")
+        behavior.add_argument("-v", "--verbose", action="store_true", help="overwritten by progress - gives detailed output")
+        behavior.add_argument("--log", action="store_true", help="save a log file into the output folder")
+        behavior.add_argument("--progress", action="store_true", help="show a progress bar")
+        behavior.add_argument("--no-redirect", action="store_true", help="do not follow redirects by archive.org")
+        behavior.add_argument("--retry", type=int, default=0, metavar="", help="retry failed downloads (opt tries as int, else infinite)")
+        behavior.add_argument("--workers", type=int, default=1, metavar="", help="number of workers (simultaneous downloads)")
+        behavior.add_argument("--delay", type=int, default=0, metavar="", help="delay between each download in seconds")
+
+        special = parser.add_argument_group("special")
+        special.add_argument("--reset", action="store_true", help="reset the job and ignore existing cdx/db/csv files")
+        special.add_argument("--keep", action="store_true", help="keep all files after the job finished")
+
+        args = parser.parse_args(args=None if sys.argv[1:] else ["--help"])  # if no arguments are given, print help
+
+        args.silent = False
+        args.debug = True

-        special = parser.add_argument_group('special')
-        special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
-        special.add_argument('--keep', action='store_true', help='keep all files after the job finished')
-
-        args = parser.parse_args(args=None if sys.argv[1:] else ['--help'])  # if no arguments are given, print help
-
-        required_args = {action.dest: getattr(args, action.dest) for action in exclusive_required._group_actions}
-        optional_args = {action.dest: getattr(args, action.dest) for action in optional._group_actions}
-        args.query_identifier = str(args.url) + str(required_args) + str(optional_args)
-
-        # if args.convert_links and not args.current:
-        # parser.error("--convert-links can only be used with the -c/--current option")
-
         self.args = args
-
-    def get_args(self):
-        return self.args
-
-class Configuration:
-
-    # def __init__(self):
-    # self.args = Arguments().get_args()
-    # for key, value in vars(self.args).items():
-    # setattr(Configuration, key, value)
-
-    # self.set_config()
-
-    # def set_config(self):
-    # # args now attributes of Configuration // Configuration.output, ...
-    # self.command = ' '.join(sys.argv[1:])
-    # self.domain, self.subdir, self.filename = url_split(self.url)
-
-    # if self.output is None:
-    # self.output = os.path.join(os.getcwd(), "waybackup_snapshots")
-    # if self.metadata is None:
-    # self.metadata = self.output
-    # os.makedirs(self.output, exist_ok=True) if not self.save else None
-    # os.makedirs(self.metadata, exist_ok=True) if not self.save else None
-
-    # if self.all:
-    # self.mode = "all"
-    # if self.last:
-    # self.mode = "last"
-    # if self.first:
-    # self.mode = "first"
-    # if self.save:
-    # self.mode = "save"
-
-    # if self.filetype:
-    # self.filetype = [f.lower().strip() for f in self.filetype.split(",")]
-    # if self.statuscode:
-    # self.statuscode = [s.lower().strip() for s in self.statuscode.split(",")]
-
-    # base_path = self.metadata
-    # base_name = f"waybackup_{sanitize_filename(self.url)}"
-    # self.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
-    # self.dbfile = os.path.join(base_path, f"{base_name}.db")
-    # self.csvfile = os.path.join(base_path, f"{base_name}.csv")
-    # self.log = os.path.join(base_path, f"{base_name}.log") if self.log else None
-
-    # if self.reset:
-    # os.remove(self.cdxfile) if os.path.isfile(self.cdxfile) else None
-    # os.remove(self.dbfile) if os.path.isfile(self.dbfile) else None
-    # os.remove(self.csvfile) if os.path.isfile(self.csvfile) else None
-
-
-    @classmethod
-    def init(cls):
-
-        cls.args = Arguments().get_args()
-        for key, value in vars(cls.args).items():
-            setattr(Configuration, key, value)
-
-        # args now attributes of Configuration // Configuration.output, ...
-        cls.command = ' '.join(sys.argv[1:])
-        cls.domain, cls.subdir, cls.filename = url_split(cls.url)
-
-        if cls.output is None:
-            cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
-        if cls.metadata is None:
-            cls.metadata = cls.output
-        os.makedirs(cls.output, exist_ok=True) if not cls.save else None
-        os.makedirs(cls.metadata, exist_ok=True) if not cls.save else None
-
-        if cls.all:
-            cls.mode = "all"
-        if cls.last:
-            cls.mode = "last"
-        if cls.first:
-            cls.mode = "first"
-        if cls.save:
-            cls.mode = "save"
-
-        if cls.filetype:
-            cls.filetype = [f.lower().strip() for f in cls.filetype.split(",")]
-        if cls.statuscode:
-            cls.statuscode = [s.lower().strip() for s in cls.statuscode.split(",")]

-        base_path = cls.metadata
-        base_name = f"waybackup_{sanitize_filename(cls.url)}"
-        cls.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
-        cls.dbfile = os.path.join(base_path, f"{base_name}.db")
-        cls.csvfile = os.path.join(base_path, f"{base_name}.csv")
-        cls.log = os.path.join(base_path, f"{base_name}.log") if cls.log else None
-
-        if cls.reset:
-            os.remove(cls.cdxfile) if os.path.isfile(cls.cdxfile) else None
-            os.remove(cls.dbfile) if os.path.isfile(cls.dbfile) else None
-            os.remove(cls.csvfile) if os.path.isfile(cls.csvfile) else None
+    def get_args(self) -> dict:
+        """Returns the parsed arguments as a dictionary."""
+        return vars(self.args)

pywaybackup/Exception.py

Lines changed: 25 additions & 26 deletions
@@ -14,8 +14,9 @@ class Exception:
     command = None

     @classmethod
-    def init(cls, output=None, command=None):
+    def init(cls, debug=None, output=None, command=None):
         sys.excepthook = cls.exception_handler  # set custom exception handler (uncaught exceptions)
+        cls.debug = debug
         cls.output = output
         cls.command = command

@@ -44,32 +45,30 @@ def exception(cls, message: str, e: Exception, tb=None):
             exception_message += "!-- Traceback is None\n"
         exception_message += f"!-- Description: {e}\n-------------------------"
         print(exception_message)
-        debug_file = os.path.join(cls.output, "waybackup_error.log")
-        print(f"Exception log: {debug_file}")
-        # print("-------------------------")
-        # print(f"Full traceback:\n{original_tb}")
-        if cls.new_debug:  # new run, overwrite file
-            cls.new_debug = False
-            f = open(debug_file, "w", encoding="utf-8")
+        if cls.debug:
+            print(f"Exception log: {cls.debug}")
+            if cls.new_debug:  # new run, overwrite file
+                cls.new_debug = False
+                f = open(cls.debug, "w", encoding="utf-8")
+                f.write("-------------------------\n")
+                f.write(f"Version: {version('pywaybackup')}\n")
+                f.write("-------------------------\n")
+                f.write(f"Command: {cls.command}\n")
+                f.write("-------------------------\n\n")
+            else:  # current run, append to file
+                f = open(cls.debug, "a", encoding="utf-8")
+            f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "\n")
+            f.write(exception_message + "\n")
+            f.write("!-- Local Variables:\n")
+            for var_name, value in local_vars.items():
+                if var_name in ["status_message", "headers"]:
+                    continue
+                value = cls.relativate_path(str(value))
+                value = value[:666] + " ... " if len(value) > 666 else value
+                f.write(f" -- {var_name} = {value}\n")
             f.write("-------------------------\n")
-            f.write(f"Version: {version('pywaybackup')}\n")
-            f.write("-------------------------\n")
-            f.write(f"Command: {cls.command}\n")
-            f.write("-------------------------\n\n")
-        else:  # current run, append to file
-            f = open(debug_file, "a", encoding="utf-8")
-        f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "\n")
-        f.write(exception_message + "\n")
-        f.write("!-- Local Variables:\n")
-        for var_name, value in local_vars.items():
-            if var_name in ["status_message", "headers"]:
-                continue
-            value = cls.relativate_path(str(value))
-            value = value[:666] + " ... " if len(value) > 666 else value
-            f.write(f" -- {var_name} = {value}\n")
-        f.write("-------------------------\n")
-        f.write(original_tb + "\n")
-        f.close()
+            f.write(original_tb + "\n")
+            f.close()

     @classmethod
     def relativate_path(cls, input_str: str) -> str:
