Initially I removed local asset files such as JS and CSS. Now I'm converting all the HTML files to Markdown. I made a fork where I'm making the necessary changes; don't use this code.
#!/usr/bin/env python3
import os
import html2markdown
from bs4 import BeautifulSoup, Doctype
# reset
# find . -name \*.css -type f -delete
# find . -name \*.icon -type f -delete
# find . -name \*.ico -type f -delete
# find . -name \*.js -type f -delete
# find . -name \*.png -type f -delete
# find . -name \*.svg -type f -delete
# find . -name \*.jpeg -type f -delete
# find . -name \*.jpg -type f -delete
# find . -name \*.jfif -type f -delete
# find . -name \*.json -type f -delete
# find . -name \*.gif -type f -delete
directory = './data_sets/'
# Walk the data set and report the doctype of every HTML file
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
            # The doctype, if present, is a top-level node of the parsed document
            for item in soup.contents:
                if isinstance(item, Doctype):
                    print('Doctype: {}'.format(item))
                    break
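The script above only prints each file's doctype, so here is a rough sketch of what the HTML-to-Markdown step itself might look like. It is not the code from the fork. To stay runnable without extra packages it uses the standard library's `html.parser` as a stand-in for `html2markdown` and BeautifulSoup, and it handles only headings, paragraphs, list items, and bold/italic text. The names `MiniMarkdown`, `html_to_markdown`, and `convert_tree` are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class MiniMarkdown(HTMLParser):
    """Hypothetical, deliberately tiny converter (not html2markdown)."""

    HEADINGS = {"h{}".format(n): "#" * n for n in range(1, 7)}
    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}
    BLOCKS = {"p", "div", "br"}
    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished Markdown blocks
        self.current = []    # text fragments of the block being built
        self.prefix = ""     # "# ", "- ", ... for the block being built
        self.skip_depth = 0  # > 0 while inside <head>, <script> or <style>

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()
            self.prefix = self.HEADINGS[tag] + " "
        elif tag == "li":
            self._flush()
            self.prefix = "- "
        elif tag in self.BLOCKS:
            self._flush()
        elif tag in self.INLINE and self.skip_depth == 0:
            self.current.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.HEADINGS or tag == "li" or tag in self.BLOCKS:
            self._flush()
        elif tag in self.INLINE and self.skip_depth == 0:
            self.current.append(self.INLINE[tag])

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def html_to_markdown(html):
    """Convert an HTML string to Markdown, blocks separated by blank lines."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks) + "\n"


def convert_tree(directory):
    """Write a .md file next to every .html file under `directory`."""
    written = []
    for src in sorted(Path(directory).rglob("*.html")):
        html = src.read_text(encoding="utf-8", errors="replace")
        dst = src.with_suffix(".md")
        dst.write_text(html_to_markdown(html), encoding="utf-8")
        written.append(dst)
    return written
```

Calling `convert_tree('./data_sets/')` would leave the original `.html` files in place and write the `.md` files alongside them. Links, tables, and images are ignored, so a real converter would still be needed for the full site.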
-
I pulled the old jnode.org website and put the contents here:
https://github.com/jnode-revisited/dataset-jnode.org
There is scripting work to do if anyone is familiar with manipulating text and specifically HTML.