Description
I've been using Reverse Markdown and it works great most of the time. I've run into one issue that I thought I'd get your opinion on.
Sometimes the HTML documents I'm converting have character encoding problems, leading to the dreaded ArgumentError: invalid byte sequence in UTF-8.
In other places I'm fixing this by coercing the lines of a file to UTF-8 as I read them. I've discovered that when you parse a line you can generally just call force_encoding on it, and that handles typographic marks and whatnot pretty well, but occasionally you'll run into cases where that's not enough and you have to be more aggressive, i.e. the following:
def clean_line(line)
  # encoding must be utf8, if non-utf8 characters are encountered we remove them.
  # Weirdly though, this can fail, but then doesn't blow up until you call something else on the string...
  line.force_encoding("UTF-8").strip # strip will make this raise if it didn't work
rescue
  # ... in that case we want to selectively remove the offending characters.
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

I end up using this same code to scrub HTML before I enter it into ReverseMarkdown, but it would probably be more efficient to handle it inside the gem - and would save other people from this same headache.
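For comparison, here is a minimal sketch of the same cleanup using String#scrub, which is available in Ruby 2.1 and later. The helper name scrub_html is hypothetical and is not part of ReverseMarkdown. Unlike the binary re-encode in the rescue branch above, which drops every non-ASCII byte, this should only drop the byte sequences that are invalid in UTF-8.

```ruby
# Hypothetical helper (an assumption for illustration, not ReverseMarkdown API):
# returns a valid UTF-8 copy of the input with invalid byte sequences removed.
def scrub_html(html)
  # dup first so force_encoding does not mutate the caller's string
  utf8 = html.dup.force_encoding(Encoding::UTF_8)
  # scrub('') removes invalid byte sequences; valid multibyte characters are kept
  utf8.scrub('')
end

# Example input: valid UTF-8 ("é" and a curly quote) followed by two stray bytes
dirty = "caf\xC3\xA9 \xE2\x80\x9Cquoted\xFF\x9D".b
clean = scrub_html(dirty)
puts clean.valid_encoding? # prints true
```

The cleaned string could then be passed to ReverseMarkdown.convert as usual.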
Are you interested in handling encoding errors inside the gem? If yes, you can use that code, or I can try to circle back with a PR. If not, no worries, just thought it might be worth considering.
Thanks for a great gem!