Raise on character encoding errors

I've been using Reverse Markdown and it works great most of the time. I've run into one issue that I thought I'd get your opinion on.

Sometimes the HTML documents I'm converting have character encoding problems, leading to the dreaded `ArgumentError: invalid byte sequence in UTF-8`.

In other places I fix this by coercing each line of a file to UTF-8 as I read it. I've found that you can generally just call `force_encoding` on a line, and that handles typographic marks and the like pretty well, but occasionally it isn't enough and you have to be more aggressive, i.e. the following:

```ruby
def clean_line(line)
  # The result must be UTF-8. force_encoding only relabels the bytes and never
  # validates them, so a bad line doesn't blow up until you call something else on it...
  line.force_encoding("UTF-8").strip # strip will make this raise if it didn't work
rescue ArgumentError
  # ... in that case, re-read the bytes as binary and drop whatever can't be converted
  # (note: this removes every non-ASCII byte, not only the offending ones).
  line.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: "")
end
```
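For comparison, here is a minimal sketch of a gentler variant, assuming Ruby 2.1 or later, where `String#scrub` is available. `scrub_line` is a hypothetical name used only for illustration; it is not part of ReverseMarkdown. It checks validity explicitly with `valid_encoding?` instead of relying on an exception, and it removes only the invalid byte sequences, so valid multibyte characters survive.

```ruby
# Hypothetical helper (not part of ReverseMarkdown): returns a valid UTF-8 copy of +line+.
def scrub_line(line)
  # Work on a copy so the caller's string (which may be frozen) is left untouched.
  utf8 = line.dup.force_encoding(Encoding::UTF_8)
  # force_encoding only relabels the bytes, so check validity explicitly.
  return utf8 if utf8.valid_encoding?

  # String#scrub replaces each invalid byte sequence, here with an empty string.
  utf8.scrub("")
end

# Usage: the valid "é" is kept, the stray \xFF byte is dropped.
cleaned = scrub_line("caf\xC3\xA9 \xFF ok".b)
puts cleaned.valid_encoding?
```

The trade-off is that bytes which happen to form valid UTF-8 are kept as they are, even if the document was really written in another encoding such as Windows-1252.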

I end up using this same code to scrub HTML before I pass it to ReverseMarkdown, but it would probably be more efficient to handle it inside the gem - and it would save other people from the same headache.

Are you interested in handling encoding errors inside the gem? If yes, you can use that code, or I can try to circle back with a PR. If not, no worries, just thought it might be worth considering.

Thanks for a great gem!

Raise on character encoding errors #73

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Raise on character encoding errors #73

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions