try/except UTF-8 UnicodeDecodeError and go to next record #6

@clemsos

Hi there,

I want to index large CSV files containing a corpus of tweets in Chinese. All files are UTF-8, but some characters seem to have encoding problems. Is there a way to prevent the import from breaking and instead just skip the bad record?
I did that in a previous analysis of this corpus with Python and it went OK.

Thanks!

$ python -m esimport -s localhost:9200 -f /home/clemsos/Dev/mitras/data/week10.csv -i weiboscope -t tweet
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dap was already imported from None, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/esimport/__main__.py", line 110, in <module>
    main(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/esimport/__main__.py", line 96, in main
    verify = args.skip_verify)
  File "/usr/local/lib/python2.7/dist-packages/esimport/esimport.py", line 44, in import_data
    data_lines = utils.retrieve_file_lines(filename)
  File "/usr/local/lib/python2.7/dist-packages/esimport/utils.py", line 31, in retrieve_file_lines
    decoded_contents = retrieve_file(filename)
  File "/usr/local/lib/python2.7/dist-packages/esimport/utils.py", line 14, in retrieve_file
    decoded_contents = original_contents.decode('utf-8-sig').encode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbc in position 1418516: invalid start byte
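
For reference, the behavior being asked for can be sketched as follows. The traceback shows `retrieve_file` decoding the whole file in one call, so a single bad byte aborts the import; decoding one line at a time would let a bad record be skipped instead. This is only a sketch under assumptions: `iter_decodable_lines` is a hypothetical helper, not part of esimport, it is written for Python 3 rather than the Python 2.7 in the traceback, and it treats one physical line as one record, which does not hold for CSV fields containing quoted newlines.

```python
import io


def iter_decodable_lines(binary_stream):
    """Yield (line_number, text) for every line that is valid UTF-8.

    Hypothetical helper for illustration; lines that fail to decode
    are skipped rather than raising.
    """
    for line_number, raw in enumerate(binary_stream, start=1):
        # Only the first line can start with a BOM, so strip it there.
        codec = "utf-8-sig" if line_number == 1 else "utf-8"
        try:
            yield line_number, raw.decode(codec)
        except UnicodeDecodeError:
            # Skip the bad record and go on to the next one.
            continue


# Small in-memory demo: the third line contains the stray byte 0xbc.
sample = b"\xef\xbb\xbfid,text\n1,\xe4\xbd\xa0\xe5\xa5\xbd\n2,bad\xbcbyte\n3,ok\n"
good_lines = list(iter_decodable_lines(io.BytesIO(sample)))
for line_number, text in good_lines:
    print(line_number, text.rstrip("\n"))
```

With a real file, the stream would come from `open(path, "rb")`. Keeping the line number makes it possible to log which records were dropped.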
