Hi there,
I want to index large CSV files containing a corpus of tweets in Chinese. All files are UTF-8, but some characters seem to have encoding problems. Is there a way to prevent the import from breaking and instead just skip the bad records?
I did that in a previous analysis of this corpus with Python and it went OK.
Thanks !
$ python -m esimport -s localhost:9200 -f /home/clemsos/Dev/mitras/data/week10.csv -i weiboscope -t tweet
/usr/local/lib/python2.7/dist-packages/pytz/__init__.py:29: UserWarning: Module dap was already imported from None, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import resource_stream
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/esimport/__main__.py", line 110, in <module>
    main(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/esimport/__main__.py", line 96, in main
    verify = args.skip_verify)
  File "/usr/local/lib/python2.7/dist-packages/esimport/esimport.py", line 44, in import_data
    data_lines = utils.retrieve_file_lines(filename)
  File "/usr/local/lib/python2.7/dist-packages/esimport/utils.py", line 31, in retrieve_file_lines
    decoded_contents = retrieve_file(filename)
  File "/usr/local/lib/python2.7/dist-packages/esimport/utils.py", line 14, in retrieve_file
    decoded_contents = original_contents.decode('utf-8-sig').encode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbc in position 1418516: invalid start byte
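For reference, the traceback shows `retrieve_file` decoding the whole file in one strict call, so a single bad byte aborts the import. Below is a rough sketch of the behaviour I am asking for: decode line by line and skip whatever fails. It is not esimport code; the function name `decode_lines_skipping_bad` is made up for illustration, and it is written for Python 3 even though the traceback above comes from Python 2.7. It also assumes one record per line, so a CSV with newlines inside quoted fields would need more care.

```python
import codecs


def decode_lines_skipping_bad(raw_bytes, encoding="utf-8"):
    """Decode bytes line by line, skipping lines that are not valid text.

    Hypothetical helper, not part of esimport. Returns a tuple
    (good_lines, skipped_line_numbers), with line numbers starting at 1.
    """
    # Drop a leading UTF-8 byte order mark, as the 'utf-8-sig' codec would.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        raw_bytes = raw_bytes[len(codecs.BOM_UTF8):]

    good_lines = []
    skipped_line_numbers = []
    # bytes.splitlines() only splits on ASCII line breaks, which never occur
    # inside a multi-byte UTF-8 sequence, so valid characters are not cut.
    for number, raw_line in enumerate(raw_bytes.splitlines(), start=1):
        try:
            good_lines.append(raw_line.decode(encoding))
        except UnicodeDecodeError:
            # Record the line number so skipped records can be inspected later.
            skipped_line_numbers.append(number)
    return good_lines, skipped_line_numbers


if __name__ == "__main__":
    # The second line contains a lone 0xbc byte, like the one in the traceback.
    sample = (
        codecs.BOM_UTF8
        + "1,你好\n".encode("utf-8")
        + b"2,\xbc\n"
        + b"3,ok\n"
    )
    lines, skipped = decode_lines_skipping_bad(sample)
    print(lines)
    print(skipped)
```

If dropping a whole record is too aggressive, passing `errors="replace"` to `bytes.decode` would keep the line and substitute U+FFFD for the undecodable bytes instead.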