Invalid white space character exception #3

@ngndn

Hi,

I'm using wikiforia (version 1.1.1) to parse the English Wikipedia dump (dated 20150602). It hits the error below and stops. The full log follows. How can I fix this?

java.io.IOError: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x14) in text to output (in xml 1.1, could output as a character entity)
at se.lth.cs.nlp.io.XmlWikipediaPageWriter.process(XmlWikipediaPageWriter.java:91) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.pipeline.AbstractEmitter.output(AbstractEmitter.java:44) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.pipeline.IdentityFilter.process(IdentityFilter.java:11) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.pipeline.AbstractEmitter.output(AbstractEmitter.java:44) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.wikipedia.parser.SwebleWikimarkupParserBase.process(SwebleWikimarkupParserBase.java:91) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.pipeline.AbstractEmitter.output(AbstractEmitter.java:44) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser.access$300(MultistreamBzip2XmlDumpParser.java:43) ~[wikiforia-1.1.1.jar:?]
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$Worker.run(MultistreamBzip2XmlDumpParser.java:349) ~[wikiforia-1.1.1.jar:?]
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x14) in text to output (in xml 1.1, could output as a character entity)
at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:462) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
at se.lth.cs.nlp.io.XmlWikipediaPageWriter.process(XmlWikipediaPageWriter.java:85) ~[wikiforia-1.1.1.jar:?]
... 7 more
Caused by: java.io.IOException: Invalid white space character (0x14) in text to output (in xml 1.1, could output as a character entity)
at com.ctc.wstx.api.InvalidCharHandler$FailingHandler.convertInvalidChar(InvalidCharHandler.java:55) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
at com.ctc.wstx.sw.XmlWriter.handleInvalidChar(XmlWriter.java:623) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
at com.ctc.wstx.sw.BufferingXmlWriter.writeCharacters(BufferingXmlWriter.java:554) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:460) ~[woodstox-core-asl-4.2.0.jar:4.2.0]
at se.lth.cs.nlp.io.XmlWikipediaPageWriter.process(XmlWikipediaPageWriter.java:85) ~[wikiforia-1.1.1.jar:?]
... 7 more
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.getNext(MultistreamBzip2XmlDumpParser.java:286)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.read(MultistreamBzip2XmlDumpParser.java:305)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:232)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.complete(BZip2CompressorInputStream.java:348)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:284)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:868)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:917)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:217)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:172)
at com.ctc.wstx.io.BaseReader.readBytes(BaseReader.java:155)
at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:368)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:111)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:87)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:991)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4147)
at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:679)
at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.processPages(XmlDumpParser.java:276)
at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.next(XmlDumpParser.java:337)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$Worker.run(MultistreamBzip2XmlDumpParser.java:345)
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.getNext(MultistreamBzip2XmlDumpParser.java:286)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.read(MultistreamBzip2XmlDumpParser.java:305)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:232)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.complete(BZip2CompressorInputStream.java:348)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:284)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:868)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:917)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:217)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:172)
at com.ctc.wstx.io.BaseReader.readBytes(BaseReader.java:155)
at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:368)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:111)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:87)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:991)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4147)
at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:679)
at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.processPages(XmlDumpParser.java:276)
at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.next(XmlDumpParser.java:337)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$Worker.run(MultistreamBzip2XmlDumpParser.java:345)
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.getNext(MultistreamBzip2XmlDumpParser.java:286)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$ParallelDumpStream.read(MultistreamBzip2XmlDumpParser.java:305)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:232)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.complete(BZip2CompressorInputStream.java:348)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:284)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:868)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:917)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:217)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:172)
at com.ctc.wstx.io.BaseReader.readBytes(BaseReader.java:155)
at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:368)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:111)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:87)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:991)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4147)
at com.ctc.wstx.sr.BasicStreamReader.getElementText(BasicStreamReader.java:679)
at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.processPages(XmlDumpParser.java:276)
at se.lth.cs.nlp.mediawiki.parser.XmlDumpParser.next(XmlDumpParser.java:337)
at se.lth.cs.nlp.mediawiki.parser.MultistreamBzip2XmlDumpParser$Worker.run(MultistreamBzip2XmlDumpParser.java:345)
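
The first exception says the writer was handed a character (0x14) that XML 1.0 does not allow in text. One possible workaround, sketched here under assumptions, is to strip such characters from the page text before it reaches the XML writer. The class and method names below (`XmlTextSanitizer`, `stripInvalidXmlChars`) are hypothetical and not part of wikiforia or Woodstox, and where to call this inside wikiforia would need to be checked against its source. The valid ranges follow the `Char` production of the XML 1.0 specification.

```java
public final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    // True if the code point is allowed in XML 1.0 text:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: returns a copy of the text without characters
    // that an XML 1.0 writer would reject (for example 0x14).
    public static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Work on code points so valid surrogate pairs are kept intact.
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "foo" + (char) 0x14 + "bar";
        System.out.println(stripInvalidXmlChars(dirty)); // prints "foobar"
    }
}
```

Dropping the characters silently loses information, so replacing them with a space may be preferable for some uses. The trace also shows Woodstox going through `InvalidCharHandler$FailingHandler`, which suggests the writer's handling of invalid characters may be configurable. That would be another route worth checking in the Woodstox documentation.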
