@bobbyjaros bobbyjaros commented Apr 26, 2016

Adds nnparse.exe and SeqToSeqData.scala, which together convert paired text files into the formatted matrices consumed by SeqToSeq.

[Just a cleaner version of #77 (which had some extra mods unrelated to this PR)]

Bobby Jaros added 4 commits December 17, 2015 22:36
newparse can optionally output a paragraphid and a sentenceid for each token:
        p1 s1 w1
        p2 s2 w2
        p3 s3 w3
        p4 s4 w4
        p5 s5 w5
        p6 s6 w6

nnparse uses this functionality in a deliberately simple way: it assumes
each newline denotes a new paragraph and each ". ", "? ", or "! " denotes
a new sentence.
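
As a rough illustration of the splitting rule above, here is a minimal Python sketch. It is not the nnparse implementation: the function name `parse_tokens`, whitespace tokenization, zero-based ids, a sentence counter that runs across paragraphs, and a dictionary built on the fly are all assumptions made for the example.

```python
import re

def parse_tokens(text, dictionary=None):
    """Return (rows, dictionary); rows are (paragraph_id, sentence_id, word_id) triples.

    Hypothetical helper: ids are zero-based and the sentence counter is global,
    which may differ from what nnparse.exe actually emits.
    """
    dictionary = {} if dictionary is None else dictionary
    rows = []
    sentence_id = 0
    # Each newline starts a new paragraph.
    for paragraph_id, line in enumerate(text.split("\n")):
        # A space preceded by ".", "?" or "!" ends a sentence; the lookbehind
        # keeps the punctuation attached to the preceding word.
        for sentence in re.split(r"(?<=[.?!]) ", line):
            words = sentence.split()
            if not words:
                continue
            for word in words:
                # Assign the next free index to words not seen before.
                word_id = dictionary.setdefault(word, len(dictionary))
                rows.append((paragraph_id, sentence_id, word_id))
            sentence_id += 1
    return rows, dictionary
```

Each returned triple corresponds to one `p s w` row in the listing above.
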

SeqToSeqData starts with the output of nnparse.exe: two paired files, each in this format:
         p1 s1 w1
         p2 s2 w2
         p3 s3 w3
         p4 s4 w4
         p5 s5 w5
         p6 s6 w6

For SeqToSeq we assume each line contains one sentence, so the paragraphid
(the first column) identifies the sentence and the sentenceid (the second
column) is ignored.

The two parsed sentence IMats are paired line-by-line:  the ith line of the
src IMat corresponds to the ith line of the dst IMat.
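
The grouping and pairing described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `rows_to_sentences` and `pair_sentences` are hypothetical names, rows are modelled as 3-tuples instead of IMat rows, and all rows of one sentence are assumed to be contiguous.

```python
from itertools import groupby

def rows_to_sentences(rows):
    """Group (paragraph_id, sentence_id, word_id) rows into lists of word ids.

    The first column identifies the sentence; the second column is ignored.
    Assumes rows belonging to the same sentence are adjacent.
    """
    return [[word_id for _, _, word_id in group]
            for _, group in groupby(rows, key=lambda row: row[0])]

def pair_sentences(src_rows, dst_rows):
    """Pair the i-th src sentence with the i-th dst sentence."""
    src = rows_to_sentences(src_rows)
    dst = rows_to_sentences(dst_rows)
    if len(src) != len(dst):
        raise ValueError("src and dst must contain the same number of sentences")
    return list(zip(src, dst))
```
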

Produces two paired SMats of the following form:
         w00  w01  w02  w03  w04  w05  ...
         w10  w11  w12  w13  w14  w15P ...
         w20  w21  w22  w23P w24  w25P ...
         w30  w31P w32                 ...
         w40P w41P w42                 ...

where
   wij is the dictionary index of the i'th word in the j'th sentence and
   words with a P suffix are padding symbols.
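
To make the layout concrete, here is a small Python sketch that turns a list of sentences into that column-per-sentence form. It is an illustration only: `pad_to_columns` and `pad_id` are hypothetical names, the result is a dense list of rows instead of an SMat, and padding every column to the longest sentence is an assumption.

```python
def pad_to_columns(sentences, pad_id):
    """Return a row-major matrix whose entry [i][j] is word i of sentence j.

    Sentences shorter than the longest one are filled with pad_id.
    """
    height = max((len(sentence) for sentence in sentences), default=0)
    return [[sentence[i] if i < len(sentence) else pad_id
             for sentence in sentences]
            for i in range(height)]
```

Calling `pad_to_columns` on the src sentences and again on the dst sentences, in the same order, keeps column j of both outputs aligned with the same input line.
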

The columns of the two output SMats are still paired: column j of the
src output SMat and column j of the dst output SMat correspond to line j
of the src input and line j of the dst input respectively.

Furthermore, the sentences are collated into batches of similar lengths.

The minibatches are randomly permuted after collation to avoid training bias.
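
One way to picture collation followed by permutation is the Python sketch below. It is not the code in SeqToSeqData.scala: the name `make_batches`, sorting by src sentence length, a fixed `batch_size`, and the `seed` parameter are all assumptions chosen for the example.

```python
import random

def make_batches(src, dst, batch_size, seed=0):
    """Collate paired sentences into length-sorted minibatches, then shuffle the batches.

    src and dst are parallel lists of sentences; pairing is preserved because
    the same index order is applied to both.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must be the same length")
    # Sort sentence indices by src length so each batch holds similar lengths.
    order = sorted(range(len(src)), key=lambda j: len(src[j]))
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    # Permute whole batches, not individual sentences, so lengths stay grouped.
    random.Random(seed).shuffle(batches)
    return [([src[j] for j in batch], [dst[j] for j in batch]) for batch in batches]
```

Shuffling at the batch level keeps the padding savings from length grouping while removing the short-to-long ordering that sorting introduces.
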

See in-file docs for additional options.