OOV tokens are deleted by English g2p conversion. #751

@joanise

Bug description

If your text includes OOVs, such as digits, typos, or unknown words (i.e., words not in the CMU dict used to build the English g2p mapping), they are simply stripped out of the text before training or synthesis when converting to phones.

E.g., "testing 123 testings test" gets g2p'd to "tɛstɪŋ tɛst", which is not great for training and potentially catastrophic for synthesis.
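A minimal sketch of the failure mode, assuming a simple dict-based lookup (the cmu_dict stand-in and the loop below are hypothetical, not the project's actual code):

```python
# Hypothetical dict-based g2p: anything not in the lexicon is silently dropped.
cmu_dict = {"testing": "tɛstɪŋ", "test": "tɛst"}

def naive_g2p(text: str) -> str:
    phones = []
    for word in text.lower().split():
        if word in cmu_dict:
            phones.append(cmu_dict[word])
        # OOVs like "123" and "testings" fall through here and simply vanish
    return " ".join(phones)

print(naive_g2p("testing 123 testings test"))  # -> tɛstɪŋ tɛst
```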

When a given utterance consists exclusively of OOVs, you get a stack dump, as described in #741.

This problem was noticed by @marctessier a few weeks ago.

Possible suggestions by @roedoejet

  • fall back to und, like readalongs does? (see the sketch after this list)
  • fall back to a neural g2p model?
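
A rough sketch of the first suggestion, assuming the g2p library's make_g2p() and its und / und-ipa catch-all mapping; the tiny in-memory lexicon and the fallback wiring are assumptions for illustration, not the project's actual code:

```python
from g2p import make_g2p

lexicon = {"testing": "tɛstɪŋ", "test": "tɛst"}  # stand-in for the CMU-based English mapping
und = make_g2p("und", "und-ipa")                 # catch-all mapping, as used by readalongs

def phonemize(word: str) -> str:
    phones = lexicon.get(word.lower())
    if phones is None:                 # OOV: fall back to und instead of dropping the word
        phones = und(word).output_string
    return phones

print(" ".join(phonemize(w) for w in "testing 123 testings test".split()))
```

A neural g2p fallback (the second suggestion) could be swapped in at the same point, replacing the und transducer with a model call.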
