SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
This repository is for the 0.5.* versions of SudachiPy; versions 0.6.* and above are developed as Sudachi.rs.
$ pip install sudachipy sudachidict_core
$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
EOS
$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪 名詞,固有名詞,地名,一般,*,* 高輪
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
駅 名詞,普通名詞,一般,*,*,* 駅
EOS
$ echo "空缶空罐空きカン" | sudachipy -a
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
EOS
You need SudachiPy and a dictionary.
$ pip install sudachipy
You can get the dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).
$ pip install sudachidict_core
Alternatively, you can choose other dictionary editions. See the section on dictionary editions for details.
There is a CLI command sudachipy.
$ echo "外国人参政権" | sudachipy
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国 名詞,普通名詞,一般,*,*,* 外国
人 接尾辞,名詞的,一般,*,*,* 人
参政 名詞,普通名詞,一般,*,*,* 参政
権 接尾辞,名詞的,一般,*,*,* 権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
[-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-s string sudachidict type
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the -a option, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - 0 for the system dictionary
  - 1 and above for the user dictionaries
  - -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)
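If you post-process the CLI output in Python, the columns listed above can be split on tabs. The following is a minimal sketch, not part of SudachiPy: `TokenLine` and `parse_token_line` are hypothetical names, and it assumes that an Out-of-Vocabulary line carries an empty reading-form column followed by `-1` and `(OOV)`.

```python
from typing import List, NamedTuple, Optional


class TokenLine(NamedTuple):
    surface: str
    part_of_speech: List[str]
    normalized_form: str
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    is_oov: bool = False


def parse_token_line(line: str) -> Optional[TokenLine]:
    """Parse one output line; return None for the EOS sentence marker."""
    line = line.rstrip("\r\n")
    if line == "EOS":
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated columns: {line!r}")
    surface, pos, normalized = fields[0], fields[1].split(","), fields[2]
    if len(fields) < 6:
        # Default output: surface, part-of-speech tags, normalized form.
        return TokenLine(surface, pos, normalized)
    # Output of -a: dictionary form, reading form, dictionary ID,
    # and (assumed) a trailing "(OOV)" column for unknown words.
    return TokenLine(
        surface,
        pos,
        normalized,
        dictionary_form=fields[3],
        reading_form=fields[4],
        dictionary_id=int(fields[5]),
        is_oov="(OOV)" in fields[6:],
    )
```

Feeding each line of `sudachipy -a` output through `parse_token_line` yields one record per morpheme and `None` at each sentence boundary.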
$ echo "外国人参政権" | sudachipy -a
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0
EOS
$ echo "阿quei" | sudachipy -a
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 (OOV)
quei 名詞,普通名詞,一般,*,*,* quei quei -1 (OOV)
EOS
Here is an example:
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular Tokenization
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']
mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']
mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
(With 20200330 core dictionary. The results may change when you use other versions)
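The accessor calls shown above can be gathered into one helper when you want every field for every morpheme. This is a sketch only: `describe_morphemes` is a hypothetical name, and it relies solely on the methods used in the example (`tokenize`, `surface`, `dictionary_form`, `reading_form`, `normalized_form`, `part_of_speech`).

```python
from typing import Any, Dict, List


def describe_morphemes(tokenizer_obj: Any, text: str, mode: Any) -> List[Dict[str, Any]]:
    """Collect the fields shown above for every morpheme in `text`."""
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading_form": m.reading_form(),
            "normalized_form": m.normalized_form(),
            "part_of_speech": list(m.part_of_speech()),
        }
        for m in tokenizer_obj.tokenize(text, mode)
    ]


# Usage with SudachiPy and a dictionary installed (not executed here):
#   from sudachipy import dictionary, tokenizer
#   tokenizer_obj = dictionary.Dictionary().create()
#   mode = tokenizer.Tokenizer.SplitMode.A
#   rows = describe_morphemes(tokenizer_obj, "国家公務員", mode)
```

The helper takes the tokenizer as an argument rather than creating one, so a single tokenizer object can be reused across calls.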
**WARNING: sudachipy link is no longer available in SudachiPy v0.5.2 and later.**
There are three editions of Sudachi Dictionary: small, core, and full. See WorksApplications/SudachiDict for details.
SudachiPy uses sudachidict_core by default.
Dictionaries are installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.
The dictionary files are not in the package itself; they are downloaded upon installation.
You can specify the dictionary with the tokenize option -s.
$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
You can specify the dictionary with the Dictionary() arguments config_path or dict_type.
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
- config_path
  - You can specify the file path to the setting file with config_path (See [Dictionary in The Setting File](#Dictionary in The Setting File) for the detail).
  - If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
- dict_type
  - You can also specify the dictionary type with dict_type.
  - The available arguments are small, core, or full.
  - If different dictionaries are specified with config_path and dict_type, the dictionary specified by dict_type overrides the one defined in the config file.
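The precedence described above can be summarized as a small decision function. This is a toy model of the documented behavior, not SudachiPy's actual resolution code; `DICT_PACKAGES` and `describe_dictionary_choice` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Package names as listed in this README.
DICT_PACKAGES = {
    "small": "sudachidict_small",
    "core": "sudachidict_core",
    "full": "sudachidict_full",
}


def describe_dictionary_choice(
    system_dict_in_config: Optional[str] = None,
    dict_type: Optional[str] = None,
) -> str:
    """Return which dictionary the documented rules would select."""
    if dict_type is not None:
        # dict_type overrides a dictionary defined in the setting file.
        if dict_type not in DICT_PACKAGES:
            raise ValueError(f"dict_type must be one of {sorted(DICT_PACKAGES)}")
        return DICT_PACKAGES[dict_type]
    if system_dict_in_config is not None:
        # systemDict from the setting file is used when no dict_type is given.
        return system_dict_in_config
    # With no arguments, sudachidict_core is the default.
    return DICT_PACKAGES["core"]
```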
from sudachipy import tokenizer
from sudachipy import dictionary
# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()
# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()
# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create() # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create() # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create() # sudachidict_full
# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.
{
"systemDict" : "relative/path/to/system.dic",
...
}
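Because the systemDict value is a path relative to sudachi.json, it can be convenient to compute it rather than type it by hand. The sketch below is an illustration under assumptions: `set_system_dict` is a hypothetical helper, and it assumes your setting file is a plain JSON copy that you are free to rewrite. It keeps all other keys in the file unchanged.

```python
import json
import os


def set_system_dict(setting_path: str, system_dic_path: str) -> str:
    """Store `system_dic_path` in the setting file, relative to that file."""
    setting_dir = os.path.dirname(os.path.abspath(setting_path))
    relative = os.path.relpath(os.path.abspath(system_dic_path), start=setting_dir)

    # Load the existing settings so that other keys are preserved.
    with open(setting_path, encoding="utf-8") as f:
        settings = json.load(f)

    settings["systemDict"] = relative

    with open(setting_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return relative
```

Note that `os.path.relpath` uses the platform's path separator, so the stored value contains backslashes on Windows.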
The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.
{
"userDict" : ["relative/path/to/user.dic"],
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
You can build a user dictionary with the subcommand ubuild.
WARNING: ubuild in v0.3.* contains a bug.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary path (default: system core dictionary path)
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
$ python setup.py build_ext --inplace
Run scripts/format.sh to check if your code is formatted correctly.
You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).
Run scripts/test.sh to run the tests.
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!