Gruut
A tokenizer, text cleaner, and IPA phonemizer for several human languages.
    from gruut import text_to_phonemes

    text = 'He wound it around the wound, saying "I read it was $10 to read."'

    for sent_idx, word, word_phonemes in text_to_phonemes(text, lang="en-us"):
        print(word, *word_phonemes)
which outputs:

    he h ˈi
    wound w ˈaʊ n d
    it ˈɪ t
    around ɚ ˈaʊ n d
    the ð ə
    wound w ˈu n d
    , |
    saying s ˈeɪ ɪ ŋ
    i ˈaɪ
    read ɹ ˈɛ d
    it ˈɪ t
    was w ə z
    ten t ˈɛ n
    dollars d ˈɑ l ɚ z
    to t ə
    read ɹ ˈi d
    . ‖
Note that "wound" and "read" have different pronunciations when used in different contexts.
See the documentation for more details.
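One way such context-dependent pronunciations can be resolved is with part-of-speech information. A toy sketch of the idea (not gruut's actual internals; the tags and table below are invented for illustration):

```python
# Toy illustration: a part-of-speech tag selects between homograph
# pronunciations such as "read".
HOMOGRAPHS = {
    ("read", "VBD"): ["ɹ", "ˈɛ", "d"],  # past tense: "I read it yesterday"
    ("read", "VB"): ["ɹ", "ˈi", "d"],   # infinitive: "to read"
}

def pronounce(word, pos):
    # Returns None when no POS-specific pronunciation is known
    return HOMOGRAPHS.get((word, pos))
```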
Installation
    $ pip install gruut
Additional languages can be added during installation. For example, with French and Italian support:
    $ pip install gruut[fr,it]
You may also manually download language files and use the --lang-dir option:

    $ gruut <lang> <command> --lang-dir /path/to/language-files/

Extracting the files to $HOME/.config/gruut/ will allow gruut to make use of them automatically. gruut looks for language files in $HOME/.config/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.
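The lookup order above can be sketched in a few lines (a simplified illustration, not gruut's actual code; the function name find_lang_dir is invented):

```python
from pathlib import Path

def find_lang_dir(lang, lang_dir=None):
    """Illustrative sketch of the language-file lookup order."""
    if lang_dir:
        # An explicit --lang-dir always wins
        return Path(lang_dir)
    # Otherwise fall back to $HOME/.config/gruut/<lang>/,
    # where <lang> is the full language name (e.g. "de-de")
    return Path.home() / ".config" / "gruut" / lang
```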
Supported Languages
gruut currently supports:
- Czech (cs or cs-cz)
- German (de or de-de)
- English (en or en-us)
- Spanish (es or es-es)
- Farsi/Persian (fa)
- French (fr or fr-fr)
- Italian (it or it-it)
- Dutch (nl)
- Russian (ru or ru-ru)
- Swedish (sv or sv-se)
The goal is to support all of voice2json's languages.
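The short/full code pairs listed above can be captured in a small mapping (illustrative only; not an official gruut API):

```python
# Short language codes and their full names, taken from the list above
FULL_LANG = {
    "cs": "cs-cz", "de": "de-de", "en": "en-us", "es": "es-es",
    "fr": "fr-fr", "it": "it-it", "ru": "ru-ru", "sv": "sv-se",
}

def full_lang_name(lang):
    # fa and nl have no longer form in the list above
    return FULL_LANG.get(lang, lang)
```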
Dependencies
- Python 3.7 or higher
- Linux
  - Tested on Debian Buster
- Babel and num2words
  - Currency/number handling
- gruut-ipa
  - IPA pronunciation manipulation
- pycrfsuite
  - Part-of-speech tagging and grapheme-to-phoneme models
Command-Line Usage
The gruut module can be executed with:

    $ python3 -m gruut <LANGUAGE> <COMMAND> <ARGS>

The commands are line-oriented, consuming and producing either text or JSONL. They can be composed into a pipeline for cleaning text. You will probably want to install jq to manipulate the JSONL output from gruut.
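The line-oriented JSONL convention can be sketched in a few lines (an illustration of the general pattern, not gruut's code; run_filter is an invented name):

```python
import json

def run_filter(lines, transform):
    # Read one JSON object per input line, transform it, and emit one
    # JSON object per output line -- the shape of each command in a
    # tokenize | phonemize pipeline.
    for line in lines:
        obj = json.loads(line)
        yield json.dumps(transform(obj))
```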
tokenize
Takes raw text and outputs JSONL with cleaned words/tokens.
    $ echo 'This, right here, is some RAW text!' \
        | python3 -m gruut en-us tokenize \
        | jq -c .clean_words
    ["this", ",", "right", "here", ",", "is", "some", "raw", "text", "!"]
See python3 -m gruut <LANGUAGE> tokenize --help for more options.
phonemize
Takes JSONL output from tokenize and produces JSONL with phonemic pronunciations.
    $ echo 'This, right here, is some RAW text!' \
        | python3 -m gruut en-us tokenize \
        | python3 -m gruut en-us phonemize \
        | jq -c .pronunciation_text
    ð ɪ s | ɹ aɪ t h iː ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t ‖
See python3 -m gruut <LANGUAGE> phonemize --help for more options.
Intended Audience
gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
For each supported language, gruut includes:

- A word pronunciation lexicon built from open source data
  - See pron_dict
- A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:

- A pre-trained part-of-speech tagger built from open source data
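The lexicon-first, grapheme-to-phoneme-fallback behavior described above can be sketched like this (a toy illustration with invented data, not gruut's internals):

```python
# Toy lexicon mapping words to one or more phoneme sequences
lexicon = {
    "the": [["ð", "ə"]],
    "read": [["ɹ", "ˈɛ", "d"], ["ɹ", "ˈi", "d"]],
}

def guess_phonemes(word):
    # Stand-in for a trained grapheme-to-phoneme model
    return list(word)

def phonemize_word(word, lexicon):
    prons = lexicon.get(word)
    if prons:
        # Known word: use the first lexicon pronunciation
        return prons[0]
    # Out-of-vocabulary word: fall back to the g2p guess
    return guess_phonemes(word)
```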