There are two major ways to install PyRosetta: either into your standard system Python, or using a Python environment manager. Installation into the system Python is easier and makes PyRosetta available at all times, though it makes upgrading PyRosetta more difficult and may require administrator access. Using an environment manager is more flexible and permits installation as a normal (non-admin) user, but requires more effort in understanding the system.

To install into the system Python, enter:

    tar -vjxf PyRosetta-<version>.tar.bz2
    cd setup && sudo python setup.py install

into the command line to set up PyRosetta. To verify the installation, start Python and run:

    import pyrosetta; pyrosetta.init()

To install as a normal (non-admin) user instead, enter:

    tar -vjxf PyRosetta-<version>.tar.bz2
    cd setup && python setup.py install

into the command line, and verify the installation the same way:

    import pyrosetta; pyrosetta.init()
Pre-processing and tokenizing (gtts.tokenizer)

The gtts.tokenizer module powers the default pre-processing and tokenizing features of gTTS and provides tools to easily expand them. gtts.tts.gTTS takes two arguments: pre_processor_funcs (a list of functions) and tokenizer_func (a function). See: Pre-processing, Tokenizing.

Tokenizing

In the gTTS context, the goal of tokenizing is to cut the text into smaller segments that do not exceed the maximum character size allowed for each TTS API request, while making the speech sound natural and continuous. It does so by splitting text where speech would naturally pause (for example on ".") while handling where it should not (for example on "10.5" or "U.S.A."). Such rules are called tokenizer cases, which gtts.tokenizer.core.Tokenizer takes a list of.

More specifically, a tokenizer case is a function that returns a compiled regex object describing what to look for in a particular case. gtts.tokenizer.core.Tokenizer then creates its main regex pattern by joining all tokenizer cases with "|".

Pre-processing

You can pass a list of any functions to gtts.tts.gTTS's pre_processor_funcs attribute to act as pre-processors (as long as each takes a string and returns a string). By default, gtts.tts.gTTS takes a list of the following pre-processors, applied in order:
- gtts.tokenizer.pre_processors.abbreviations(text)
- gtts.tokenizer.pre_processors.end_of_line(text)
- gtts.tokenizer.pre_processors.tone_marks(text)
- gtts.tokenizer.pre_processors.word_sub(text) (a PreProcessorSub pre-processor; ex.: 'Esq.', 'Esquire')
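The pre-processor contract above (each function takes a string and returns a string, and gTTS applies them in order) can be sketched with stand-in functions. Note that `tone_marks` and `word_sub` here are simplified hypothetical versions, not the real gTTS implementations:

```python
# Hypothetical, simplified stand-ins for two pre-processors: each takes
# a string and returns a string, which is all gTTS requires of them.
def tone_marks(text):
    return text.replace("!", " ! ")  # pad tone marks (illustrative only)

def word_sub(text):
    return text.replace("Esq.", "Esquire")  # word-for-word substitution

def apply_pre_processors(text, funcs):
    # Apply each pre-processor to the text, in order.
    for func in funcs:
        text = func(text)
    return text

print(apply_pre_processors("Esq.!", [tone_marks, word_sub]))
```

Any function honouring the same contract can be added to the list passed as pre_processor_funcs.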
Two convenience classes are available for writing pre-processors: gtts.tokenizer.core.PreProcessorRegex (for regex-based replacing, as re.sub would use) and gtts.tokenizer.core.PreProcessorSub (for word-for-word replacements). The run(text) method of those objects returns the processed text.

The word_sub pre-processor is instantiated with the gtts.tokenizer.symbols.SUB_PAIRS list. Add a custom substitution by appending to that list.

The abbreviations pre-processor is instantiated with the gtts.tokenizer.symbols.ABBREVIATIONS list. Append to it to add a new abbreviation to remove the period from. Note: the default list already includes an extensive set of English abbreviations that Google Translate will read correctly even without the period. See gtts.tokenizer.pre_processors for more examples.
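The append-to-extend pattern described above can be sketched with the stdlib only; `SUB_PAIRS` and `word_sub` here are simplified stand-ins for their gtts.tokenizer counterparts, not the real library objects:

```python
import re

# Stand-in for gtts.tokenizer.symbols.SUB_PAIRS: a module-level list of
# (search, replacement) pairs the word_sub pre-processor is built from.
SUB_PAIRS = [('Esq.', 'Esquire')]

# Appending a custom pair extends the substitutions.
SUB_PAIRS.append(('Mac', 'PC'))

def word_sub(text):
    # Case-insensitive substitution for each pair, applied in order.
    for search, repl in SUB_PAIRS:
        text = re.sub(re.escape(search), repl, text, flags=re.IGNORECASE)
    return text

print(word_sub("Esq. uses a mac"))  # Esquire uses a PC
```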
You can pass any function to gtts.tts.gTTS's tokenizer_func attribute to act as tokenizer (as long as it takes a string and returns a list of strings). By default, gTTS takes gtts.tokenizer.core.Tokenizer's gtts.tokenizer.core.Tokenizer.run(), initialized with the default tokenizer cases:
- gtts.tokenizer.tokenizer_cases.colon()
- gtts.tokenizer.tokenizer_cases.legacy_all_punctuation()
- gtts.tokenizer.tokenizer_cases.other_punctuation()
- gtts.tokenizer.tokenizer_cases.period_comma()
- gtts.tokenizer.tokenizer_cases.tone_marks()
Each tokenizer case returns a compiled regex meant to be used in a re.split() context. gtts.tokenizer.core.Tokenizer takes a list of tokenizer cases and joins their patterns with "|" into one single pattern. Writing the patterns is made easier by gtts.tokenizer.core.RegexBuilder; see gtts.tokenizer.core.RegexBuilder and gtts.tokenizer.tokenizer_cases for examples.

While gtts.tokenizer.core.Tokenizer works well in this context, there are far more advanced tokenizers and tokenizing techniques. As long as you can restrict the length of the output tokens, you can use any tokenizer you'd like, such as the ones in NLTK.
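As a sketch of what a replacement tokenizer_func must look like: any callable that takes a string and returns a list of strings whose lengths fit the API limit. The 100-character limit and the splitting rule below are arbitrary illustrative choices, not gTTS's own:

```python
def simple_tokenizer(text, max_len=100):
    # Split on sentence boundaries, then hard-wrap any token that
    # still exceeds the maximum length.
    tokens = []
    for sentence in text.split(". "):
        while len(sentence) > max_len:
            tokens.append(sentence[:max_len])
            sentence = sentence[max_len:]
        if sentence:
            tokens.append(sentence)
    return tokens

print(simple_tokenizer("First sentence. Second sentence"))
# ['First sentence', 'Second sentence']
```

A function with this signature could be passed as tokenizer_func; a tokenizer from NLTK could be wrapped the same way.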
Module reference (gtts.tokenizer)

class gtts.tokenizer.core.RegexBuilder(pattern_args, pattern_func, flags=0)

Builds a regex by passing each pattern argument through a pattern template function.

- pattern_args (iterable): string element(s), each passed to pattern_func to create a regex pattern. Each element is re.escape'd before being passed.
- pattern_func (callable): a 'template' function that takes a string (an element of pattern_args) and returns a valid regex pattern group string.
- flags: re flag(s) to compile with the regex.

Given an instance rb, its rb.regex attribute holds the resulting compiled regex.
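The inline examples for RegexBuilder appear to have been lost in extraction. The following stdlib-only sketch (`build_regex` is a hypothetical helper, not the gTTS API) mimics the documented behaviour of escaping each element, applying the template, and joining with "|":

```python
import re

def build_regex(pattern_args, pattern_func, flags=0):
    # Mimic RegexBuilder: re.escape each element, pass it through the
    # template function, then join all alternatives with "|".
    alternations = (pattern_func(re.escape(arg)) for arg in pattern_args)
    return re.compile("|".join(alternations), flags)

# Match "a", "b" or "c" followed by a period.
rx = build_regex("abc", lambda x: x + r"\.")
print(rx.pattern)  # a\.|b\.|c\.
```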
class gtts.tokenizer.core.PreProcessorRegex(search_args, search_func, repl, flags=0)

Regex-based substitution text pre-processor. Runs a series of regex substitutions (re.sub) from each regex of a gtts.tokenizer.core.RegexBuilder, with an extra repl replacement parameter.

- search_args (iterable): string element(s), each passed to search_func to create a regex pattern. Each element is re.escape'd before being passed.
- search_func (callable): a 'template' function that takes a string (an element of search_args) and returns a valid regex search pattern string.
- repl (string): the common replacement passed to the sub method for each regex. Can be a raw string (the case of a regex backreference, for example).
- flags: re flag(s) to compile with each regex.

When repl uses a backreference such as \1 (as a raw string), looking at an instance pp shows the resulting list of search/replacement pairs. See gtts.tokenizer.pre_processors for more examples.

run(text): runs the pre-processing series on text and returns the processed text.
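A stdlib-only sketch of the behaviour just described; the class name and implementation details are illustrative, not the gTTS source:

```python
import re

class PreProcessorRegexSketch:
    """Approximation of PreProcessorRegex: one compiled regex per search
    element, all sharing a common replacement."""

    def __init__(self, search_args, search_func, repl, flags=0):
        self.repl = repl
        self.regexes = [re.compile(search_func(re.escape(arg)), flags)
                        for arg in search_args]

    def run(self, text):
        # Apply each substitution in series; return the processed text.
        for regex in self.regexes:
            text = regex.sub(self.repl, text)
        return text

# Add "!" after "lorem" or "ipsum" (ignoring case) via a \1 backreference.
pp = PreProcessorRegexSketch(['lorem', 'ipsum'],
                             lambda x: "({})".format(x), r'\1!',
                             re.IGNORECASE)
print(pp.run("Lorem ipsum"))  # Lorem! ipsum!
```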
class gtts.tokenizer.core.PreProcessorSub(sub_pairs, ignore_case=True)

Simple substitution text pre-processor. Performs string-for-string substitutions using gtts.tokenizer.core.PreProcessorRegex with a default simple substitution regex.

- sub_pairs (list): a list of tuples of the style (<search str>, <replace str>).
- ignore_case (bool): ignore case during search. Defaults to True.

Looking at an instance pp, we get the resulting list of search (regex)/replacement pairs. See gtts.tokenizer.pre_processors for more examples.

run(text): runs the substitutions on text and returns the processed text.
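Similarly, a stdlib sketch of word-for-word substitution with an ignore_case switch (illustrative, not the gTTS source):

```python
import re

class PreProcessorSubSketch:
    """Approximation of PreProcessorSub: each (search, replacement) pair
    becomes a simple, optionally case-insensitive substitution regex."""

    def __init__(self, sub_pairs, ignore_case=True):
        flags = re.IGNORECASE if ignore_case else 0
        self.subs = [(re.compile(re.escape(search), flags), repl)
                     for search, repl in sub_pairs]

    def run(self, text):
        for regex, repl in self.subs:
            text = regex.sub(repl, text)
        return text

pp = PreProcessorSubSketch([('Mac', 'PC'), ('Firefox', 'Chrome')])
print(pp.run("mac with firefox"))  # PC with Chrome
```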
class gtts.tokenizer.core.Tokenizer(regex_funcs, flags=re.IGNORECASE)

An extensible but simple generic rule-based tokenizer. It takes a list of functions (tokenizer cases) returning regex objects and joins them by "|" (regex alternation 'or') to create a single regex to use with the standard re.split() function.

regex_funcs is a list of any function that can return a regex object (from re.compile()), such as a gtts.tokenizer.core.RegexBuilder instance (and its regex attribute). See the gtts.tokenizer.tokenizer_cases module for examples.

- regex_funcs (list): list of functions returning compiled regex objects. Each function's pattern will be joined into a single pattern and compiled.
- flags: re flag(s) to compile with the final regex. Defaults to re.IGNORECASE.

Note: when the regex objects obtained from regex_funcs are joined, their individual re flags are ignored in favour of flags.

Raises TypeError when an element of regex_funcs is not a function, or is a function that does not return a compiled regex object.

Warning: joined regex patterns can easily interfere with one another in unexpected ways. It is recommended that each tokenizer case operate on distinct or non-overlapping characters/sets of characters. (For example, a tokenizer case for the period (".") should also handle not matching/cutting on decimals, instead of making that a separate tokenizer case.)

Looking at case1().pattern and case2().pattern we get each case's pattern; looking at a Tokenizer instance t we get them combined.

run(text): tokenizes text, returning a list of strings (tokens) split according to the tokenizer cases.
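The case1/case2 example code seems to have been stripped during extraction. This stdlib sketch (case1 and case2 are hypothetical tokenizer cases, and tokenize is a stand-in for Tokenizer.run) shows the joining-with-"|" behaviour, including a period case that leaves decimals intact, as the warning above recommends:

```python
import re

def case1():
    # Split on "." only when followed by whitespace, so decimals such
    # as "10.5" are left intact.
    return re.compile(r"\.(?=\s)")

def case2():
    # Split on "!".
    return re.compile(r"!")

def tokenize(text, regex_funcs, flags=re.IGNORECASE):
    # Sketch of Tokenizer: join every case's pattern with "|" (regex
    # alternation) into one pattern, then split with the shared flags.
    pattern = "|".join(func().pattern for func in regex_funcs)
    return re.split(pattern, text, flags=flags)

print(tokenize("Hello! It is 10.5 degrees. Bye", [case1, case2]))
# ['Hello', ' It is 10.5 degrees', ' Bye']
```

Note how the individual cases' own flags play no role: only the flags passed at join time apply, matching the Note above.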
symbols.ABBREVIATIONS = ['dr', 'jr', 'mr', 'mrs', 'ms', 'msgr', 'prof', 'sr', 'st']

symbols.SUB_PAIRS = [('Esq.', 'Esquire')]

symbols.ALL_PUNC = '?!？！.,¡()[]¿…‥،;:—。，、：\n'

symbols.TONE_MARKS = '?!？！'