public class BertFullTokenizer extends SimpleTokenizer
Runs basic preprocessors to clean the input text, then runs WordpieceTokenizer to split the cleaned text into word pieces.
Reference implementation: Google Research Bert Tokenizer
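
The two stages described above (cleaning the text, then splitting words into word pieces) can be illustrated with a minimal, self-contained sketch. It is not the code of BertFullTokenizer or WordpieceTokenizer: the class name WordPieceSketch, the plain Set used as a vocabulary, the punctuation handling, and the "[UNK]" token are assumptions made for illustration, following the greedy longest-match-first scheme commonly used for BERT word pieces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not the library implementation. */
final class WordPieceSketch {

    // Assumed placeholder for words that cannot be built from known pieces.
    private static final String UNKNOWN = "[UNK]";

    private final Set<String> vocabulary; // hypothetical vocabulary of known pieces
    private final boolean lowerCase;

    WordPieceSketch(Set<String> vocabulary, boolean lowerCase) {
        this.vocabulary = vocabulary;
        this.lowerCase = lowerCase;
    }

    /** Cleans the text, splits it into words, then splits each word into pieces. */
    List<String> tokenize(String input) {
        String text = lowerCase ? input.toLowerCase(Locale.ROOT) : input;
        // Surround punctuation with spaces so each mark becomes its own word.
        text = text.replaceAll("(\\p{Punct})", " $1 ");
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word));
            }
        }
        return tokens;
    }

    /** Greedy longest-match-first split; non-initial pieces carry a "##" prefix. */
    private List<String> splitWord(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start) {
                String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
                if (vocabulary.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches: the whole word maps to the unknown token.
                return new ArrayList<>(Arrays.asList(UNKNOWN));
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing "un", "##aff" and "##able", the word "unaffable" is split into those three pieces. The lowerCase flag matters because a lowercase-only vocabulary will not match capitalized input unless the text is lowercased first.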
| Constructor and Description |
|---|
| BertFullTokenizer(SimpleVocabulary vocabulary, boolean lowerCase): Creates an instance of BertFullTokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| static java.util.List<TextProcessor> | getPreprocessors(boolean lowerCase): Get a list of TextProcessors to process input text for Bert models. |
| SimpleVocabulary | getVocabulary(): Returns the SimpleVocabulary used for tokenization. |
| java.util.List<java.lang.String> | tokenize(java.lang.String input): Breaks down the given sentence into a list of tokens that can be represented by embeddings. |
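Methods inherited from class SimpleTokenizer: buildSentence
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: preprocess

public BertFullTokenizer(SimpleVocabulary vocabulary, boolean lowerCase)
Creates an instance of BertFullTokenizer.
Parameters:
vocabulary - the BERT vocabulary
lowerCase - whether to convert tokens to lowercase

public SimpleVocabulary getVocabulary()
Returns the SimpleVocabulary used for tokenization.
Returns:
the SimpleVocabulary used for tokenization

public java.util.List<java.lang.String> tokenize(java.lang.String input)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by:
tokenize in interface Tokenizer
Overrides:
tokenize in class SimpleTokenizer
Parameters:
input - the sentence to tokenize
Returns:
a List of tokens

public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
Get a list of TextProcessors to process input text for Bert models.
Parameters:
lowerCase - whether to convert input to lowercase
Returns:
a list of TextProcessors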