public interface WordTokenizer
| Modifier and Type | Method and Description |
|---|---|
| void | addWordToSentence(java.util.List<java.lang.String> sentence, java.lang.String word): Add word to list of words in sentence. |
| void | close(): Close down the word tokenizer. |
| java.util.List<java.lang.String> | extractWords(java.lang.String text): Break text into word tokens. |
| int[] | findWordOffsets(java.lang.String sentenceText, java.util.List<?> words): Find starting offsets of words in a sentence. |
| PreTokenizer | getPreTokenizer(): Get the preTokenizer. |
| java.lang.String | preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList): Preprocess a word token. |
| void | setAbbreviations(Abbreviations abbreviations): Set abbreviations. |
| void | setAposTokens(AposTokens aposTokens): Set apostrophe tokens. |
| void | setPreTokenizer(PreTokenizer preTokenizer): Set the preTokenizer. |

PreTokenizer getPreTokenizer()

void setPreTokenizer(PreTokenizer preTokenizer)
preTokenizer - The preTokenizer.

void setAbbreviations(Abbreviations abbreviations)
abbreviations - Abbreviations.

void setAposTokens(AposTokens aposTokens)
aposTokens - Apostrophe tokens.

void addWordToSentence(java.util.List<java.lang.String> sentence, java.lang.String word)
sentence - Result sentence.
word - Word to add.

java.util.List<java.lang.String> extractWords(java.lang.String text)
text - Text to break into word tokens.
Word tokens may be words, numbers, punctuation, etc.
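
To make the idea of word tokens concrete, here is a minimal sketch of how text could be broken into words, numbers, and punctuation. The class name SimpleWordExtractor and the regular expression are assumptions made for illustration; they are not part of this interface, and the sketch does not use the abbreviations, apostrophe tokens, or preTokenizer that an actual implementation can be configured with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the documented API.
final class SimpleWordExtractor {
    // Assumed token shapes: a run of letters (optionally with internal
    // apostrophes), a run of digits, or any single non-whitespace character.
    private static final Pattern TOKEN =
        Pattern.compile("\\p{L}+(?:'\\p{L}+)*|\\p{N}+|\\S");

    // Break text into word tokens, in order of appearance.
    static List<String> extractWords(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("Don't panic: it costs 42 dollars."));
    }
}
```

Run on its own, this prints [Don't, panic, :, it, costs, 42, dollars, .], so the contraction stays one token while the colon and the final period become separate tokens.
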

int[] findWordOffsets(java.lang.String sentenceText, java.util.List<?> words)
sentenceText - Text from which tokens were extracted.
words - List of words extracted from sentence text.
N.B. If the words aren't from the specified sentence text, the resulting offsets will be meaningless.

java.lang.String preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList)
token - Token to preprocess.
tokenList - List of previous tokens already issued.

void close()
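
The offset lookup described for findWordOffsets can be sketched as a left-to-right scan that searches for each word starting just past the previous match. This is only one plausible approach, not necessarily how any implementation of this interface works. The class name WordOffsetFinder and the choice to record -1 for a word that cannot be found are assumptions; the interface itself only says the offsets are meaningless in that case.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the documented API.
final class WordOffsetFinder {
    // Find the starting offset of each word in sentenceText, assuming the
    // words appear in the text in the same order as in the list.
    static int[] findWordOffsets(String sentenceText, List<?> words) {
        int[] offsets = new int[words.size()];
        int cursor = 0; // position just past the previous match
        for (int i = 0; i < words.size(); i++) {
            String word = String.valueOf(words.get(i));
            int found = sentenceText.indexOf(word, cursor);
            offsets[i] = found; // assumption: -1 marks a word that was not found
            if (found >= 0) {
                cursor = found + word.length();
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String sentence = "The cat saw the other cat.";
        List<String> words =
            Arrays.asList("The", "cat", "saw", "the", "other", "cat", ".");
        System.out.println(Arrays.toString(findWordOffsets(sentence, words)));
    }
}
```

Run on its own, this prints [0, 4, 8, 12, 16, 22, 25]. Advancing the cursor after each match is what lets the second "cat" resolve to offset 22 instead of repeating offset 4.
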