public class ICU4JBreakIteratorWordTokenizer extends AbstractWordTokenizer implements WordTokenizer, CanTokenizeWhitespace, CanSplitAroundPeriods
| Modifier and Type | Field and Description |
|---|---|
| protected java.util.Locale | locale: Locale to use for tokenization. |
| protected boolean | mergeWhitespaceTokens: Merge whitespace tokens. |
| protected boolean | splitAroundPeriods: Check for potential splitting of tokens around periods. |
| protected boolean | storeWhitespaceTokens: Store whitespace tokens. |
| protected java.lang.String | wordBreakRulesFileName: Word break rules template file. |
| protected com.ibm.icu.text.BreakIterator | wordIterator: The word-based break iterator. |

Inherited fields: abbreviations, aposTokens, apostropheCanBeQuote, coalesceAsterisks, coalesceHyphens, contractions, contractionsURL, hyphensMatcher, hyphensPattern, logger, preTokenizer

| Constructor and Description |
|---|
| ICU4JBreakIteratorWordTokenizer(): Create a word tokenizer that uses the ICU4J word break iterator. |
| ICU4JBreakIteratorWordTokenizer(java.util.Locale locale): Create a word tokenizer that uses the ICU4J word break iterator for the given locale. |

| Modifier and Type | Method and Description |
|---|---|
| protected void | createWordIterator(): Create the word-based break iterator. |
| java.util.List<java.lang.String> | extractWords(java.lang.String text): Break text into word tokens. |
| boolean | getMergeWhitespaceTokens(): Get the merge-whitespace-tokens setting. |
| boolean | getSplitAroundPeriods(): Get the split-around-periods setting. |
| boolean | getStoreWhitespaceTokens(): Get the store-whitespace-tokens setting. |
| void | setMergeWhitespaceTokens(boolean mergeWhitespaceTokens): Set the merge-whitespace-tokens setting. |
| void | setSplitAroundPeriods(boolean splitAroundPeriods): Set the split-around-periods setting. |
| void | setStoreWhitespaceTokens(boolean storeWhitespaceTokens): Set the store-whitespace-tokens setting. |
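
The whitespace settings are documented only as "store whitespace tokens" and "merge whitespace tokens", so their exact behaviour is not stated here. As a rough illustration, the sketch below assumes that merging means coalescing each run of adjacent whitespace-only tokens into a single token. The class `WhitespaceMergeSketch` and its methods are hypothetical and are not part of this API.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceMergeSketch {

    /**
     * Coalesces each run of adjacent whitespace-only tokens into one token.
     * ASSUMPTION: this is one plausible reading of "merge whitespace tokens";
     * the documented class may behave differently.
     */
    public static List<String> mergeWhitespaceTokens(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String token : tokens) {
            if (isWhitespaceToken(token)) {
                // Accumulate the current whitespace run.
                run.append(token);
            } else {
                // A non-whitespace token ends any pending whitespace run.
                if (run.length() > 0) {
                    merged.add(run.toString());
                    run.setLength(0);
                }
                merged.add(token);
            }
        }
        // Flush a whitespace run that ends the token list.
        if (run.length() > 0) {
            merged.add(run.toString());
        }
        return merged;
    }

    /** True if the token is non-empty and consists only of whitespace characters. */
    static boolean isWhitespaceToken(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isWhitespace);
    }
}
```

Under this assumption, merging only has a visible effect when whitespace tokens are stored in the first place, which may be why the two settings are exposed together through CanTokenizeWhitespace.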

Inherited methods, grouped by declaring type:

addWordToSentence, findWordOffsets, getLogger, getPreTokenizer, isClosingQuote, isLetterOrSingleQuote, isMultipleHyphens, isSingleOpeningQuote, loadContractions, preprocessToken, setAbbreviations, setAposTokens, setLogger, setPreTokenizer, splitToken

close

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

addWordToSentence, close, findWordOffsets, getPreTokenizer, preprocessToken, setAbbreviations, setAposTokens, setPreTokenizer

close

protected java.util.Locale locale
protected boolean storeWhitespaceTokens
protected boolean mergeWhitespaceTokens
protected boolean splitAroundPeriods
protected com.ibm.icu.text.BreakIterator wordIterator
protected java.lang.String wordBreakRulesFileName

public ICU4JBreakIteratorWordTokenizer()

public ICU4JBreakIteratorWordTokenizer(java.util.Locale locale)
locale - Locale to use for tokenization.

public boolean getStoreWhitespaceTokens()
getStoreWhitespaceTokens in interface CanTokenizeWhitespace

public void setStoreWhitespaceTokens(boolean storeWhitespaceTokens)
setStoreWhitespaceTokens in interface CanTokenizeWhitespace

public boolean getMergeWhitespaceTokens()
getMergeWhitespaceTokens in interface CanTokenizeWhitespace

public void setMergeWhitespaceTokens(boolean mergeWhitespaceTokens)
setMergeWhitespaceTokens in interface CanTokenizeWhitespace

public boolean getSplitAroundPeriods()
getSplitAroundPeriods in interface CanSplitAroundPeriods

public void setSplitAroundPeriods(boolean splitAroundPeriods)
setSplitAroundPeriods in interface CanSplitAroundPeriods

protected void createWordIterator()

public java.util.List<java.lang.String> extractWords(java.lang.String text)
extractWords in interface WordTokenizer
extractWords in class AbstractWordTokenizer
text - Text to break into word tokens.
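
To make the idea of breaking text with a word break iterator concrete, here is a minimal sketch of the general technique. It is not this class's implementation. To stay dependency-free it uses the JDK's `java.text.BreakIterator` as a stand-in for `com.ibm.icu.text.BreakIterator`; the two offer similar `getWordInstance`, `setText`, `first`, and `next` calls, though their boundary rules can differ. The class name `WordBreakSketch` and the `storeWhitespace` parameter are hypothetical, and dropping whitespace-only tokens when that flag is false is an assumption about what "store whitespace tokens" means.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /**
     * Splits text at the word boundaries reported by a locale-aware break iterator.
     * ASSUMPTION: whitespace-only tokens are kept only when storeWhitespace is true.
     */
    public static List<String> extractWords(String text, Locale locale, boolean storeWhitespace) {
        // JDK stand-in for the ICU4J word break iterator.
        BreakIterator wordIterator = BreakIterator.getWordInstance(locale);
        wordIterator.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = wordIterator.first();
        // Each pair of consecutive boundaries delimits one token.
        for (int end = wordIterator.next();
             end != BreakIterator.DONE;
             start = end, end = wordIterator.next()) {
            String token = text.substring(start, end);
            boolean whitespaceOnly = token.chars().allMatch(Character::isWhitespace);
            if (storeWhitespace || !whitespaceOnly) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

A call such as `WordBreakSketch.extractWords("Hello, world.", Locale.ENGLISH, false)` returns the words and punctuation marks as separate tokens. With `storeWhitespace` set to true, concatenating the returned tokens reproduces the input text, because every span between consecutive boundaries is kept.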