public abstract class AbstractPreTokenizer extends IsCloseableObject implements PreTokenizer, UsesLogger
| Modifier and Type | Field and Description |
|---|---|
| protected static java.lang.String | alwaysSeparators: Pattern to match characters which are always separators. |
| protected static PatternReplacer | alwaysSeparatorsReplacer: Always-separators replacer pattern. |
| protected static java.lang.String | asterisks: Pattern to match one or more asterisks. |
| protected static java.lang.String | commaSeparator: Pattern to match comma as a separator. |
| protected static PatternReplacer | commaSeparatorReplacer: Comma separator replacer pattern. |
| protected static java.lang.String | hyphens: Pattern to match two or more hyphens in a row. |
| protected Logger | logger: Logger used for output. |
| protected static java.lang.String | periods: Pattern to match three or more periods. |
| Constructor and Description |
|---|
| AbstractPreTokenizer(): Create a preTokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| Logger | getLogger(): Get the logger. |
| java.lang.String | pretokenize(java.lang.String line): Prepare text for tokenization. |
| void | setLogger(Logger logger): Set the logger. |
Methods inherited from class IsCloseableObject: close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
protected static final java.lang.String periods
protected static final java.lang.String asterisks
protected static final java.lang.String hyphens
protected static final java.lang.String commaSeparator
protected Logger logger
protected static final java.lang.String alwaysSeparators
Unicode ● (BLACKCIRCLE) is the dot character which marks character lacunae. This is not a token separator. Neither is Unicode • (SOLIDCIRCLE), which was used in the old EEBO format TCP files to mark character lacunae.
Unicode ‑ (U+2011), the non-breaking hyphen, is not treated as a token separator.
Unicode ° (DEGREES_MARK) is the degrees quote symbol. Unicode ′ (MINUTES_MARK) is the minutes quote symbol. Unicode ″ (SECONDS_MARK) is the seconds quote symbol. These are not token separators.
Unicode ‘ (LSQUOTE) is the left single curly quote. Unicode ’ (RSQUOTE) is the right single curly quote. These may or may not be token separators; it is up to the word tokenizer to decide.
Unicode “ (LDQUOTE) is the left double curly quote. Unicode ” (RDQUOTE) is the right double curly quote. These are token separators.
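The distinction drawn above can be illustrated with a small, self-contained sketch. It does not use the real `alwaysSeparators` value, which this page does not show. The class name `SeparatorSketch`, the pattern, and the choice to pad separators with spaces are all assumptions made for illustration. Only the curly double quotes are treated as separators here; the bullet and the non-breaking hyphen are left untouched.

```java
import java.util.regex.Pattern;

public class SeparatorSketch {
    // Hypothetical stand-in for alwaysSeparators: it matches only the
    // left and right curly double quotes, which the description names
    // as token separators. The real pattern is not shown in this page.
    static final Pattern ALWAYS_SEPARATORS =
        Pattern.compile("([\u201C\u201D])");

    // Assumption: a separator is isolated by surrounding it with spaces.
    static String padSeparators(String line) {
        return ALWAYS_SEPARATORS.matcher(line).replaceAll(" $1 ");
    }

    public static void main(String[] args) {
        // Input contains curly double quotes, a non-breaking hyphen
        // (U+2011) and a bullet (U+2022). Only the quotes get padded.
        String input = "\u201CWell\u2011met\u201D \u2022 ok";
        System.out.println(padSeparators(input));
    }
}
```

Characters outside the pattern pass through unchanged, which is how the non-separator characters listed above would behave under this assumption.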
protected static PatternReplacer alwaysSeparatorsReplacer
protected static PatternReplacer commaSeparatorReplacer
public Logger getLogger()
Get the logger.
Specified by: getLogger in interface UsesLogger
public void setLogger(Logger logger)
Set the logger.
Specified by: setLogger in interface UsesLogger
Parameters: logger - The logger.
public java.lang.String pretokenize(java.lang.String line)
Prepare text for tokenization.
Specified by: pretokenize in interface PreTokenizer
Parameters: line - The text to prepare for tokenization.
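As a rough illustration of what a `pretokenize` implementation might do with the `periods`, `hyphens`, and `asterisks` patterns, here is a hedged, self-contained sketch. The nested `PreTokenizer` interface is a local stand-in assumed to have a single method, not the library's own type. The regular expressions are written from the field descriptions (three or more periods, two or more hyphens, one or more asterisks) rather than copied from the class. Padding each run with spaces and collapsing whitespace are also assumptions.

```java
import java.util.regex.Pattern;

public class PreTokenizeSketch {
    // Local stand-in for the PreTokenizer interface (assumed single method).
    interface PreTokenizer {
        String pretokenize(String line);
    }

    // Hypothetical regexes derived from the field descriptions only.
    static final Pattern PERIODS = Pattern.compile("(\\.{3,})");
    static final Pattern HYPHENS = Pattern.compile("(-{2,})");
    static final Pattern ASTERISKS = Pattern.compile("(\\*+)");

    // Assumption: each matched run is set off by spaces so that a later
    // word tokenizer sees it as its own token.
    static final PreTokenizer SKETCH = line -> {
        String result = PERIODS.matcher(line).replaceAll(" $1 ");
        result = HYPHENS.matcher(result).replaceAll(" $1 ");
        result = ASTERISKS.matcher(result).replaceAll(" $1 ");
        // Collapse the extra whitespace introduced by the padding.
        return result.trim().replaceAll("\\s+", " ");
    };

    public static void main(String[] args) {
        // prints "Wait ... what -- no ***"
        System.out.println(SKETCH.pretokenize("Wait...what--no***"));
    }
}
```

A single hyphen or a single period does not match these patterns, so a line such as `well-met.` comes back unchanged in this sketch.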