Revision 1 as of 2004-04-08 15:29:41
Back to ComputerTerms, InformationRetrieval
Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. Tokens are groups of characters with collective significance. Lexical analysis is the first stage of both automated indexing and query processing.
Issues:
- Digits: Numbers are not usually allowed as tokens, but we might allow words to contain digits, as long as they don't start with them. There are exceptions, of course!
- Hyphens: Whichever rule we choose for splitting or keeping hyphenated words, consistency is important, but there will be problems nonetheless.
- Punctuation: Characters such as ",._?`~" may be an integral part of a word. How we deal with them depends on the kind of information we are working with.
- Case: Usually just make everything lower case!
- Delimiters: Choosing delimiters is also very important. Usually any white space and any unrecognized punctuation or control characters are delimiters.
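One way to make these choices concrete is a small regular-expression tokenizer. This is only a sketch under an assumed policy, not the policy of any particular system: everything is lowercased, a token must start with a letter and may then contain letters or digits, and hyphens and all other characters are treated as delimiters. The names `TOKEN_RE` and `tokenize` are made up for this illustration.

```python
import re

# Assumed policy (illustrative only):
#   - case: fold everything to lower case
#   - digits: allowed inside a token, but not as its first character
#   - hyphens and other punctuation: always treated as delimiters
# The lookbehind stops a match from starting in the middle of a
# letter/digit run, so "3com" yields no token instead of "com".
TOKEN_RE = re.compile(r"(?<![a-z0-9])[a-z][a-z0-9]*")

def tokenize(text):
    """Return the list of tokens in text under the assumed policy."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("State-of-the-art B2B search, 2004 edition!"))
```

Under this policy "State-of-the-art" is split into four tokens, "B2B" is kept as "b2b", and "2004" is dropped. A collection where numbers or hyphenated terms carry meaning would need a different pattern.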
Implementation:
- Use a lexical analyzer generator like lex: This is the best approach when the lexical analyzer is complicated.
- Write a lexical analyzer by hand, ad hoc: The worst solution. It will likely have subtle errors and may not be efficient.
- Write a lexical analyzer by hand as a finite state machine: A good approach, and the one our book chose to implement.
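The finite state machine option can be sketched as follows. This is not the book's implementation, only a minimal hand-written machine under the same assumed policy as before (lowercase everything, tokens start with a letter, digits may follow, anything else is a delimiter). The state names `SKIP`, `WORD`, and `REJECT` and the function `fsm_tokenize` are hypothetical.

```python
def fsm_tokenize(text):
    """Hand-written lexer as a three-state machine (illustrative sketch)."""
    SKIP, WORD, REJECT = 0, 1, 2   # between tokens / inside a token / inside a digit-initial run
    state = SKIP
    current = []                   # characters of the token being built
    tokens = []
    for ch in text.lower():
        is_letter = "a" <= ch <= "z"
        is_digit = "0" <= ch <= "9"
        if state == SKIP:
            if is_letter:
                state = WORD
                current = [ch]
            elif is_digit:
                state = REJECT     # run starts with a digit: discard it
        elif state == WORD:
            if is_letter or is_digit:
                current.append(ch)
            else:                  # delimiter ends the token
                tokens.append("".join(current))
                state = SKIP
        else:                      # REJECT: wait for the next delimiter
            if not (is_letter or is_digit):
                state = SKIP
    if state == WORD:              # flush a token that runs to end of input
        tokens.append("".join(current))
    return tokens

print(fsm_tokenize("Write a lexer, 3com style: B2B!"))
```

Each character is examined exactly once and every transition is explicit, which is what makes this style easier to check than ad hoc string handling.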