Back to ComputerTerms, InformationRetrieval

See Also: StopWords

Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. Tokens are groups of characters with collective significance. This is the first stage of both automated indexing and query processing.

Issues:

  • Digits: Numbers are not usually allowed as tokens, but we might allow words to contain digits, as long as they don't start with them. There are exceptions of course!
  • Hyphens: Consistency is important, but there will be problems nonetheless.
  • Punctuation: ",._?`~" and other punctuation may be an integral part of a word. How we deal with this depends on the kind of information we are indexing!
  • Case: Usually just make everything lower case!
  • Choosing delimiters is also very important: usually any white space and unrecognized punctuation or control characters are delimiters.
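
To make these choices concrete, here is a minimal sketch of one possible policy, not a definitive tokenizer. The name `tokenize` and the exact rules are assumptions for illustration: ASCII text only, everything lower-cased, a token must start with a letter, may contain digits, and may keep internal hyphens; anything else is treated as a delimiter.

```python
import re

# Hypothetical policy (an assumption, not a standard):
#   - a token starts with a letter and may contain digits afterwards
#   - hyphens are kept only between two alphanumeric runs
#   - every other character acts as a delimiter
# The lookbehind stops us from picking up the tail of a run that
# started with a digit (e.g. the "nd" in "42nd").
TOKEN_RE = re.compile(r"(?<![a-z0-9])[a-z][a-z0-9]*(?:-[a-z0-9]+)*")

def tokenize(text):
    """Return the lower-cased tokens found in text under the policy above."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("The B-52 flew 42 miles, non-stop!"))
# ['the', 'b-52', 'flew', 'miles', 'non-stop']
```

A different collection might need different answers, for example keeping pure numbers or splitting on hyphens, which only means changing the pattern.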

Implementation:

  1. Use a lexical analyzer generator like lex: This is the best approach when the lexical analyzer is complicated.
  2. Write a lexical analyzer by hand, ad hoc: The worst solution; it will likely have subtle errors and may not be efficient.
  3. Write a lexical analyzer by hand as a finite state machine: Must be a good way, because this is the one our book chose to implement.
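
The third option can be sketched as a small hand-written finite state machine. This is only an illustration under stated assumptions, not the book's implementation: the function name `fsm_tokenize`, the three states, and the rules (ASCII only, lower-case everything, skip any run that starts with a digit, treat hyphens and all other characters as delimiters) are all hypothetical choices.

```python
def fsm_tokenize(text):
    """Tokenize text with an explicit three-state machine (illustrative sketch)."""
    OUTSIDE, IN_WORD, SKIP = range(3)   # hypothetical state names
    state = OUTSIDE
    current = []                        # characters of the token being built
    tokens = []

    # A trailing space guarantees that the last token gets flushed.
    for ch in text.lower() + " ":
        is_letter = "a" <= ch <= "z"
        is_digit = "0" <= ch <= "9"

        if state == OUTSIDE:
            if is_letter:               # a letter starts a new token
                current.append(ch)
                state = IN_WORD
            elif is_digit:              # a digit starts a run we discard
                state = SKIP
            # any other character is a delimiter: stay OUTSIDE
        elif state == IN_WORD:
            if is_letter or is_digit:   # digits are fine inside a word
                current.append(ch)
            else:                       # delimiter ends the token
                tokens.append("".join(current))
                current = []
                state = OUTSIDE
        else:                           # SKIP: drop characters until a delimiter
            if not (is_letter or is_digit):
                state = OUTSIDE
    return tokens

print(fsm_tokenize("Route 66 and mp3 files, 2nd try"))
# ['route', 'and', 'mp3', 'files', 'try']
```

Because every transition is written out, each rule from the issues list maps to exactly one branch, which is what makes this style easier to check than an ad hoc scanner.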

LexicalAnalysis (last edited 2004-04-08 16:09:01 by yakko)