Revision 1 as of 2004-03-26 21:23:09
Back to ComputerTerms
Topic: InformationRetrieval
= How to create an inverted file representation =
Step 1: Parse the documents to extract tokens, and save each token with its document ID (duplicates allowed). The resulting file has two columns: Term, Document Number.
Step 2: Sort the file alphabetically by term.
Step 3: Aggregate the duplicate entries for each document, recording how many times each term occurs in it. The file now has three columns: Term, Document Number, Frequency.
What you now have is an inverted file implementation.
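The three steps can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `build_inverted_file`, the `\w+` tokenizer, the case-insensitive sort, and the two sample documents are all assumptions made for the example.

```python
import re
from itertools import groupby


def build_inverted_file(documents):
    """Return sorted (term, doc_id, frequency) rows for a {doc_id: text} mapping."""
    # Step 1: parse each document into (term, doc_id) pairs, duplicates allowed.
    # Splitting on \w+ is an assumed, very simple tokenizer.
    pairs = [
        (token, doc_id)
        for doc_id, text in documents.items()
        for token in re.findall(r"\w+", text)
    ]
    # Step 2: sort alphabetically by term (ignoring case), then by document.
    pairs.sort(key=lambda p: (p[0].lower(), p[0], p[1]))
    # Step 3: collapse identical (term, doc_id) pairs into one row with a count.
    return [
        (term, doc_id, len(list(group)))
        for (term, doc_id), group in groupby(pairs)
    ]


# Hypothetical sample documents, not taken from the page.
docs = {1: "CS UNL CS", 2: "Lincoln CS"}
rows = build_inverted_file(docs)
for row in rows:
    print(row)
```

Because the pairs are sorted before aggregation, all duplicates of a term within a document sit next to each other, so a single pass with `groupby` is enough to count them.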
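The inverted file can be split into a Lexicon (Dictionary), which lists each distinct term once, and a Postings file, which holds the document numbers and frequencies for each term.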
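One possible way to do the split is sketched below. The layout is an assumption: the lexicon maps each term to an offset and a count into a flat postings list, and the names `split_inverted_file` and `lookup` are hypothetical. The sample rows reuse part of the example on this page, reading the numbers in parentheses as frequencies.

```python
def split_inverted_file(rows):
    """Split sorted (term, doc_id, frequency) rows into a lexicon and postings."""
    lexicon = {}   # term -> (offset into postings, number of postings)
    postings = []  # flat list of (doc_id, frequency) entries
    for term, doc_id, freq in rows:
        if term not in lexicon:
            # First row for this term: its postings start at the current end.
            lexicon[term] = (len(postings), 0)
        offset, count = lexicon[term]
        lexicon[term] = (offset, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (doc_id, frequency) entries for a term, or [] if unknown."""
    offset, count = lexicon.get(term, (0, 0))
    return postings[offset:offset + count]


# Rows must already be sorted by term so each term's entries are contiguous.
rows = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
]
lexicon, postings = split_inverted_file(rows)
print(lexicon["CS"])                    # (1, 4)
print(lookup("CS", lexicon, postings))  # [(1, 2), (2, 4), (3, 3), (4, 1)]
```

Keeping the lexicon small and separate means a query only has to read the slice of the postings list that belongs to the terms it mentions.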
Example
||Document||Keywords||
||1||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||Lincoln(3), CS(4), Computer(6)||
||3||CS(3)||
||4||university(2), UNL(2), CS(1)||
||5||Ferguson(1)||

||Term||Document Number||
||Computer||2||
||CS||1||
||CS||2||
||CS||3||
||CS||4||
||Ferguson||1||
||Ferguson||5||
||Lincoln||1||
||Lincoln||2||
||university||4||
||UNL||1||
||UNL||4||
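As a quick check, the sorted term list above can be reproduced from the keyword table with a few lines of Python. This is only a sketch: the `keywords` dictionary transcribes the example, treating the numbers in parentheses as frequencies (an assumption), and the sort ignores case, which is what the order of the example appears to imply.

```python
# Keyword table from the example: document number -> {term: frequency}.
keywords = {
    1: {"CS": 2, "UNL": 3, "Ferguson": 5, "Lincoln": 2},
    2: {"Lincoln": 3, "CS": 4, "Computer": 6},
    3: {"CS": 3},
    4: {"university": 2, "UNL": 2, "CS": 1},
    5: {"Ferguson": 1},
}

# One (term, document number) entry per keyword, sorted by term
# (case-insensitively) and then by document number.
entries = sorted(
    ((term, doc_id) for doc_id, terms in keywords.items() for term in terms),
    key=lambda entry: (entry[0].lower(), entry[1]),
)

for term, doc_id in entries:
    print(f"{term}\t{doc_id}")
```

A plain case-sensitive sort would instead place "CS" before "Computer" and "UNL" before "university", because uppercase letters order before lowercase ones in ASCII.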