Back to ComputerTerms
Topic: InformationRetrieval
How to create an inverted file representation
Step 1: Parse the documents to extract tokens, and save each token together with its Document ID (duplicates allowed). The resulting file has two columns: Term, Document Number. Note: we may optionally record each token's location within the document as well, which is needed if we are doing any proximity tests.
Step 2: Sort the file alphabetically by term (and by Document Number within each term), so that all entries for the same term are adjacent.
Step 3: Aggregate the duplicates: collapse repeated entries for the same term in the same document into a single row with a count. The file now has three columns: Term, Document Number, Frequency.
What you now have is an inverted file implementation.
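The three steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `build_inverted_file`, the whitespace/word-character tokenizer, and the case-insensitive sort order are all assumptions made for the example.

```python
import re
from itertools import groupby


def build_inverted_file(documents):
    """Return a list of (term, document number, frequency) rows.

    `documents` maps a document number to its text (hypothetical input shape).
    """
    # Step 1: extract tokens and pair each with its document number.
    # Duplicates are kept on purpose. The regex tokenizer is an assumption.
    pairs = [
        (token, doc_id)
        for doc_id, text in documents.items()
        for token in re.findall(r"\w+", text)
    ]

    # Step 2: sort alphabetically by term, ignoring case, then by document
    # number. The exact term is a tie-breaker so identical rows end up adjacent.
    pairs.sort(key=lambda p: (p[0].lower(), p[0], p[1]))

    # Step 3: collapse each run of identical (term, document) pairs
    # into one row carrying the number of occurrences.
    return [
        (term, doc_id, len(list(run)))
        for (term, doc_id), run in groupby(pairs)
    ]


docs = {1: "CS UNL CS Lincoln", 2: "Lincoln CS Computer Computer"}
for row in build_inverted_file(docs):
    print(row)
```

Sorting before aggregating means the counting pass only ever compares neighbouring rows, which is why the steps are ordered this way.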
This can be split into a Lexicon (Dictionary) and a Postings file.
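One possible way to perform the split is sketched below in Python. The layout is an assumption for illustration only: the lexicon here maps each term to an offset and a count into a flat postings list, and the PostingsFile page may define the split differently. The names `split_inverted_file` and `lookup` are hypothetical.

```python
def split_inverted_file(rows):
    """Split sorted (term, document number, frequency) rows in two.

    Returns (lexicon, postings), where the lexicon maps
    term -> (offset into postings, number of postings) and postings is a
    flat list of (document number, frequency). This layout is an assumption.
    """
    lexicon = {}
    postings = []
    for term, doc_id, freq in rows:
        if term not in lexicon:
            # First row for this term: remember where its postings start.
            lexicon[term] = (len(postings), 0)
        offset, count = lexicon[term]
        lexicon[term] = (offset, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (document number, frequency) entries for one term."""
    offset, count = lexicon.get(term, (0, 0))
    return postings[offset:offset + count]


# A few rows taken from the example on this page.
rows = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
]
lexicon, postings = split_inverted_file(rows)
print(lookup("CS", lexicon, postings))
```

Because the rows are already sorted by term, each term's postings occupy one contiguous slice, so the lexicon only needs a starting position and a length per term.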
Example
Document | Keywords
1 | CS(2), UNL(3), Ferguson(5), Lincoln(2)
2 | Lincoln(3), CS(4), Computer(6)
3 | CS(3)
4 | university(2), UNL(2), CS(1)
5 | Ferguson(1)
Here is the inverted file:
Term | Document Number | Frequency
Computer | 2 | 6
CS | 1 | 2
CS | 2 | 4
CS | 3 | 3
CS | 4 | 1
Ferguson | 1 | 5
Ferguson | 5 | 1
Lincoln | 1 | 2
Lincoln | 2 | 3
university | 4 | 2
UNL | 1 | 3
UNL | 4 | 2
To see the split into a Lexicon and a Postings file, see: PostingsFile