Back to ComputerTerms
Topic: InformationRetrieval
= How to create an inverted file representation =
Step 1: Parse the documents to extract tokens. Save each token together with its document ID; duplicates are allowed. The resulting file has two columns: Term, Document Number.
Step 2: Sort the file alphabetically by term.
Step 3: Aggregate the duplicates for each document. The file now has three columns: Term, Document Number, Frequency.
What you now have is an inverted file implementation.
This can be split into a Lexicon (Dictionary) and a Postings file.
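The three steps above can be sketched in Python. This is only a minimal illustration, not a reference implementation: the function name `build_inverted_file`, the whitespace tokenizer, and the case-insensitive sort order are assumptions made for the example.

```python
from itertools import groupby


def build_inverted_file(documents):
    """Build (term, document number, frequency) rows.

    `documents` maps a document number to its text. Tokenizing by
    whitespace is an assumption; a real parser would do more.
    """
    # Step 1: extract tokens and save each with its document number
    # (duplicates allowed).
    pairs = [
        (token, doc_id)
        for doc_id, text in documents.items()
        for token in text.split()
    ]

    # Step 2: sort alphabetically by term (ignoring case), then by
    # document number. The exact term is a tie-breaker so identical
    # pairs always end up next to each other.
    pairs.sort(key=lambda pair: (pair[0].lower(), pair[0], pair[1]))

    # Step 3: aggregate adjacent duplicates into a frequency count.
    return [
        (term, doc_id, len(list(group)))
        for (term, doc_id), group in groupby(pairs)
    ]


# Hypothetical sample input, not taken from the example on this page.
sample = {1: "CS UNL CS UNL UNL", 2: "Lincoln CS"}
for row in build_inverted_file(sample):
    print(row)
```

Sorting before aggregating means duplicates are always adjacent, so a single pass over the sorted file is enough to count them.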
Example
||Document||Keywords||
||1||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||Lincoln(3), CS(4), Computer(6)||
||3||CS(3)||
||4||university(2), UNL(2), CS(1)||
||5||Ferguson(1)||
Here is the inverted file:
||Term||Document Number||Frequency||
||Computer||2||6||
||CS||1||2||
||CS||2||4||
||CS||3||3||
||CS||4||1||
||Ferguson||1||5||
||Ferguson||5||1||
||Lincoln||1||2||
||Lincoln||2||3||
||university||4||2||
||UNL||1||3||
||UNL||4||2||
For the split into a Lexicon and a Postings file, see PostingFile.
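As a rough illustration of that split, the sketch below separates the rows of the inverted file on this page into a lexicon and a postings list. The layout is an assumption (each lexicon entry stores the offset of the term's first posting and the number of postings); the PostingFile page may describe a different one. The names `split_inverted_file` and `lookup` are hypothetical.

```python
# Rows of the inverted file from this page: (term, document number, frequency).
ROWS = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
    ("Lincoln", 1, 2), ("Lincoln", 2, 3),
    ("university", 4, 2),
    ("UNL", 1, 3), ("UNL", 4, 2),
]


def split_inverted_file(rows):
    """Split sorted rows into a lexicon and a postings list.

    Assumes all rows for one term are adjacent, which holds once the
    file has been sorted by term.
    """
    lexicon = {}   # term -> (offset of first posting, number of postings)
    postings = []  # (document number, frequency), in file order
    for term, doc_id, freq in rows:
        if term not in lexicon:
            lexicon[term] = (len(postings), 0)
        offset, count = lexicon[term]
        lexicon[term] = (offset, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (document number, frequency) postings for a term."""
    offset, count = lexicon.get(term, (0, 0))
    return postings[offset:offset + count]


lexicon, postings = split_inverted_file(ROWS)
print(lookup("CS", lexicon, postings))
```

With this layout the term strings are stored once in the lexicon, and the postings list holds only document numbers and frequencies.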