Back to ComputerTerms
Topic: InformationRetrieval
= How to create an inverted file representation =
Step 1: Parse the documents to extract tokens, and save each token together with its document ID. Duplicates are allowed. The resulting file has two columns: Term, Document Number.
Step 2: Sort the file alphabetically by term, keeping the entries for each term in document-number order.
Step 3: Aggregate the duplicates for each document, replacing each run of identical entries with a single entry and a count. The file now has three columns: Term, Document Number, Frequency.
What you now have is an inverted file implementation.
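The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_inverted_file`, the input shape (a dict mapping document number to a list of tokens), and the case-insensitive sort are choices made here, not something the page prescribes.

```python
from itertools import groupby


def build_inverted_file(documents):
    """Build (term, doc_id, frequency) rows from {doc_id: [token, ...]}.

    Hypothetical helper: the name and input format are assumptions.
    """
    # Step 1: one (term, doc_id) pair per token occurrence; duplicates allowed.
    pairs = [(term, doc_id)
             for doc_id, tokens in documents.items()
             for term in tokens]

    # Step 2: sort alphabetically by term (ignoring case), then by document
    # number. The exact term is a tie-breaker so identical pairs stay adjacent.
    pairs.sort(key=lambda p: (p[0].lower(), p[0], p[1]))

    # Step 3: collapse each run of identical pairs into a single row
    # carrying the number of occurrences.
    return [(term, doc_id, len(list(run)))
            for (term, doc_id), run in groupby(pairs)]
```

For instance, a document 3 containing the token CS three times contributes the single row `("CS", 3, 3)`.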
This can be split into a Lexicon (Dictionary) and a Postings file.
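One possible way to do the split is sketched below. The layout is an assumption made for illustration: the lexicon maps each term to an offset and a count, and the postings file is a flat list of (document number, frequency) pairs. The name `split_inverted_file` is hypothetical, and the rows are assumed to be already grouped by term.

```python
def split_inverted_file(rows):
    """Split sorted (term, doc_id, frequency) rows into lexicon and postings.

    Assumed layout: lexicon[term] = (offset into postings, number of postings).
    """
    lexicon = {}
    postings = []
    for term, doc_id, freq in rows:
        if term not in lexicon:
            # First row for this term: its postings start at the current end.
            lexicon[term] = (len(postings), 0)
        start, count = lexicon[term]
        lexicon[term] = (start, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings
```

With this layout, the postings for a term are the slice `postings[start:start + count]`, so the lexicon stays small while the postings file holds the bulk of the data.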
Example
||Document||Keywords||
||1||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||Lincoln(3), CS(4), Computer(6)||
||3||CS(3)||
||4||university(2), UNL(2), CS(1)||
||5||Ferguson(1)||
Here is the resulting inverted file:
||Term||Document Number||Frequency||
||Computer||2||6||
||CS||1||2||
||CS||2||4||
||CS||3||3||
||CS||4||1||
||Ferguson||1||5||
||Ferguson||5||1||
||Lincoln||1||2||
||Lincoln||2||3||
||university||4||2||
||UNL||1||3||
||UNL||4||2||