← Revision 9 as of 2004-04-08 00:24:04 ⇥
Back to ComputerTerms
Topic: InformationRetrieval
How to create an inverted file representation
Step 1: Parse the documents to extract tokens, and save each token with its Document ID (duplicates allowed). The file is formatted in two columns: Term, Document Number. Note: the location of each token within the document may optionally be recorded as well, which is needed for proximity tests.
Step 2: Sort the file alphabetically by term.
Step 3: Aggregate the duplicates for each document, counting how many times each term occurs in it. The file is now formatted in three columns: Term, Document Number, Frequency.
What you now have is an inverted file implementation.
This can be split into a Lexicon (Dictionary) and a Postings file.
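The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_inverted_file`, the input shape (a dict from document number to raw text), the `\w+` tokenizer, and the lowercasing of tokens are all choices made for this sketch, not something the page prescribes.

```python
import re
from itertools import groupby


def build_inverted_file(docs):
    """Build (term, document number, frequency) rows.

    docs: dict mapping document number -> raw text (hypothetical input shape).
    """
    # Step 1: parse each document into (term, document number) records.
    # Duplicates are kept. Lowercasing is an assumption of this sketch.
    records = [
        (token.lower(), doc_id)
        for doc_id, text in docs.items()
        for token in re.findall(r"\w+", text)
    ]

    # Step 2: sort by term (ties are ordered by document number).
    records.sort()

    # Step 3: aggregate adjacent duplicates into a frequency count.
    return [
        (term, doc_id, len(list(group)))
        for (term, doc_id), group in groupby(records)
    ]


# Hypothetical two-document collection, used only to exercise the function.
sample_docs = {1: "CS UNL cs", 2: "Lincoln CS"}
sample_rows = build_inverted_file(sample_docs)
```

Because the records are sorted before aggregation, identical (term, document number) pairs end up next to each other, so a single pass with `groupby` is enough to count them.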
Example
Document | Keywords
1 | CS(2), UNL(3), Ferguson(5), Lincoln(2)
2 | Lincoln(3), CS(4), Computer(6)
3 | CS(3)
4 | university(2), UNL(2), CS(1)
5 | Ferguson(1)
Here is the inverted file:
Term | Document Number | Frequency
Computer | 2 | 6
CS | 1 | 2
CS | 2 | 4
CS | 3 | 3
CS | 4 | 1
Ferguson | 1 | 5
Ferguson | 5 | 1
Lincoln | 1 | 2
Lincoln | 2 | 3
university | 4 | 2
UNL | 1 | 3
UNL | 4 | 2
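As a check on the example, the following sketch starts from the keyword counts per document and produces the same rows in the same order. The names `keywords` and `invert` are hypothetical. Note that the ordering shown in the example implies a case-insensitive sort: a plain case-sensitive sort would put "CS" before "Computer" and "UNL" before "university".

```python
# Keyword counts per document, copied from the example above.
keywords = {
    1: {"CS": 2, "UNL": 3, "Ferguson": 5, "Lincoln": 2},
    2: {"Lincoln": 3, "CS": 4, "Computer": 6},
    3: {"CS": 3},
    4: {"university": 2, "UNL": 2, "CS": 1},
    5: {"Ferguson": 1},
}


def invert(keywords):
    """Turn per-document keyword counts into sorted (term, doc, freq) rows."""
    rows = [
        (term, doc_id, freq)
        for doc_id, counts in keywords.items()
        for term, freq in counts.items()
    ]
    # Sort by term ignoring case, then by document number.
    rows.sort(key=lambda row: (row[0].lower(), row[1]))
    return rows


inverted_file = invert(keywords)
for term, doc_id, freq in inverted_file:
    print(f"{term} | {doc_id} | {freq}")
```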
To see how this is split into a Lexicon and a Postings file, see: PostingsFile
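The PostingsFile page holds the actual layout. As a rough sketch only, one common arrangement is assumed here: the lexicon keeps one entry per term, pointing at a contiguous run of (document number, frequency) entries in the postings list. The function names and the (offset, count) tuple are hypothetical choices for this illustration.

```python
def split_inverted_file(rows):
    """Split sorted (term, document number, frequency) rows.

    Returns (lexicon, postings), where lexicon maps each term to
    (offset of its first posting, number of postings) and postings is a
    flat list of (document number, frequency) pairs. This layout is an
    assumption of the sketch. Rows must already be sorted by term.
    """
    lexicon = {}
    postings = []
    for term, doc_id, freq in rows:
        if term not in lexicon:
            # First time this term is seen: its run starts here.
            lexicon[term] = (len(postings), 0)
        start, count = lexicon[term]
        lexicon[term] = (start, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (document number, frequency) pairs for a term."""
    start, count = lexicon.get(term, (0, 0))
    return postings[start:start + count]


# The first few rows of the example inverted file.
example_rows = [("Computer", 2, 6), ("CS", 1, 2), ("CS", 2, 4), ("Ferguson", 1, 5)]
example_lexicon, example_postings = split_inverted_file(example_rows)
```

With this layout the term strings are stored once in the lexicon instead of once per row, and a lookup reads a single slice of the postings list.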