How to create an inverted file representation

Step 1: Parse the documents to extract tokens, and save each token with its Document ID (duplicates allowed). The resulting file has two columns: Term, Document Number. Note: we may optionally record each token's location within the document as well, if we plan to run any proximity tests.

Step 2: Sort the file alphabetically by term.

Step 3: Aggregate the duplicates for each document. The file now has three columns: Term, Document Number, Frequency.

What you now have is an inverted file implementation.
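The three steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name build_inverted_file, the tokenizer (runs of letters and digits, lowercased), and the two sample documents are assumptions made for the example.

```python
import re
from itertools import groupby

def build_inverted_file(documents):
    """documents maps a document number to its text (hypothetical input shape).
    Returns a list of (term, document number, frequency) rows."""
    # Step 1: parse each document into (term, document number) pairs.
    # Duplicates are kept on purpose.
    pairs = []
    for doc_id, text in documents.items():
        for token in re.findall(r"[A-Za-z0-9]+", text):
            pairs.append((token.lower(), doc_id))

    # Step 2: sort by term (and by document number within a term).
    pairs.sort()

    # Step 3: collapse identical (term, document) pairs into one row
    # whose third column is the number of occurrences.
    return [(term, doc_id, len(list(group)))
            for (term, doc_id), group in groupby(pairs)]

# Made-up sample documents, only for demonstration.
docs = {1: "CS at UNL, CS in Lincoln", 2: "Lincoln CS"}
rows = build_inverted_file(docs)
for row in rows:
    print(*row, sep="\t")
```

Each printed row has the three columns produced by Step 3: Term, Document Number, Frequency.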

This can be split into a Lexicon (Dictionary) and a Postings file.
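One possible way to do the split is sketched below. The layout is an assumption chosen for illustration (the PostingsFile page may use a different one): the lexicon maps each term to the offset of its first posting and the number of postings, and the postings list holds (document number, frequency) entries. The names split_inverted_file and lookup are hypothetical.

```python
def split_inverted_file(rows):
    """rows: (term, document number, frequency) tuples, already sorted by term."""
    lexicon = {}   # term -> (offset of first posting, number of postings)
    postings = []  # (document number, frequency), in the same order as rows
    for term, doc_id, freq in rows:
        if term not in lexicon:
            # First time we see this term: its postings start here.
            lexicon[term] = (len(postings), 0)
        start, count = lexicon[term]
        lexicon[term] = (start, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings

def lookup(term, lexicon, postings):
    """Return the postings for a term, or an empty list if it is unknown."""
    start, count = lexicon.get(term, (0, 0))
    return postings[start:start + count]

# A few sorted rows in the Term, Document Number, Frequency format.
rows = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
]
lexicon, postings = split_inverted_file(rows)
print(lexicon["CS"])                    # (1, 4)
print(lookup("CS", lexicon, postings))  # [(1, 2), (2, 4), (3, 3), (4, 1)]
```

Because the rows are sorted by term, all postings for a term are contiguous, so the lexicon only needs a start offset and a count.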

Example

Document	Keywords
1	CS(2), UNL(3), Ferguson(5), Lincoln(2)
2	Lincoln(3), CS(4), Computer(6)
3	CS(3)
4	university(2), UNL(2), CS(1)
5	Ferguson(1)

Here is the inverted file:

Term	Document Number	Frequency
Computer	2	6
CS	1	2
CS	2	4
CS	3	3
CS	4	1
Ferguson	1	5
Ferguson	5	1
Lincoln	1	2
Lincoln	2	3
university	4	2
UNL	1	3
UNL	4	2
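The table above can be reproduced from the Document/Keywords listing with a short Python sketch. It assumes that a count such as CS(2) means the term occurred twice in that document, and that the sort ignores case, which is the ordering that places "university" between "Lincoln" and "UNL". The variable names are illustrative only.

```python
from collections import Counter

# The example documents, written as term -> occurrence count.
keywords = {
    1: {"CS": 2, "UNL": 3, "Ferguson": 5, "Lincoln": 2},
    2: {"Lincoln": 3, "CS": 4, "Computer": 6},
    3: {"CS": 3},
    4: {"university": 2, "UNL": 2, "CS": 1},
    5: {"Ferguson": 1},
}

# Step 1: one (term, document number) pair per occurrence, duplicates kept.
pairs = [(term, doc)
         for doc, counts in keywords.items()
         for term, n in counts.items()
         for _ in range(n)]

# Step 2: sort by term, ignoring case, then by document number.
pairs.sort(key=lambda p: (p[0].lower(), p[1]))

# Step 3: aggregate duplicates. Counter keeps first-seen order
# (Python 3.7+), so the sorted order is preserved.
inverted = [(term, doc, n) for (term, doc), n in Counter(pairs).items()]

for term, doc, n in inverted:
    print(term, doc, n, sep="\t")
```

The printed rows match the Term, Document Number, Frequency table line for line.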

To see the split into a Lexicon and a Postings file, see: PostingsFile

