Terms
This link provides a nice glossary of terms: http://www.cs.jhu.edu/~weiss/glossary.html
Description
Examples: Library catalogs
Generally the data are organized as a collection of documents.
Querying
Querying of unstructured textual data is referred to as Information Retrieval. It covers the following areas:
- Querying based on keywords
- The relevance of documents to the query
- The analysis, classification and indexing of documents.
Queries are formed using keywords and the logical connectives and, or, and not, where the and connective is implicit.
Full text: all words in a document are keywords. We use "term" to refer to the words in a document, since all words are keywords.
Given a document d and a term t, one way of defining the relevance r is
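As a rough illustration of keyword querying with the three connectives, here is a minimal sketch built on a term-to-documents index. The function names (`build_index`, `query_and`, `query_or`, `query_not`) and the sample documents are assumptions made for this example, not part of any particular retrieval system.

```python
def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def query_and(index, terms):
    """Documents containing every term (the implicit 'and')."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def query_or(index, terms):
    """Documents containing at least one of the terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.union(*sets) if sets else set()


def query_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical sample collection, keyed by document id.
docs = {
    1: "database systems and information retrieval",
    2: "information theory",
    3: "database design",
}
index = build_index(docs)
```

Because and is implicit, a query such as "database information" would be answered by `query_and(index, ["database", "information"])`.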
$$$r(d,t)=\log\left(1+\frac{n(d,t)}{n(d)}\right)$$$
n(d) denotes the number of terms in the document, and n(d,t) denotes the number of occurrences of term t in the document d.
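The formula can be sketched directly in code. This is a minimal illustration that assumes a document is already split into a list of terms and that the logarithm is the natural log (the formula does not fix a base); `term_relevance` is a name chosen for this example.

```python
import math


def term_relevance(doc_terms, term):
    """r(d,t) = log(1 + n(d,t) / n(d)), using the natural log (an assumption)."""
    n_d = len(doc_terms)            # n(d): number of terms in the document
    if n_d == 0:
        return 0.0                  # assumption: an empty document has zero relevance
    n_dt = doc_terms.count(term)    # n(d,t): occurrences of the term in the document
    return math.log(1 + n_dt / n_d)


doc = "the cat sat on the mat".split()
score = term_relevance(doc, "the")  # log(1 + 2/6)
```

A term that does not occur in the document gets log(1) = 0, so absent terms contribute nothing.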
KEY: In the information retrieval community, the relevance of a document to a term is referred to as term frequency, regardless of the exact formula used.
Inverse document frequency (IDF) is defined as:
$$$IDF = \frac{1}{n(t)}$$$
where n(t) denotes the number of documents that contain the term t.
Here we have a low IDF if the word is found in many of the documents. If it is found in only a few, then it is probably a good term to use!
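A small sketch of this definition, assuming the collection is a list of documents that are each a list of terms. The name `inverse_document_frequency` is hypothetical, and returning 0 for a term found in no document is a choice made here to avoid dividing by zero.

```python
def inverse_document_frequency(docs, term):
    """IDF = 1 / n(t), where n(t) counts the documents containing the term."""
    n_t = sum(1 for terms in docs if term in terms)
    # Assumption: a term that appears nowhere gets an IDF of 0.
    return 1 / n_t if n_t else 0.0


docs = [["a", "b"], ["a", "c"], ["d"]]
common = inverse_document_frequency(docs, "a")  # in 2 documents
rare = inverse_document_frequency(docs, "d")    # in 1 document
```

The term found in fewer documents gets the higher IDF, matching the observation that rare terms are the more useful ones.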
Thus the relevance of a document d to a set of terms Q is defined as
$$$r(d,Q)=\sum_{t \in Q}\frac{r(d,t)}{n(t)}$$$
If the user is allowed to weight the query terms, this becomes
$$$r(d,Q)=\sum_{t \in Q}\frac{w(t) r(d,t)}{n(t)}$$$
where w(t) is a weight specified by the user for term t.
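Putting the pieces together, the following sketch scores and ranks a small collection against a query. It assumes documents are lists of terms, a natural log, and a default weight of 1 for any term the user did not weight; `relevance` and `rank` are names invented for this example.

```python
import math


def relevance(doc_terms, all_docs, query, weights=None):
    """r(d,Q) = sum over t in Q of w(t) * r(d,t) / n(t)."""
    weights = weights or {}
    n_d = len(doc_terms)
    score = 0.0
    for t in query:
        n_t = sum(1 for d in all_docs if t in d)   # n(t)
        if n_t == 0 or n_d == 0:
            continue                               # assumption: skip unseen terms
        r_dt = math.log(1 + doc_terms.count(t) / n_d)
        score += weights.get(t, 1.0) * r_dt / n_t  # unweighted terms use w(t) = 1
    return score


def rank(all_docs, query, weights=None):
    """Document positions ordered from most to least relevant."""
    scored = [(relevance(d, all_docs, query, weights), i)
              for i, d in enumerate(all_docs)]
    return [i for _, i in sorted(scored, key=lambda p: -p[0])]


docs = [
    "database index query".split(),
    "database database storage".split(),
    "query optimizer".split(),
]
query = ["database", "query"]
unweighted = rank(docs, query)
weighted = rank(docs, query, {"database": 3.0})
```

With equal weights the first document ranks highest because it matches both terms; raising the weight of "database" moves the second document, which mentions it twice, to the top.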
KEY: Stop words are words that are not indexed, such as and, or, the, a, etc.
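A minimal sketch of dropping stop words before indexing. The stop list here contains only the four words named above; a real list would be longer, and `index_terms` is a hypothetical helper name.

```python
# Illustrative stop list: only the examples given in the text.
STOP_WORDS = {"and", "or", "the", "a"}


def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


terms = index_terms("The cat and a dog")
```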
Proximity: if the query terms occur close to each other in the document, the document would be ranked higher than if they occur far apart. We could (although we don't) modify the formula $$r(d,Q)$$ to take proximity into account.
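No proximity measure is defined here, so the following is only one possible ingredient, offered as an assumption: the smallest distance, in word positions, between occurrences of two query terms. How such a distance would be folded into the relevance formula is left open, and `min_distance` is a hypothetical name.

```python
def min_distance(doc_terms, t1, t2):
    """Smallest gap in word positions between t1 and t2, or None if either is absent."""
    pos1 = [i for i, w in enumerate(doc_terms) if w == t1]
    pos2 = [i for i, w in enumerate(doc_terms) if w == t2]
    if not pos1 or not pos2:
        return None
    return min(abs(i - j) for i in pos1 for j in pos2)


near = min_distance("fast database query engine".split(), "database", "query")
far = min_distance("database x x x query".split(), "database", "query")
```

A smaller distance would argue for a higher rank, so the first document would be preferred over the second.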