Boolean and Vector Space Retrieval Models

Ms. T. Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore

 A retrieval model can describe either the
computational process or the human process of
retrieval:
 The process of choosing documents for retrieval.
 The process by which information needs are first
articulated and then refined.

 Boolean Models
 Vector Space Models
 Probabilistic Models
 Models based on Belief nets
 Models based on Language Models

 A document is represented as a set of keywords.
 Index terms are considered to be either present or absent in a
document and to provide equal evidence with respect to information
needs.
 Queries are Boolean expressions of keywords, connected by AND,
OR, and NOT, including the use of brackets to indicate scope.
[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
 Output: Document is relevant or not. No partial matches or ranking.
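
As a minimal sketch of Boolean matching, the example query above can be evaluated against documents stored as keyword sets. The four sample documents and the helper name `matches` are hypothetical and only serve the illustration.

```python
# Each document is represented as a set of keywords (hypothetical sample data).
docs = {
    "d1": {"rio", "brazil", "hotel", "beach"},
    "d2": {"rio", "brazil", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"hilo", "hawaii", "volcano"},
}

def matches(keywords):
    """Evaluate [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton."""
    # AND over a group of terms is a subset test; OR combines the two groups.
    place = ({"rio", "brazil"} <= keywords) or ({"hilo", "hawaii"} <= keywords)
    # NOT is a plain absence test.
    return place and "hotel" in keywords and "hilton" not in keywords

# The output is an unranked set: a document either satisfies the query or not.
result = sorted(name for name, keywords in docs.items() if matches(keywords))
print(result)  # ['d1', 'd3']
```

Note that d2 is rejected outright because it mentions Hilton, and d4 because it lacks "hotel"; there is no notion of a partial match.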

 User need: I’m interested in learning about vitamins
other than vitamin E that are antioxidants.
 User’s Boolean query: antioxidant AND vitamin
AND NOT vitamin E

 Each retrieval model makes three components
explicit:
 Document representation d
 Query q
 Ranking function R(d, q)

 An IR strategy is a technique by which a relevance
measure is obtained between a query and a document.
 Retrieve documents that make the query true.

 Boolean retrieval: documents either match or don’t.
 Good for expert users with precise understanding of
their needs and of the collection.
 Also good for applications: Applications can easily
consume 1000s of results.
 Not good for the majority of users
 This is particularly true of web search.

 Boolean queries often have either too few or too many results.
Query 1
standard AND user AND dlink AND 650
→ 200,000 hits Feast!
Query 2
standard AND user AND dlink AND 650 AND no AND card AND found
→ 0 hits Famine!
 In Boolean retrieval, it takes a lot of skill to come up with a query that
produces a manageable number of hits.
 In ranked retrieval, “feast or famine” is less of a problem.
 Condition: Results that are more relevant are ranked higher than results that
are less relevant. (i.e., the ranking algorithm works.)

 The Jaccard coefficient is a commonly used measure of
the overlap of two sets
 Let A and B be two sets
 Jaccard coefficient:
jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1 (for non-empty A)
 jaccard(A,B) = 0 if A ∩ B = ∅
 A and B don’t have to be the same size. Always
assigns a number between 0 and 1.

What is the query-document match score that the Jaccard
coefficient computes for:
 Query
“ides of March”
 Document
“Caesar died in March”
jaccard(q,d) = 1/6
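
The score above can be checked with a short sketch. It assumes lowercased, whitespace-separated tokens; the function name `jaccard` and the convention of returning 0 for two empty sets are choices made for this illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given a score of 0
    return len(a & b) / len(union)

query = "ides of March".lower().split()
document = "Caesar died in March".lower().split()

# Only "march" is shared; the union holds six distinct terms.
score = jaccard(query, document)
print(score)  # 0.16666666666666666
```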

 The Jaccard coefficient doesn’t consider term frequency
(how many occurrences a term has).
 Rare terms are more informative than frequent terms.
 Jaccard does not consider this information.

Advantages
 Can use very restrictive search
 Makes experienced users happy
 Clear formalism
 Simplicity
 Still used in small-scale search, such as searching
e-mail or files on a local hard drive

Disadvantages
 Simple queries do not work well.
 Complex query language, confusing to end users
 Difficult to control the number of documents
retrieved.
◦ All matched documents will be returned.
 Difficult to rank output.
◦ All matched documents logically satisfy the query.
 Difficult to perform relevance feedback.
◦ If a document is identified by the user as relevant or
irrelevant, how should the query be modified?

 The vector space model (or term vector model) is an
algebraic model for representing text documents (and
objects in general) as vectors of identifiers, such as
index terms.
 It is used in information filtering, information
retrieval, indexing and relevancy rankings.

The basis vectors correspond to the dimensions or
directions of the vector space

A vector is a point in a vector space and has length
(from the origin to the point) and direction

 A 2-dimensional vector can be written as [x, y]
 A 3-dimensional vector can be written as [x, y, z]

 Let V denote the size of the indexed vocabulary
 Any arbitrary span of text (i.e., a document or a
query) can be represented as a vector in
V-dimensional space
 Let’s assume three index terms: dog, bite, man (i.e.,
V = 3)

1 = the term appears at least once
0 = the term does not appear
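
A minimal sketch of this binary representation, using the three index terms dog, bite, man. The sample texts and the helper name `binary_vector` are hypothetical, and tokens are matched exactly (no stemming).

```python
VOCAB = ["dog", "bite", "man"]  # the three index terms, so V = 3

def binary_vector(text, vocab=VOCAB):
    """Map a span of text to a 0/1 vector with one component per index term."""
    tokens = set(text.lower().split())
    # 1 = the term appears at least once, 0 = the term does not appear
    return [1 if term in tokens else 0 for term in vocab]

print(binary_vector("dog bite man"))  # [1, 1, 1]
print(binary_vector("man dog"))       # [1, 0, 1]
print(binary_vector("cat"))           # [0, 0, 0]
```

The same function serves documents and queries, since both are just spans of text in the same V-dimensional space.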

A query is a vector in V-dimensional space, where
V is the number of terms in the vocabulary

 The vector space model ranks documents based on
the vector-space similarity between the query vector
and the document vector
 There are many ways to compute the similarity
between two vectors
 One way is to compute the inner product

Multiply corresponding components and then sum
those products
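
The inner product can be sketched in a few lines. The vectors below are hypothetical binary vectors over the terms dog, bite, man, and `inner_product` is a name chosen for this illustration.

```python
def inner_product(q, d):
    """Multiply corresponding components, then sum the products."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(qi * di for qi, di in zip(q, d))

query = [1, 0, 1]  # mentions dog and man
doc1 = [1, 1, 1]   # mentions dog, bite and man
doc2 = [0, 1, 0]   # mentions only bite

print(inner_product(query, doc1))  # 2
print(inner_product(query, doc2))  # 0
```

With binary vectors the inner product simply counts the index terms that the query and the document share.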

Pros and Cons
 The inner-product doesn’t account for the fact that
documents have widely varying lengths
 All things being equal, longer documents are more
likely to have the query-terms
 So, the inner-product favours long documents
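
The length bias can be illustrated with a small sketch. It assumes raw term-frequency weights rather than binary ones, and the texts are hypothetical: the long document is just the short one repeated three times, so it says nothing new about the query.

```python
VOCAB = ["dog", "bite", "man"]

def tf_vector(text, vocab=VOCAB):
    """Term-frequency vector: how often each index term occurs (an assumed weighting)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

def inner_product(q, d):
    return sum(qi * di for qi, di in zip(q, d))

short_doc = "dog bite man"
long_doc = " ".join([short_doc] * 3)  # same content, three times as long
query = tf_vector("dog man")

score_short = inner_product(query, tf_vector(short_doc))
score_long = inner_product(query, tf_vector(long_doc))
print(score_short, score_long)  # 2 6
```

The longer document scores three times higher purely because of its length, which is the bias that length normalisation is meant to remove.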

 Document represented as a vector:
d = <d1, d2, ..., dn>
 Query represented as a vector: q = <q1, q2, ..., qn>
 Ranking function (retrieval status value):

 The cosine similarity between two vectors (or two
documents in the vector space) is a measure that
calculates the cosine of the angle between them.
 The cosine similarity equation is obtained by solving
the dot product equation d · q = |d| |q| cos θ for cos θ:
cos θ = (d · q) / (|d| |q|)
 The numerator is the inner product
 The denominator is the product of the two vector-
lengths
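 Ranges from 0 to 1 when all term weights are
non-negative (equals 1 if the vectors point in the same
direction); with negative components it can range
from -1 to 1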

 a = [1, 2, 3]
 b = [4, -5, 6]
Dot product of a with b: dpab = 1*4 + 2*(-5) + 3*6 = 12
Dot product of a with itself: dpaa = 1*1 + 2*2 + 3*3 = 14
Dot product of b with itself: dpbb = 4*4 + (-5)*(-5) + 6*6 = 77
la = (dpaa)^½ = (14)^½ = 3.74; i.e., the length of a.
lb = (dpbb)^½ = (77)^½ = 8.77; i.e., the length of b.
la*lb = (dpaa)^½ * (dpbb)^½ = 32.83;
i.e., the length product (lpab) of a and b.

The dot product/length product ratio is
dpab / lpab = 12 / 32.83 ≈ 0.365, the cosine of the angle between a and b.
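
The arithmetic above can be reproduced with a short sketch using only the standard library. The function name `cosine_similarity` and the choice to return 0 for a zero-length vector are assumptions made for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (length_a * length_b)

a = [1, 2, 3]
b = [4, -5, 6]

sim = cosine_similarity(a, b)
print(round(sim, 4))  # 0.3655
```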

 The vector space model procedure can be divided
into three stages.
 The first stage is document indexing, where
content-bearing terms are extracted from the
document text.
 The second stage is the weighting of the indexed
terms to enhance retrieval of documents relevant to the
user.
 The last stage ranks the documents with respect to the
query according to a similarity measure.
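
The three stages can be sketched end to end as follows. Everything specific here is an assumption for illustration: the tiny stop-word list, raw term frequency as the weighting scheme, cosine similarity as the ranking measure, and the three sample documents.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "in"}  # hypothetical, deliberately tiny stop list

def index_terms(text):
    """Stage 1: extract content-bearing terms (lowercase, drop stop words)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def weight(terms):
    """Stage 2: weight the indexed terms (raw term frequency, an assumption)."""
    return Counter(terms)

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    lu = math.sqrt(sum(w * w for w in u.values()))
    lv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (lu * lv) if lu and lv else 0.0

def rank(query, docs):
    """Stage 3: order documents by similarity to the query, dropping zero scores."""
    q = weight(index_terms(query))
    scored = [(cosine(q, weight(index_terms(text))), name)
              for name, text in docs.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [name for score, name in scored if score > 0]

docs = {
    "d1": "the dog bit the man",
    "d2": "the man walked the dog in the park",
    "d3": "a history of rome",
}

ranking = rank("dog man", docs)
print(ranking)  # ['d1', 'd2']
```

Both d1 and d2 contain the two query terms, but d1 ranks first because it has fewer other terms, so its vector points closer to the query vector.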
