Web mining slides

Web Mining Outline
• Goal –
– Examine the use of data mining on the World Wide
Web.

• Outline -
– Introduction.
– Web Content Mining.
– Web Structure Mining.
– Web Usage Mining.

Mahavir Advaya - Ruia College

Web Mining Issues
• Size –
– >350 million pages (1999).
– Grows at about 1 million pages a day.
– Google indexes 3 billion documents.

• Diverse types of data.


Web Data
• Web pages.
• Intra-page structures.
• Inter-page structures.
• Usage data.
• Supplemental data –
– Profiles.
– Registration information.
– Cookies.


Web Mining Taxonomy


Web Content Mining
• Extends work of basic search engines.

• Search Engines –
– IR application.
– Keyword based.
– Similarity between query and document.
– Crawlers.
– Indexing.
– Profiles.
– Link analysis.


Crawlers (Spider)
• Robot (spider), a program, traverses the hypertext structure in
the Web.
– Collect information from visited pages.
– Used to construct indexes for search engines.

• Traditional Crawler – visits entire Web (?) and replaces index.

• Periodic Crawler – visits portions of the Web and updates subset
of index.

• Incremental Crawler – selectively searches the Web and
incrementally modifies index.

• Focused Crawler – visits pages related to a particular subject.


Focused Crawler
• Only visit links from a page if that page is determined to
be relevant.

• Classifier is static after learning phase.

• Components –
– Hypertext Classifier which assigns relevance score to
each page based on crawl topic.

– Distiller to identify hub pages.

– Crawler visits pages based on classifier and distiller
scores.


Focused Crawler
• Classifier relates documents to topics.

• Classifier also determines how useful outgoing links are.

• Hub Pages contain links to many relevant pages. They must
be visited even if their own relevance score is not high.
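
The crawl policy described on the two slides above can be sketched on a toy, in-memory Web. Everything here is an illustrative assumption rather than a real crawler: the TOY_WEB dictionary, the keyword-based relevance function standing in for the hypertext classifier, and the hub_score function standing in for the distiller (which, for simplicity, peeks at linked pages directly).

```python
import heapq

# Hypothetical in-memory "Web": page -> (text, outgoing links).
TOY_WEB = {
    "seed": ("data mining overview", ["hub", "sports"]),
    "hub": ("links page", ["mining1", "mining2"]),
    "sports": ("football scores", ["sports2"]),
    "sports2": ("more football", []),
    "mining1": ("web mining crawler", []),
    "mining2": ("usage mining logs", []),
}
TOPIC_WORDS = {"mining", "crawler", "data"}


def relevance(page):
    """Stand-in classifier: fraction of the page's words that are topic words."""
    words = TOY_WEB[page][0].split()
    return sum(w in TOPIC_WORDS for w in words) / len(words)


def hub_score(page):
    """Stand-in distiller: fraction of outgoing links that lead to relevant pages."""
    links = TOY_WEB[page][1]
    if not links:
        return 0.0
    return sum(relevance(t) > 0 for t in links) / len(links)


def focused_crawl(seed, threshold=0.3):
    """Visit pages in priority order; follow links only from relevant pages or hubs."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        if max(relevance(page), hub_score(page)) < threshold:
            continue  # neither relevant nor a hub: do not follow its links
        for target in TOY_WEB[page][1]:
            if target not in seen:
                seen.add(target)
                priority = max(relevance(target), hub_score(target))
                heapq.heappush(frontier, (-priority, target))
    return visited


print(focused_crawl("seed"))
```

In this toy run the page "hub" has a relevance score of zero but is still expanded because of its hub score, while the links of the irrelevant page "sports" are never followed.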


Focused Crawler


Context Focused Crawler
• Context Graph –
– A context graph is created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to
documents at next higher level.
– Updated during crawl itself.

• Approach –
1. Construct context graph and classifiers using seed
documents as training data.
2. Perform crawling using classifiers and context graph
created.
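
As a rough illustration of the context graph, the sketch below builds its levels by walking backlinks outward from a seed document. The backlinks dictionary and the function name context_graph are hypothetical; a real system would have to obtain backlink information from a search engine or an existing index.

```python
def context_graph(seed, backlinks, depth):
    """Return levels[0..depth]; level i holds documents that link to level i-1."""
    levels = [[seed]]  # level 0 (the root) is the seed document
    seen = {seed}
    for _ in range(depth):
        next_level = []
        for doc in levels[-1]:
            for source in backlinks.get(doc, []):
                if source not in seen:
                    seen.add(source)
                    next_level.append(source)
        levels.append(next_level)
    return levels


# Hypothetical backlink table: document -> documents that link to it.
backlinks = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
print(context_graph("seed", backlinks, depth=2))
```

Each document is placed only at the first (closest) level where it is found, so "c" appears once even though it links to both "a" and "b".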


Context Graph


Virtual Web View
• Approach to handle unstructured data.
• Multiple Layered DataBase (MLDB) built on top of the
Web.
• Each layer of the database is more generalized (and
smaller) and centralized than the one beneath it.
• Upper layers of MLDB are structured and can be accessed
with SQL type queries.
• Does not require the use of spiders (Crawlers).
• Translation tools convert Web documents to XML.
• Extraction tools extract desired information to place in first
layer of MLDB.
• Higher levels contain more summarized data obtained
through generalizations of the lower levels.


WebML
• Web data Mining Query Language.

• Provides data mining operations on MLDB.

• Major feature – four operations –
– COVERS: one concept covers another if it is higher in
the hierarchy.
– COVERED BY: reverse of COVERS; one concept is covered by
another if it is one of its descendants.
– LIKE: concept is a synonym.
– CLOSE TO: One concept is close to another if it is a
sibling in the hierarchy.
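
The four operations can be pictured as predicates over a concept hierarchy. The sketch below is not WebML itself: the PARENT and SYNONYMS tables are made-up data, and treating COVERS as "is a strict ancestor of" is an assumption drawn from the wording above.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
PARENT = {
    "cat": "feline",
    "lion": "feline",
    "feline": "mammal",
    "dog": "canine",
    "canine": "mammal",
}
# Hypothetical synonym pairs (stored in both directions).
SYNONYMS = {("cat", "kitty"), ("kitty", "cat")}


def ancestors(concept):
    """All concepts above the given one, nearest first."""
    result = []
    while concept in PARENT:
        concept = PARENT[concept]
        result.append(concept)
    return result


def covers(a, b):
    """a COVERS b: a is higher in the hierarchy than b (an ancestor of b)."""
    return a in ancestors(b)


def covered_by(a, b):
    """a COVERED BY b: the reverse of COVERS."""
    return covers(b, a)


def like(a, b):
    """a LIKE b: the two concepts are synonyms (or identical)."""
    return a == b or (a, b) in SYNONYMS


def close_to(a, b):
    """a CLOSE TO b: distinct concepts that share the same parent."""
    return a != b and a in PARENT and PARENT.get(a) == PARENT.get(b)


print(covers("mammal", "cat"), close_to("cat", "lion"))
```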


WebML
• Example –
– Find all the documents at the level of
www.engr.smu.edu that have a keyword covering the keyword cat.

• Query –
SELECT *
FROM document in "www.engr.smu.edu"
WHERE ONE OF keywords COVERS "cat"


Personalization
• Example of Web Content Mining.
• Web access or contents tuned to better fit the desires of each
user.
• With personalization, advertisements are sent to customers
based on specific knowledge about them.
• Goal – Make the customer purchase something.

• Three basic types –
– Manual techniques – identify user’s preferences based on
profiles or demographics.
– Collaborative filtering identifies preferences based on ratings
from similar users.
– Content based filtering retrieves pages based on similarity
between pages and user profiles.
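
A minimal sketch of content based filtering, assuming pages and user profiles are both represented as bags of words and compared with cosine similarity. The page texts, the profile string, and the function names are invented for illustration; the slide does not prescribe a particular similarity measure.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(profile_text, pages):
    """Return page names ordered by similarity to the profile, best first."""
    profile = Counter(profile_text.split())
    scored = [
        (cosine(profile, Counter(text.split())), name)
        for name, text in pages.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Hypothetical pages and user profile.
pages = {
    "p1": "web mining crawler",
    "p2": "football scores today",
    "p3": "data mining course",
}
print(recommend("mining crawler", pages))
```

Collaborative filtering would replace the page text with ratings from other users and compare users rather than documents.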


Web Structure Mining
• Create a model of the Web organization or a portion of
it.

• Mine structure (links, graph) of the Web.

• Techniques –
– PageRank.
– CLEVER.

• May be combined with content mining to more
effectively retrieve important pages.


PageRank
• Used by Google.

• Prioritize pages returned from search by looking at Web
structure.

• Importance of page is calculated based on number of
pages which point to it – Backlinks.

• Weighting is used to provide more importance to
backlinks coming from important pages.


PageRank (cont’d)

• PR(p) = c · (PR(1)/N_1 + … + PR(n)/N_n)
– PR(i): PageRank for a page i which points to
target page p.

– N_i: number of links coming out of page i.

– c: constant value between 0 and 1 used for
normalization.
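
The formula can be evaluated iteratively. The sketch below is a simplified illustration, not Google's production algorithm: it assumes every page has at least one outgoing link, and it plays the role of the constant c by rescaling the ranks to sum to 1 after each pass. The graph and function name are made up.

```python
def pagerank(out_links, iterations=50):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over pages i that link to p."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            for p in targets:
                new_rank[p] += rank[i] / len(targets)  # PR(i) / N_i
        total = sum(new_rank.values())
        # Normalization step (the role of c): keep the ranks summing to 1.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical three-page Web: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print({page: round(value, 3) for page, value in ranks.items()})
```

Page C ends up ranked as highly as A because it receives backlinks from both other pages, while B receives only half of A's rank.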


CLEVER
• System developed by IBM.
• Finding both authoritative and hub pages.

• Authoritative Pages –
– Authors define an authority as the “best source” for
the request.
o Highly important pages.
o Best source for requested information.

• Hub Pages –
– Contain links to highly important pages.
– CLEVER identifies authoritative and hub pages by
creating weights.


HITS
• Hyperlink-Induced Topic Search.
• Finds Hubs and Authoritative Pages.

• Two components –
– Based on a set of keywords, find set of relevant
pages – R.
– Identify hub and authority pages for these.
o Expand R to a base set, B, of pages linked to or
from R.
o Calculate weights for authorities and hubs.

• Pages with highest ranks in R are returned.
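
The weight calculation can be sketched as a mutual reinforcement loop: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to. The link table below is a made-up base set, and the fixed iteration count and Euclidean normalization are assumptions of this sketch.

```python
import math


def hits(base_links, iterations=50):
    """Return (hub, authority) weight dictionaries for a base set of pages."""
    pages = set(base_links) | {t for ts in base_links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in base_links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in base_links.get(p, [])) for p in pages}
        # Normalize so the weights stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical base set B: page -> pages it links to.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```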

HITS Algorithm


Web Usage Mining
• Performs mining on Web usage data or Web logs.


Web Usage Mining Applications
• Personalization – tracking of previously accessed pages.
• Determining frequent access behavior for users.
• Improve structure of a site’s Web pages.
• Aid in caching and prediction of future page references.
• Improve design of individual pages.
• Improve effectiveness of e-commerce (sales and
advertising).
• Gathering statistics about accessed pages – this may
or may not be viewed as part of Web mining.


Web Usage Mining Activities
• Preprocessing Web log –
– Cleanse.
– Remove extraneous information.
– Sessionize –
o Session: Sequence of pages referenced by one user at a
sitting.

• Pattern Discovery –
– Count patterns that occur in sessions.
– A pattern is a sequence of page references in a session.
– Similar to association rules –
o Transaction: session.
o Itemset: pattern (or subset).
o Order is important.

• Pattern Analysis.
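
Pattern discovery can be sketched as support counting over sessions, much like counting itemsets over transactions but with order preserved. This example assumes patterns are consecutive page sequences of a fixed length and that each session contributes at most one count per pattern; the session data and the function name are hypothetical.

```python
from collections import Counter


def frequent_patterns(sessions, length, min_support):
    """Count consecutive page sequences of the given length across sessions."""
    counts = Counter()
    for session in sessions:
        seen = set()
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)  # each session supports a pattern at most once
    return {p: c for p, c in counts.items() if c >= min_support}


# Hypothetical sessions: each is the ordered list of pages one user visited.
sessions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "A", "C"],
    ["A", "B", "C", "A", "B"],
]
print(frequent_patterns(sessions, length=2, min_support=2))
```

Because order is important, the sequence (A, B) is frequent here while (B, A) is not.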

ARs in Web Mining
• Web Mining –
– Content.
– Structure.
– Usage.

• Frequent patterns of sequential page references in Web
searching.

• Uses –
– Caching
– Clustering users
– Develop user profiles
– Identify important pages

Web Usage Mining Issues
• Identification of exact user not possible.

• Exact sequence of pages referenced by a user cannot be
determined due to caching.

• Session not well defined.

• Security, privacy, and legal issues.


Web Log Cleansing
• Replace source IP address with unique but non-
identifying ID.

• Replace exact URL of pages referenced with unique but
non-identifying ID.

• Delete error records and records containing non-page
data (such as figures and code).
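
The three cleansing steps can be sketched as one pass over the log. The record layout (IP address, URL, HTTP status), the list of non-page file suffixes, and the ID scheme are all assumptions made for this example; real log formats carry more fields.

```python
# Assumed suffixes for non-page resources such as images, styles, and scripts.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def cleanse(records):
    """records: iterable of (ip, url, status). Returns a list of (user_id, page_id)."""
    ip_ids, url_ids = {}, {}
    cleaned = []
    for ip, url, status in records:
        if status >= 400:
            continue  # delete error records
        if url.lower().endswith(NON_PAGE_SUFFIXES):
            continue  # delete records for non-page data
        # Replace IP address and URL with unique but non-identifying IDs.
        user = ip_ids.setdefault(ip, f"U{len(ip_ids) + 1}")
        page = url_ids.setdefault(url, f"P{len(url_ids) + 1}")
        cleaned.append((user, page))
    return cleaned


# Hypothetical log using documentation-range IP addresses.
log = [
    ("192.0.2.1", "/index.html", 200),
    ("192.0.2.1", "/logo.gif", 200),
    ("192.0.2.2", "/index.html", 200),
    ("192.0.2.1", "/missing.html", 404),
    ("192.0.2.2", "/products.html", 200),
]
print(cleanse(log))
```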


Sessionizing
• Divide Web log into sessions.

• Two common techniques –
– Number of consecutive page references from a source
IP address occurring within a predefined time interval
(e.g. 25 minutes).

– All consecutive page references from a source IP
address where the interclick time is less than a
predefined threshold.
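
The two techniques can be compared on the same click stream. In this sketch the clicks are (timestamp in minutes, page) pairs for a single source IP address, already sorted by time; the function names, data, and thresholds are illustrative assumptions.

```python
def sessionize_by_window(clicks, window):
    """Technique 1: a session holds all clicks within `window` of its first click."""
    sessions, start = [], None
    for timestamp, page in clicks:
        if start is None or timestamp - start > window:
            sessions.append([])  # window exceeded: open a new session
            start = timestamp
        sessions[-1].append(page)
    return sessions


def sessionize_by_gap(clicks, max_gap):
    """Technique 2: a session continues while the interclick time is below max_gap."""
    sessions, previous = [], None
    for timestamp, page in clicks:
        if previous is None or timestamp - previous >= max_gap:
            sessions.append([])  # gap too long: open a new session
        sessions[-1].append(page)
        previous = timestamp
    return sessions


# Hypothetical clicks from one source IP address: (minute, page).
clicks = [(0, "A"), (10, "B"), (20, "C"), (30, "D"), (70, "E")]
print(sessionize_by_window(clicks, window=25))
print(sessionize_by_gap(clicks, max_gap=15))
```

The two rules split the same log differently: the fixed window cuts the steady browsing run after 25 minutes, whereas the interclick rule keeps pages A through D together and only splits at the long pause before E.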


Data Structures
• Keep track of patterns identified during Web usage
mining process.

• Common techniques –
– Trie.
– Suffix Tree.
– Generalized Suffix Tree.
– WAP Tree.


Trie vs. Suffix Tree
• Trie –
– Rooted tree.
– Edges labeled with a character (page) from the pattern.
– Path from root to leaf represents pattern.

• Suffix Tree –
– Single child collapsed with parent. Edge contains
labels of both prior edges.
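
A small sketch of the two structures, using nested dictionaries whose keys are edge labels. The collapse function implements only the step described above (merging a single child with its parent and concatenating the edge labels); it does not mark where a pattern ends, so it is a simplification that suits inputs where no pattern is a prefix of another. All names are hypothetical.

```python
def build_trie(patterns):
    """Trie as nested dicts: each key is an edge labeled with one character (page)."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
    return root


def collapse(node):
    """Merge every single-child chain into one edge carrying the combined label."""
    collapsed = {}
    for label, child in node.items():
        while len(child) == 1:
            next_label, next_child = next(iter(child.items()))
            label += next_label  # edge now contains labels of both prior edges
            child = next_child
        collapsed[label] = collapse(child)
    return collapsed


trie = build_trie(["ALOG"])
print(trie)            # one edge per character
print(collapse(trie))  # the whole chain collapsed into a single edge

# Suffix tree of "ALOG": collapse the trie built from all of its suffixes.
suffixes = ["ALOG"[i:] for i in range(len("ALOG"))]
print(collapse(build_trie(suffixes)))
```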


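Trie and Suffix Tree
[Figure: trie for the pattern ALOG, with one edge per character (A, L, O, G), next to the corresponding suffix tree, where the single-child chain is collapsed into one edge labeled ALOG.]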


Generalized Suffix Tree
• Suffix tree for multiple sessions.

• Contains patterns from all sessions.

• Maintains count of frequency of occurrence of a pattern
in the node.

• WAP Tree –
– Web Access Pattern.
– Compressed version of generalized suffix tree.
– Tree stores sequences and their counts.


Types of Patterns
• Algorithms have been developed to discover different
types of patterns.

• Properties –
– Ordered – Characters (pages) must occur in the exact
order in the original session.
– Duplicates – Duplicate characters are allowed in the
pattern.
– Consecutive – All characters in pattern must occur
consecutively in the given session.
– Maximal – Not a subsequence of another pattern.
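
These properties can be expressed as small checks on strings, where each character stands for a page. The helper names and the sample session are assumptions made for illustration.

```python
def occurs_in_order(pattern, session):
    """Ordered: the pattern's pages appear in the session in order, gaps allowed."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is enforced.
    return all(page in remaining for page in pattern)


def occurs_consecutively(pattern, session):
    """Consecutive: the pattern appears in the session with no gaps."""
    n = len(pattern)
    return any(session[i:i + n] == pattern for i in range(len(session) - n + 1))


def maximal(patterns):
    """Maximal: keep patterns that are not a subsequence of another pattern."""
    return [
        p for p in patterns
        if not any(p != q and occurs_in_order(p, q) for q in patterns)
    ]


session = "ABCAD"  # hypothetical session; page A is visited twice (a duplicate)
print(occurs_in_order("ACD", session))       # in order, but with gaps
print(occurs_consecutively("ACD", session))  # not contiguous
print(maximal(["AB", "ABC", "D"]))
```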


Pattern Types
• Association Rules.
• Episodes.
• Sequential Patterns.
• Forward Sequences.
• Maximal Frequent Sequences.


Questions???
• Write a short note on Web Content Mining.
• What is Web Mining? Give web mining taxonomy.
• What do you mean by Web Usage Mining? Explain rule with
examples.
• Write a short note on Harvest System.
• Define crawler. State and explain different types of crawlers.
• Write a short note on crawlers.
• Give taxonomy of web mining activities. For what purpose is web
usage mining used? What activities are involved in web usage
mining?
• What do you understand by the term “Web Usage Mining”?
• Explain the term crawlers in web mining.
• Discuss the importance of establishing a standardized WebML.
• Write a short note on web structure mining.

