Web mining (structure mining)

Amir Fahmideh
Reza Baettela
Shayan Asadpoor

Why Mining Data?

• Computerization and automated data gathering have
resulted in extremely large data repositories.
• Scalability issues and the desire for more
automation make more traditional
techniques less effective.
• Raw Data, Pattern, Knowledge

Definition of Web Mining
The application of data mining techniques to
discover patterns from the Web.
Web data consists of:
• Web Content (text, images, records, etc.)
• Web Structure (hyperlinks, tags, etc.)
• Web Usage (HTTP logs, app server logs, etc.)

Web Mining Taxonomy
Web Mining
• Content Mining
  • Web page content mining
  • Search result mining
• Structure Mining
• Usage Mining
  • General access pattern tracking
  • Customized usage tracking

What is web structure mining?
The structure of a typical Web graph consists
of Web pages as nodes and hyperlinks as
edges connecting two related pages.

What is web structure mining?
Web Structure Mining is the process of
discovering structure information from the
Web.
•This type of mining can be performed either
at the (intra-page) document level or at the
(inter-page) hyperlink level.
•Research at the hyperlink level is also
called Hyperlink Analysis.

Motivation to study Hyperlink Structure
1. Hyperlinks serve two main purposes:
• Pure navigation.
• Pointing to pages with authority on the same topic
as the page containing the link.
2. This can be used to retrieve useful
information from the Web.

Web Structure Terminology
•Web-Graph: A directed graph that represents the Web.
•Node: Each Web page is a node of the Web-graph.
•Link: Each hyperlink on the Web is a directed edge of
the Web-graph.
•In-degree: The in-degree of a node p is the number of
distinct links that point to p.
•Out-degree: The out-degree of a node p is the
number of distinct links originating at p that point to
other nodes.
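
The in-degree and out-degree definitions can be illustrated with a minimal Python sketch. The toy graph `links` and the helper `degrees` are hypothetical names introduced only for this example. Ignoring self-links when counting in-degree is an assumption; the definition above only excludes them for out-degree.

```python
from collections import defaultdict

# Hypothetical toy Web-graph: page -> hyperlink targets found on that page.
links = {
    "A": ["B", "C", "B"],  # the repeated link to B is counted once
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def degrees(links):
    """Return (in_degree, out_degree) dicts that count distinct links."""
    in_deg = defaultdict(int)
    out_deg = {}
    for page, targets in links.items():
        # Distinct links that point to *other* nodes (self-links ignored;
        # for in-degree this is an assumption of the sketch).
        distinct = set(targets) - {page}
        out_deg[page] = len(distinct)
        in_deg[page] += 0  # make sure pages with no in-links get an entry
        for target in distinct:
            in_deg[target] += 1
    return dict(in_deg), out_deg

in_deg, out_deg = degrees(links)
print("in-degree: ", in_deg)
print("out-degree:", out_deg)
```

Storing the graph as an adjacency list (page to targets) keeps the out-degree a simple set size; the in-degree needs one pass over all links.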

Web Structure Terminology
•Directed Path: A sequence of links starting from p
that can be followed to reach q.
•Shortest Path: Of all the paths between nodes p and
q, the one with the shortest length, i.e. the fewest
links on it.
•Diameter: The maximum of the shortest-path lengths
over all pairs of nodes p and q in the Web-graph.
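
Shortest paths and the diameter can be computed with breadth-first search, as in the minimal sketch below. The toy graph and the function names are hypothetical. Skipping pairs where q cannot be reached from p is an assumption of this sketch: a directed Web-graph is usually not strongly connected, so the diameter would otherwise be undefined.

```python
from collections import deque

# Hypothetical toy Web-graph: page -> hyperlink targets.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def shortest_path_lengths(links, source):
    """Breadth-first search: links on the shortest directed path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:  # first visit is the shortest path
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist

def diameter(links):
    """Largest shortest-path length over all pairs (p, q) with q reachable from p."""
    return max(
        length
        for p in links
        for length in shortest_path_lengths(links, p).values()
    )

from_d = shortest_path_lengths(links, "D")
print("shortest path lengths from D:", from_d)
print("diameter:", diameter(links))
```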

Shape Of Web

The shape of the Chinese Web Graph

The shape of the Web Graph is more accurately
represented by a daisy-looking graph.

Example: web structure by language

Example: Components of web
structures by Language

Hyperlink Analysis Techniques
• Knowledge Models
• Analysis Scope and Properties
• Measures and Algorithms
• Applications

Hyperlink Analysis Techniques
• Knowledge Models: The underlying representations that form
the basis for carrying out the application-specific task.
• Analysis Scope and Properties: The scope of analysis
specifies whether the task is relevant to a single node, a set of
nodes, or the entire graph. The properties are the characteristics
of a single node, a set of nodes, or the entire Web.
• Measures and Algorithms: The measures are the standards
for properties such as quality, relevance, or distance
between nodes. Algorithms are designed for efficient
computation of these measures.
These three areas form the fundamental blocks for building
various Applications based on hyperlink analysis.

Google’s PageRank
Key idea:
The rank of a web page depends on the
ranks of the web pages pointing to it.
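
The key idea can be sketched as an iterative computation in which every page repeatedly passes its rank on to the pages it links to. The sketch below follows the commonly described damped formulation. The damping factor of 0.85, the even spreading of rank from pages without out-links, the toy graph, and the function name `pagerank` are all assumptions of this example and are not taken from the slides.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=100):
    """Iterative PageRank on a dict: page -> list of link targets.

    Assumes every link target also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page starts with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = set(links[p])
            if targets:
                # A page splits its rank evenly over its distinct out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical toy Web-graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page C is pointed to by three pages and ends up with the highest rank, while page D has no incoming links and keeps only the baseline share.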

Hubs and Authorities
Key ideas:
• Hubs and authorities are 'fans' and 'centers' in a
bipartite core of a web graph.
• A good hub page is one that points to many good
authority pages.
• A good authority page is one that is pointed to by
many good hub pages.

HITS Algorithm
Let a be the vector of authority scores and h be the vector of hub
scores, and let A be the adjacency matrix of the Web-graph
(A[i][j] = 1 if page i links to page j).
a = [1, 1, …, 1], h = [1, 1, …, 1];
do
    a = A^T h;  (authority update rule)
    h = A a;    (hub update rule)
    normalize a and h;  (divide each vector by the square root of the sum of its squared entries)
while a and h have not converged (change still above a convergence threshold)
a* = a;
h* = h;
return a*, h*
The vectors a* and h* hold the authority and hub weights.
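
A minimal pure-Python sketch of this loop is shown below. It works on an adjacency list instead of an explicit matrix, so "a = A^T h" becomes "sum the hub scores of the pages linking to me" and "h = A a" becomes "sum the authority scores of the pages I link to". The function name `hits`, the tolerance, the iteration cap, and the toy graph are assumptions made for illustration.

```python
import math

def hits(links, tol=1e-9, max_iter=100):
    """HITS on a dict: page -> list of link targets (all targets must be keys)."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # Authority update: sum of hub scores of the pages pointing to each page.
        new_auth = {p: 0.0 for p in pages}
        for p in pages:
            for t in set(links[p]):
                new_auth[t] += hub[p]
        # Hub update: sum of (updated) authority scores of the pages pointed to.
        new_hub = {p: sum(new_auth[t] for t in set(links[p])) for p in pages}
        # Normalize both vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {p: v / a_norm for p, v in new_auth.items()}
        new_hub = {p: v / h_norm for p, v in new_hub.items()}
        change = sum(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p]) for p in pages
        )
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub

# Hypothetical toy graph: three hub-like pages pointing at two authority-like pages.
links = {"H1": ["X", "Y"], "H2": ["X", "Y"], "H3": ["X"], "X": [], "Y": []}
auth, hub = hits(links)
print("authority:", {p: round(v, 3) for p, v in auth.items()})
print("hub:      ", {p: round(v, 3) for p, v in hub.items()})
```

In this toy graph, X is pointed to by all three hub-like pages and receives the highest authority score, and H1 and H2 receive a higher hub score than H3 because they point to both authorities.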

Information Scent
Key idea:
• A user at a given page “foraging” for information would
follow a link which “smells” of that information.
• The probability of following a link depends on how strong
the “scent” is on that link.
• Proximal Cues: snippets and graphics.
• Distal Scent: content from the page at the other end of the link.
[Figure: scent along a link between pages P1 and P2]
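
One simple way to picture the second point is to make the probability of following each link proportional to its scent strength. This proportional rule is an assumption of the sketch below, not a model taken from the slides; the link names, scent values, and function name are hypothetical.

```python
def link_probabilities(scent):
    """Turn non-negative scent strengths into link-following probabilities.

    Assumption: probability is directly proportional to scent strength.
    """
    total = sum(scent.values())
    if total == 0:
        # No scent anywhere: fall back to a uniform choice among the links.
        return {link: 1.0 / len(scent) for link in scent}
    return {link: strength / total for link, strength in scent.items()}

# Hypothetical scent strengths for the links on one page.
scent = {"pricing": 3.0, "about": 1.0}
probs = link_probabilities(scent)
print(probs)
```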

Conclusion
Web Structure is a useful source for extracting
information such as:
•Quality of Web Page
  •The authority of a page on a topic
  •Ranking of web pages

•Interesting Web Structures
  •Graph patterns like Co-citation, Social choice,
  Complete bipartite graphs, etc.

•Web Page Classification
  •Classifying web pages according to various topics

Conclusion
•Which pages to crawl
  •Deciding which web pages to add to the collection of
  web pages

•Finding Related Pages
  •Given one relevant page, find all related pages

•Detection of duplicated pages
  •Detection of near-mirror sites to eliminate
  duplication
