Web Mining (Structure Mining)
Amir Fahmideh
Reza Baettela
Shayan Asadpoor
Why Mining Data?

• Computerization and automated data gathering have resulted in extremely large data repositories.
• Scalability issues and the desire for more automation make traditional techniques less effective.
• Raw Data, Pattern, Knowledge
Definition of Web Mining
The application of data mining techniques to
discover patterns from the Web.
Web data consist of:
• Web Content (text, images, records, etc.)
• Web Structure (hyperlinks, tags, etc.)
• Web Usage (HTTP logs, app server logs, etc.)
Web Mining Taxonomy
Web Mining
• Content Mining
  • Web page content mining
  • Search result mining
• Structure Mining
• Usage Mining
  • General access pattern tracking
  • Customized usage tracking
What is web structure mining?
The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges connecting related pages.
Web Structure Mining is the process of discovering structure information from the Web.
• This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level.
• Research at the hyperlink level is also called Hyperlink Analysis.
Motivation to study Hyperlink Structure
1. Hyperlinks serve two main purposes:
• Pure navigation.
• Pointing to pages with authority on the same topic as the page containing the link.
2. These links can be used to retrieve useful information from the Web.
Web Structure Terminology
• Web-graph: A directed graph that represents the Web.
• Node: Each Web page is a node of the Web-graph.
• Link: Each hyperlink on the Web is a directed edge of the Web-graph.
• In-degree: The in-degree of a node p is the number of distinct links that point to p.
• Out-degree: The out-degree of a node p is the number of distinct links originating at p that point to other nodes.
• Directed path: A sequence of links, starting from p, that can be followed to reach q.
• Shortest path: Of all the paths between nodes p and q, the one with the shortest length, i.e. the fewest links.
• Diameter: The maximum of the shortest-path lengths over all pairs of nodes p and q in the Web-graph.
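The terminology above can be made concrete on a toy graph. The following is a minimal sketch, not part of the original slides: the graph, the page names, and the function names are hypothetical, and the diameter is taken only over pairs that are actually connected by a directed path (an assumption, since in a directed graph some pairs are unreachable).

```python
from collections import deque

# Hypothetical toy web graph: page -> pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(graph, p):
    """Number of distinct links originating at p."""
    return len(set(graph.get(p, [])))

def in_degree(graph, p):
    """Number of distinct links pointing to p."""
    return sum(1 for targets in graph.values() if p in set(targets))

def shortest_path_length(graph, p, q):
    """Breadth-first search; returns the number of links, or None if q is unreachable."""
    if p == q:
        return 0
    seen = {p}
    queue = deque([(p, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == q:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def diameter(graph):
    """Longest shortest path over all ordered pairs that are connected (assumption)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    lengths = [shortest_path_length(graph, p, q) for p in nodes for q in nodes if p != q]
    finite = [d for d in lengths if d is not None]
    return max(finite) if finite else 0
```

On this toy graph, page C has three incoming links (from A, B, and D), and no page links to D, so every path ending at D is undefined and is skipped when computing the diameter.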
Interesting Web Structures
Shape of the Web
(Figure: the shape of the Chinese Web graph)
The shape of the Web graph is more accurately represented by a daisy-looking graph.
The Bow-Tie Model of the Web (figure)
Example: web structure by language (figure)
Example: components of web structures by language (figure)
Example: web structure (figure)
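The bow-tie regions can be recovered from any directed graph with plain reachability searches. The sketch below follows the CORE/IN/OUT definitions given in the editor's notes: CORE is taken to be the largest strongly connected component, IN the nodes that can reach CORE, and OUT the nodes reachable from CORE. The function names and the toy graph are hypothetical, tendrils, tubes, and islands are lumped together as OTHER, and the quadratic search for the largest component is only suitable for small examples.

```python
def reachable(graph, start):
    """All nodes reachable from start along directed edges (start included)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Graph with every edge flipped."""
    rev = {}
    for src, targets in graph.items():
        rev.setdefault(src, set())
        for t in targets:
            rev.setdefault(t, set()).add(src)
    return rev

def bow_tie(graph):
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rev = reverse(graph)
    core = set()
    for v in nodes:
        if v in core:
            continue
        # The strongly connected component of v: nodes that v reaches and that reach v.
        scc = reachable(graph, v) & reachable(rev, v)
        if len(scc) > len(core):
            core = scc
    seed = next(iter(core))
    out_part = reachable(graph, seed) - core   # reachable from CORE
    in_part = reachable(rev, seed) - core      # can reach CORE
    other = nodes - core - in_part - out_part  # tendrils, tubes, islands
    return {"CORE": core, "IN": in_part, "OUT": out_part, "OTHER": other}

# Hypothetical example: a 3-page cycle with one page feeding it and two pages hanging off it.
example = {
    "i1": ["a", "t"],
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "o1"],
    "o1": ["o2"],
    "x": [],
}
parts = bow_tie(example)
```

Here `t` is a tendril hanging off IN and `x` is an island, so both end up in OTHER.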
Hyperlink Analysis Techniques
• Knowledge Models: The underlying representations that form the basis for carrying out the application-specific task.
• Analysis Scope and Properties: The scope of analysis specifies whether the task is relevant to a single node, a set of nodes, or the entire graph. The properties are the characteristics of a single node, a set of nodes, or the entire Web.
• Measures and Algorithms: The measures are the standards for properties such as quality, relevance, or distance between nodes. Algorithms are designed for efficient computation of these measures.
These three areas form the fundamental building blocks for various Applications based on hyperlink analysis.
Google’s Page Rank
Key idea: the rank of a web page depends on the rank of the web pages pointing to it.
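A minimal sketch of this key idea, assuming the commonly cited iterative formulation PR(p) = (1 - d)/N + d * (sum over pages q linking to p of PR(q)/OutDeg(q)), with the damping value d = 0.85 and N the number of documents, as mentioned in the editor's notes. The function name, the toy graph, the fixed iteration count, and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions for illustration, not Google's implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively recompute each page's rank from the ranks of the pages linking to it."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for q in nodes:
            targets = set(graph.get(q, ()))
            if targets:
                # q passes an equal share of its rank to every page it links to.
                share = damping * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                for p in nodes:
                    new_rank[p] += damping * rank[q] / n
        rank = new_rank
    return rank

# Hypothetical toy graph: page -> pages it links to.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_graph)
```

In this toy graph, C is linked from A, B, and D and ends up with the highest rank, while D, which nothing links to, keeps only the baseline (1 - d)/N.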
Hubs and Authorities
Key ideas:
• Hubs and authorities are "fans" and "centers" in a bipartite core of a web graph.
• A good hub page is one that points to many good authority pages.
• A good authority page is one that is pointed to by many good hub pages.
HITS Algorithm
Let a be the vector of authority scores and h the vector of hub scores, and let A be the adjacency matrix of the web graph.
a = [1, 1, ..., 1], h = [1, 1, ..., 1];
do
    a = A^T h;  (authority update rule)
    h = A a;    (hub update rule)
    normalize a and h;  (divide each score by the square root of the sum of the squares of all scores)
while a and h have not converged (the change is still above a convergence threshold)
a* = a;
h* = h;
return a*, h*
The vectors a* and h* hold the authority and hub weights.
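The loop above can be sketched in plain Python. This is an illustrative version under stated assumptions: the graph is a dictionary of page to linked pages instead of an explicit matrix A, the function name, tolerance, and iteration cap are hypothetical, and convergence is judged by the largest change in any score between two rounds.

```python
import math

def hits(graph, tol=1e-8, max_iter=100):
    """Return (authority, hub) score dictionaries for a page -> linked pages graph."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    auth = {p: 1.0 for p in nodes}
    hub = {p: 1.0 for p in nodes}
    for _ in range(max_iter):
        # Authority update (a = A^T h): sum the hub scores of the pages linking to p.
        new_auth = {p: 0.0 for p in nodes}
        for q in nodes:
            for p in set(graph.get(q, ())):
                new_auth[p] += hub[q]
        # Hub update (h = A a): sum the new authority scores of the pages q links to.
        new_hub = {q: sum(new_auth[p] for p in set(graph.get(q, ()))) for q in nodes}
        # Normalize both vectors to unit length (guard against an all-zero vector).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {p: v / a_norm for p, v in new_auth.items()}
        new_hub = {p: v / h_norm for p, v in new_hub.items()}
        change = max(
            max(abs(new_auth[p] - auth[p]) for p in nodes),
            max(abs(new_hub[p] - hub[p]) for p in nodes),
        )
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub

# Hypothetical example: three hub-like pages pointing at two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
authority, hubs = hits(links)
```

Page a1 is pointed to by all three hubs and gets the highest authority score; h3 points to only one authority and gets a lower hub score than h1 and h2.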
Information Scent
Key idea:
• A user at a given page who is “foraging” for information will follow a link that “smells” of that information.
• The probability of following a link depends on how strong the “scent” is on that link.
(Figure labels: Distal Scent (content from the page at the other end of the link); Proximal Cues (snippets, graphics); Scent; pages P1 and P2)
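The slide does not give a formula for how scent turns into a probability. As a hedged illustration only, the sketch below assumes the simplest possible reading, in which the probability of following a link is proportional to its scent score. The function name and the scores are hypothetical.

```python
def link_probabilities(scent):
    """Turn per-link scent scores into follow probabilities (proportional model, an assumption)."""
    if not scent:
        return {}
    total = sum(scent.values())
    if total <= 0:
        # No scent anywhere: assume every link is equally likely.
        return {link: 1.0 / len(scent) for link in scent}
    return {link: score / total for link, score in scent.items()}

# Hypothetical scent scores for two links on the current page.
probabilities = link_probabilities({"P1": 3.0, "P2": 1.0})
```

With these scores, the link to P1 carries three times the scent of the link to P2 and is therefore three times as likely to be followed.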
Conclusion
Web structure is a useful source for extracting information such as:
• Quality of a Web page
  • The authority of a page on a topic
  • Ranking of Web pages
• Interesting Web structures
  • Graph patterns like co-citation, social choice, complete bipartite graphs, etc.
• Web page classification
  • Classifying Web pages according to various topics
• Which pages to crawl
  • Deciding which Web pages to add to the collection of Web pages
• Finding related pages
  • Given one relevant page, find all related pages
• Detection of duplicated pages
  • Detection of near-mirror sites to eliminate duplication
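As one illustration of finding related pages from link structure alone, the sketch below uses co-citation: two pages are treated as related when many other pages link to both of them. This is only one possible approach and is not taken from the slides. The function name and the toy graph are hypothetical.

```python
from collections import Counter

def related_by_cocitation(graph, page):
    """Rank other pages by how many source pages link to both them and `page`."""
    counts = Counter()
    for targets in graph.values():
        targets = set(targets)
        if page in targets:
            for other in targets - {page}:
                counts[other] += 1
    return counts.most_common()

# Hypothetical toy graph: source page -> pages it links to.
citations = {
    "s1": ["x", "y"],
    "s2": ["x", "y", "z"],
    "s3": ["x", "y"],
    "s4": ["y"],
}
related = related_by_cocitation(citations, "x")
```

Here y is cited together with x by three pages and z by only one, so y is ranked as the page most related to x.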
Any questions?
Thanks for your attention.



Editor's Notes

  • #6: Web-graph: The webgraph describes the directed links between pages of the World Wide Web. A graph, in general, consists of several vertices, some pairs of which are connected by edges. In a directed graph, edges are directed lines or arcs. The webgraph is a directed graph whose vertices correspond to the pages of the WWW, and a directed edge connects page X to page Y if there exists a hyperlink on page X referring to page Y. In-degree: the number of edges coming into a vertex in a directed graph. Out-degree: the number of edges going out of a vertex in a directed graph.
  • #8: Co-citation: a document can be found, or reached, by way of the citations made by a set of other documents.
  • #9: The name was given by Andrei Broder when he and his colleagues were trying to make sense of the collected Web data. A Web graph recognition algorithm can be applied to any directed graph to recognize its Bowtie regions, if they exist.
    SCC (CORE): the strongly connected core.
    IN is composed of those nodes that are on a directed path that ends on a node in CORE, but that are not themselves part of the CORE.
    OUT is composed of those nodes that are on a directed path that starts from a node in CORE, but that are not themselves part of the CORE.
    ISLANDS are nodes completely disconnected from CORE, IN, and OUT; that is, there is no directed path that connects them to the Bowtie.
    TENDRILS come in three flavors. TENDRILS-IN are nodes for which there is a directed path from IN, but from which there is no directed path to any other component. TENDRILS-OUT are nodes that are on a directed path to a node in OUT, but to which no path leads from any other component. TUBES are nodes that are on a path from a node in IN to a node in OUT, with no path connecting them to CORE. If their connecting paths were broken, they would end up as one or more simple TENDRILS.
    If there is no CORE, there is no Bowtie graph. But it is practically impossible for the collection of interconnected web pages in the Web graph not to have an SCC. Given a collection of web pages that are allowed to link one or more times to any other web page of the collection, an SCC is bound to arise, and with it a Bowtie.
  • #10: Knowledge model: the knowledge that specifies the particular tasks of an application. Analysis scope and properties: the scope of the algorithm can be a single node of the graph, a set of nodes, or the entire graph. The properties are the characteristics of a single node, a set of nodes, or the entire graph. Algorithms and measures: the measures are standards for properties such as quality, relevance, or distance between nodes. Algorithms are designed for efficient computation of these measures.
  • #11: http://en.wikipedia.org/wiki/PageRank
    OutDeg(P1) = the number of outgoing links of the node. Damping factor = 0.85. N: number of documents.
  • #12: Hyperlink-Induced Topic Search (HITS) was developed by Jon Kleinberg. Certain web pages are known as hubs. The scheme therefore assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.
  • #13: Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values of the pages that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Normalize: divide each score by the square root of the sum of the squares of all scores.
    HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web. However, it has some major differences:
    It is query dependent; that is, the hub and authority scores resulting from the link analysis are influenced by the search terms.
    As a corollary, it is executed at query time, not at indexing time, with the performance cost that accompanies query-time processing.
    It is not commonly used by search engines. (Though a similar algorithm was said to be used by Teoma, which was acquired by Ask.com.)
    It computes two scores per document, hub and authority, as opposed to a single score.
    It is processed on a small subset of 'relevant' documents (a 'focused subgraph' or base set), not on all documents as is the case with PageRank.
  • #14: Foraging: exploring. Scent: odor. Users visit the pages that have the most appealing scent. The probability of following a link depends on how popular that link is.