Annotating Search Results from Web Databases
ABSTRACT:
An increasing number of databases have become web accessible through HTML form-based
search interfaces. The data units returned from the underlying database are usually encoded into
the result pages dynamically for human browsing. For the encoded data units to be machine
processable, which is essential for many applications such as deep web data collection and
Internet comparison shopping, they need to be extracted and assigned meaningful labels. In
this paper, we present an automatic annotation approach that first aligns the data units on a
result page into different groups such that the data in the same group have the same semantics.
Then, for each group we annotate it from different aspects and aggregate the different
annotations to predict a final annotation label for it. An annotation wrapper for the search site is
automatically constructed and can be used to annotate new result pages from the same web
database. Our experiments indicate that the proposed approach is highly effective.
EXISTING SYSTEM:
In the existing system, a data unit is a piece of text that semantically represents one concept of
an entity; it corresponds to the value of a record under an attribute. It differs from a text
node, which is a sequence of text surrounded by a pair of HTML tags. The paper describes the
relationships between text nodes and data units in detail and performs annotation at the data
unit level. There is a high demand for collecting data of interest from multiple web databases
(WDBs). For example, once a book comparison shopping system collects multiple search result
records (SRRs) from different book sites, it needs to determine whether any two SRRs refer to
the same book.
DISADVANTAGES OF EXISTING SYSTEM:
If ISBNs are not available, the titles and authors of the two books could be compared instead.
The system also needs to list the prices offered by each site. Thus, the system needs to know
the semantics of each data unit. Unfortunately, the semantic labels of data units are often not provided in result pages. For
instance, no semantic labels for the values of title, author, publisher, etc., are given. Having
semantic labels for data units is not only important for the above record linkage task, but also
for storing collected SRRs into a database table.
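The record linkage check described above can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes each SRR has already been annotated and is held as a map from label to value, and the class name `BookLinker`, the method `sameBook`, and the label keys `isbn`, `title`, and `author` are all hypothetical.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: decides whether two annotated SRRs describe the same book.
class BookLinker {

    // Trims, lowercases, and collapses whitespace so formatting differences are ignored.
    static String normalize(String value) {
        if (value == null) {
            return "";
        }
        return value.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Compares ISBNs when both records carry one; otherwise falls back to title and author.
    static boolean sameBook(Map<String, String> a, Map<String, String> b) {
        String isbnA = normalize(a.get("isbn")).replace("-", "");
        String isbnB = normalize(b.get("isbn")).replace("-", "");
        if (!isbnA.isEmpty() && !isbnB.isEmpty()) {
            return isbnA.equals(isbnB);
        }
        String titleA = normalize(a.get("title"));
        if (titleA.isEmpty()) {
            return false; // nothing reliable to compare
        }
        return titleA.equals(normalize(b.get("title")))
                && normalize(a.get("author")).equals(normalize(b.get("author")));
    }
}
```

The sketch only works because every value carries a label. Without labels such as `title` or `author`, the comparison could not tell which data units to match.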
PROPOSED SYSTEM:
In this paper, we consider how to automatically assign labels to the data units within the SRRs
returned from WDBs. Given a set of SRRs that have been extracted from a result page returned
from a WDB, our automatic annotation solution consists of three phases: aligning the data units
into groups, annotating each group, and constructing an annotation wrapper for the search site.
ADVANTAGES OF PROPOSED SYSTEM:
This paper makes the following contributions:
1. While most existing approaches simply assign labels to each HTML text node, we
   thoroughly analyze the relationships between text nodes and data units, and we perform
   annotation at the data unit level.
2. We propose a clustering-based shifting technique to align data units into different groups
   so that the data units inside the same group have the same semantics. Instead of using only
   the DOM tree or other HTML tag tree structures of the SRRs to align the data units (as
   most current methods do), our approach also considers other important features shared
   among data units, such as their data types (DT), data contents (DC), presentation styles
   (PS), and adjacency (AD) information.
3. We utilize the integrated interface schema (IIS) over multiple WDBs in the same domain
   to enhance data unit annotation. To the best of our knowledge, we are the first to utilize
   IIS for annotating SRRs.
4. We employ six basic annotators; each annotator can independently assign labels to data
   units based on certain features of the data units. We also employ a probabilistic model to
   combine the results from different annotators into a single label. This model is highly
   flexible, so existing basic annotators may be modified and new annotators may be
   added easily without affecting the operation of other annotators.
5. We construct an annotation wrapper for any given WDB. The wrapper can be applied to
   efficiently annotate the SRRs retrieved from the same WDB with new queries.
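To make the idea of combining annotators concrete, here is a minimal sketch of one possible probabilistic combination. The text above does not spell out the model, so this is an assumption: each annotator is given an assumed success rate, annotators are treated as independent, and a label's score is the probability that at least one of the annotators voting for it is right (a noisy-OR). The class `AnnotatorCombiner` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: merges the votes of several basic annotators into one label.
class AnnotatorCombiner {

    // labels[i] is the label proposed by annotator i (null if it abstains);
    // rates[i] is the assumed success rate of annotator i, in [0, 1].
    static Map<String, Double> combine(String[] labels, double[] rates) {
        // For each label, track the probability that every supporting annotator is wrong.
        Map<String, Double> allWrong = new HashMap<String, Double>();
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == null) {
                continue;
            }
            Double soFar = allWrong.get(labels[i]);
            allWrong.put(labels[i], (soFar == null ? 1.0 : soFar) * (1.0 - rates[i]));
        }
        // Score = 1 - P(all supporting annotators are wrong).
        Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : allWrong.entrySet()) {
            scores.put(e.getKey(), 1.0 - e.getValue());
        }
        return scores;
    }

    // Returns the label with the highest combined score, or null if nobody voted.
    static String best(Map<String, Double> scores) {
        String winner = null;
        double top = -1.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > top) {
                top = e.getValue();
                winner = e.getKey();
            }
        }
        return winner;
    }
}
```

Because each annotator contributes only a label and a rate, adding or changing an annotator means adding or changing one array entry, which mirrors the flexibility claimed for the model.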
ALGORITHMS USED:
Alignment algorithm
Annotating search results from web databases
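As a rough illustration of how alignment could compare data units, the sketch below scores two data units on three of the features named earlier: data type (DT), data content (DC), and presentation style (PS). It is not the paper's algorithm. The weights, the simple type detector, the token-overlap measure, and the names `DataUnitSimilarity` and `DataUnit` are all assumptions made for the example, and adjacency (AD) is omitted for brevity.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a pairwise similarity that a clustering-based alignment could use.
class DataUnitSimilarity {

    // Assumed feature weights; they sum to 1 and are not taken from the paper.
    static final double W_DT = 0.4;
    static final double W_DC = 0.2;
    static final double W_PS = 0.4;

    // Minimal data unit: its text plus a style descriptor such as "Arial|12|bold".
    static final class DataUnit {
        final String text;
        final String style;

        DataUnit(String text, String style) {
            this.text = text;
            this.style = style;
        }
    }

    // Very coarse data type detection based on regular expressions.
    static String dataType(String text) {
        String t = text.trim();
        if (t.matches("[-+]?\\d+")) return "INTEGER";
        if (t.matches("[-+]?\\d*\\.\\d+")) return "DECIMAL";
        if (t.matches("\\$\\s?\\d+(\\.\\d+)?")) return "PRICE";
        return "STRING";
    }

    // Splits text into lowercase word tokens.
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<String>();
        for (String tok : s.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // Jaccard overlap of the two token sets (1.0 when both are empty).
    static double contentSimilarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<String>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<String>(ta);
        union.addAll(tb);
        return (double) inter.size() / union.size();
    }

    // Weighted sum of the three feature similarities, in [0, 1].
    static double similarity(DataUnit a, DataUnit b) {
        double dt = dataType(a.text).equals(dataType(b.text)) ? 1.0 : 0.0;
        double dc = contentSimilarity(a.text, b.text);
        double ps = a.style.equals(b.style) ? 1.0 : 0.0;
        return W_DT * dt + W_DC * dc + W_PS * ps;
    }
}
```

Two prices rendered in the same style score high even though their contents differ, while a price and an author name score low. That is the kind of signal that lets data units with the same meaning end up in the same group.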
SYSTEM CONFIGURATION:
HARDWARE CONFIGURATION:
- Processor: Pentium IV
- Speed: 1.1 GHz
- RAM: 256 MB (min)
- Hard Disk: 20 GB
- Keyboard: Standard Windows Keyboard
- Mouse: Two or Three Button Mouse
- Monitor: SVGA
SOFTWARE CONFIGURATION:
- Operating System: Windows XP
- Programming Language: Java
- Java Version: JDK 1.6 and above
REFERENCE:
Yiyao Lu, Hai He, Hongkun Zhao, Weiyi Meng, and Clement Yu, "Annotating Search Results
from Web Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3,
March 2013.
