Reducing Human Effort: Web Data Mining, Learning New Characteristics from Big Data
Author: Mr. M. Srinivasan
Department of Information & Technology
Priyadarshini Engineering College, Vaniyambadi, Vellore, India
Published in GRD Journal for Engineering, Volume 1, Issue 1.
Abstract:- This paper presents an approach to reducing human effort in Web data mining by learning new characteristics from Big Data, so that precise information can be extracted from previously unseen Web sites. Our approach automatically adapts the information extraction knowledge previously learned from a source Web site to a new unseen site while, at the same time, discovering previously unseen attributes. Two kinds of text-related evidence from the source Web site are considered. The first kind of evidence is obtained from the extraction patterns contained in the previously learned wrapper. The second kind of evidence is derived from the previously extracted or collected items. To handle the uncertainty involved, we design a generative model for the site-independent content information and the site-dependent layout format of the text fragments related to attribute values contained in a Web page. We have conducted extensive experiments on more than 50 real-world Web sites in more than five different domains to demonstrate the effectiveness of our framework.
Keywords- Big Data, DOM, Extraction Pattern, Wrapper Learning & Adaptation
I. Introduction
Information extraction systems aim at automatically extracting precise text fragments from documents. They can also transform largely unstructured information into structured data for further intelligent processing.
A common information extraction technique for semi-structured documents such as Web pages is the wrapper. A wrapper normally consists of a set of extraction rules that, in the past, were typically constructed manually by human experts. Recently, several wrapper learning approaches have been proposed for automatically learning wrappers from training examples.
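To illustrate what such extraction rules look like, the following is a minimal sketch of a hand-written wrapper in which each attribute is paired with a regular-expression rule keyed to the site's layout. The HTML class names and the sample page are illustrative assumptions, not taken from any site discussed in this paper.

```python
import re

# Hand-written wrapper: one extraction rule (regex) per attribute.
# Tag structure and class names below are assumed for illustration.
WRAPPER_RULES = {
    "title":  re.compile(r'<h2 class="title">(.*?)</h2>'),
    "author": re.compile(r'<span class="author">(.*?)</span>'),
    "price":  re.compile(r'<span class="price">\$([0-9.]+)</span>'),
}

def apply_wrapper(page: str) -> dict:
    """Apply every extraction rule to the page; keep all matches."""
    return {attr: rule.findall(page) for attr, rule in WRAPPER_RULES.items()}

page = '''
<h2 class="title">C, C++ Programming</h2>
<span class="author">Dr. Balagurusamy</span>
<span class="price">$12.50</span>
'''

record = apply_wrapper(page)
```

Every rule is tied to the layout of one specific site, which is why a wrapper written (or learned) for one site does not transfer to another.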
For instance, consider the Web page shown in Fig. 1, collected from a Web site in the book catalog domain. To learn a wrapper for automatically extracting information from this Web site, one can manually provide some training examples. For example, a user may label the text fragment “C, C++ Programming” as the book title and the fragment “Dr. Balagurusamy” as the corresponding author. A wrapper learning method can then automatically learn the wrapper based on the text patterns embedded in the training examples, as well as the text patterns related to the layout format embodied in the HTML document.
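The induction step above can be sketched with a toy delimiter-based learner in the spirit of classic LR wrappers: the text immediately to the left and right of a labeled example becomes the extraction pattern, which is then applied to other pages of the same site. The sample pages and the window size are illustrative assumptions, not the paper's actual learning algorithm.

```python
# Toy delimiter-based wrapper induction: learn (left, right) delimiter
# strings around one labeled example, then extract everything those
# delimiters enclose on other pages of the same site.
def learn_rule(page: str, example: str, window: int = 12):
    """Return (left, right) delimiter strings surrounding the labeled example."""
    i = page.index(example)
    return (page[max(0, i - window):i],
            page[i + len(example):i + len(example) + window])

def extract(page: str, rule):
    """Extract every fragment enclosed by the learned delimiters."""
    left, right = rule
    out, pos = [], 0
    while True:
        s = page.find(left, pos)
        if s == -1:
            return out
        s += len(left)
        e = page.find(right, s)
        if e == -1:
            return out
        out.append(page[s:e])
        pos = e + len(right)

# Learn from one labeled title on a training page...
train = '<b>C, C++ Programming</b> by <i>Dr. Balagurusamy</i>'
rule = learn_rule(train, 'C, C++ Programming', window=3)

# ...and apply the learned rule to another page of the same site.
test_page = ('<b>Java Basics</b> by <i>A. Author</i> '
             '<b>Python 101</b> by <i>B. Writer</i>')
titles = extract(test_page, rule)
```

Because the learned delimiters (here `<b>` and `</b`) encode the source site's layout, the rule generalizes across pages of that site but breaks on a site with a different format, which motivates the adaptation problem studied in this paper.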
The learned wrapper can be applied to other Web pages of the same Web site to extract information. Wrapper learning systems can thus significantly reduce the amount of human effort needed to construct wrappers. However, although many existing wrapper learning methods can effectively extract information from the same Web site and achieve very good performance, one restriction of a learned wrapper is that it cannot be applied to previously unseen Web sites, even in the same domain.
For example, the wrapper previously learned from the source Web site shown in Fig. 1 can be adapted to the new unseen site shown in Fig. 2. The adapted wrapper can then be applied to Web pages of this new site to extract data records. Consequently, wrapper adaptation can significantly reduce the human effort of preparing training examples for different sites. Another shortcoming of existing wrapper learning techniques is that the attributes extracted by a learned wrapper are limited to those defined in the training process.
As a result, they can only handle pre-specified attributes. For example, if the previously learned wrapper only contains extraction patterns for the attributes title, author, and price from the source Web site shown in Fig. 1, the adapted wrapper can at best extract these attributes from new unseen sites. However, a new unseen site may contain additional attributes that do not appear in the source Web site. For instance, the book records in Fig. 2 contain the attribute ISBN, which does not exist in Fig. 1, so the ISBN of the book records cannot be extracted. This observation leads to another objective of this paper: we investigate the problem of new attribute discovery, which aims at extracting unspecified attributes from new unseen sites. New attribute discovery can deliver more useful information to users.
II. Related Work
Previous work proposed a method that alleviates the problem of manually preparing training data by investigating wrapper adaptation. Rules are learned from a number of Web sites and then used for data extraction. One disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules.
Here a bootstrapping data repository, called the source repository, is assumed; it contains a set of objects belonging to the same domain. This approach assumes that attributes in the source repository match the attributes in the new Web site. However, exact matching is not always possible. The training stage consists of background knowledge acquisition, in which data is collected in a particular domain and a structural description of the data is learned. Based on the learned rules, data from the new site is then extracted and organized in a table format.
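Since exact matching between repository values and new-site fragments is not always possible, an approximate string similarity with a cutoff can be used instead. The following sketch uses the standard-library `SequenceMatcher` for this; the repository contents and the cutoff (0.7) are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Illustrative source repository: known attribute values from the source site.
source_repo = {
    "title":  ["C, C++ Programming"],
    "author": ["Dr. Balagurusamy"],
}

def match_fragment(fragment, repo, cutoff=0.7):
    """Return the attribute whose known values best match the fragment,
    or None when no value is similar enough (exact equality not required)."""
    best_attr, best_sim = None, cutoff
    for attr, values in repo.items():
        for v in values:
            sim = SequenceMatcher(None, fragment.lower(), v.lower()).ratio()
            if sim > best_sim:
                best_attr, best_sim = attr, sim
    return best_attr
```

A fragment such as "C/C++ Programming", which differs slightly from the stored value, can still be recognized as a title, while an unrelated fragment matches nothing.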
Each column of the table is labeled by matching the entries in the column against the patterns learned from the source site. This provides only a single attribute label for the entire column, which may consist of inconsistent or incorrectly extracted data.
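The single-label-per-column scheme can be sketched as a majority vote: each column is assigned the one attribute whose pattern matches the largest share of its entries, so a column containing mixed data still receives just one label. The attribute patterns below are illustrative assumptions.

```python
import re

# Illustrative per-attribute value patterns learned from the source site.
PATTERNS = {
    "price":  re.compile(r'^\$?\d+(\.\d{2})?$'),
    "author": re.compile(r'^(Dr\.|Mr\.|Ms\.)?\s*[A-Z][a-z]+'),
}

def label_column(column):
    """Assign the single attribute whose pattern matches the most entries;
    also report the share of entries that actually match it."""
    scores = {attr: sum(bool(p.match(v)) for v in column)
              for attr, p in PATTERNS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / len(column)

# A column with incorrectly extracted data: two prices plus a stray author.
col = ["$12.50", "$9.99", "Dr. Balagurusamy"]
best, purity = label_column(col)
```

Here the whole column is labeled "price" even though one third of its entries are not prices, which is exactly the weakness noted above.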
A generalized node of length r consists of r nodes in the HTML tag tree with the following two properties:
1) The nodes all have the same parent.
2) The nodes are adjacent.
A data region is a collection of two or more generalized nodes.
This method works as follows:
1) Step 1: Build a HTML tag tree of the page.
2) Step 2: Mining data regions in the page using the tag tree and string comparison.
3) Step 3: Identifying data records from each data region.
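The three steps above can be sketched as follows, using the standard-library HTMLParser to build the tag tree and comparing adjacent sibling subtrees by their tag sequences. The similarity threshold (0.8) and the sample page are illustrative assumptions, not the method's actual parameters.

```python
from html.parser import HTMLParser
from difflib import SequenceMatcher

class TagTree(HTMLParser):
    """Step 1: build an HTML tag tree of the page."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def tag_string(node):
    """Flatten a subtree into its sequence of tag names for comparison."""
    return [node["tag"]] + [t for c in node["children"] for t in tag_string(c)]

def data_regions(node, threshold=0.8):
    """Steps 2-3: find runs of two or more adjacent siblings (same parent)
    whose tag strings are similar; each run is a candidate data region."""
    regions, run = [], []
    kids = node["children"]
    for a, b in zip(kids, kids[1:]):
        sim = SequenceMatcher(None, tag_string(a), tag_string(b)).ratio()
        if sim >= threshold:
            run = run or [a]
            run.append(b)
        else:
            if len(run) >= 2:
                regions.append(run)
            run = []
    if len(run) >= 2:
        regions.append(run)
    for child in kids:
        regions += data_regions(child, threshold)
    return regions

html = ("<table>"
        "<tr><td>Book A</td><td>$10</td></tr>"
        "<tr><td>Book B</td><td>$12</td></tr>"
        "<tr><td>Book C</td><td>$15</td></tr>"
        "</table>")
parser = TagTree()
parser.feed(html)
regions = data_regions(parser.root)
```

On this sample page the three `<tr>` siblings share a parent, are adjacent, and have identical tag strings, so they form one data region of three candidate records. Note that the sketch, like the method itself, finds the records purely from layout regularity and says nothing about what each extracted field means.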
This method suffers from a major drawback: it cannot differentiate the type and the meaning of the information extracted. Hence, the extracted items require human effort to interpret their meaning.