Reducing Human Effort: Web Data Mining, Learning New Characteristics from Big Data
Author: Mr. M. Srinivasan
Department of Information & Technology
Priyadarshini Engineering College, Vaniyambadi, Vellore, India
Published in GRD Journal for Engineering, Volume 1, Issue 1.
Abstract:- This paper presents an approach to reducing human effort in Web data mining by learning new characteristics from Big Data, so that precise information can be extracted from previously unseen Web sites. Our approach automatically adapts the information extraction knowledge previously learned from a source Web site to a new unseen site while, at the same time, discovering previously unseen attributes. Two kinds of text-related evidence from the source Web site are considered. The first kind of evidence is obtained from the extraction patterns contained in the previously learned wrapper. The second kind of evidence is derived from the previously extracted or collected items. To handle the uncertainty involved, we design a generative model for the site-independent content information and the site-dependent layout format of the text fragments related to attribute values contained in a Web page. We have conducted extensive experiments on more than 50 real-world Web sites in more than five different domains to demonstrate the effectiveness of our framework.
Keywords- Big Data, DOM, Extraction Pattern, Wrapper Learning & Adaptation
I. Introduction
Information extraction systems aim at automatically extracting precise text fragments from documents. They can also transform largely unstructured information into structured data for further intelligent processing.
A common information extraction technique for semi-structured documents such as Web pages is the wrapper. A wrapper normally consists of a set of extraction rules that, in the past, were typically constructed manually by human experts. Recently, several wrapper learning approaches have been proposed for automatically learning wrappers from training examples.
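To illustrate what such extraction rules look like, the following is a minimal sketch of a hand-written wrapper in which each attribute is paired with a regular-expression rule keyed to the site's layout. The HTML class names and the sample page are illustrative assumptions, not taken from any site discussed in this paper.

```python
import re

# Hand-written wrapper: one extraction rule (regex) per attribute.
# Tag structure and class names below are assumed for illustration.
WRAPPER_RULES = {
    "title":  re.compile(r'<h2 class="title">(.*?)</h2>'),
    "author": re.compile(r'<span class="author">(.*?)</span>'),
    "price":  re.compile(r'<span class="price">\$([0-9.]+)</span>'),
}

def apply_wrapper(page: str) -> dict:
    """Apply every extraction rule to the page; keep all matches."""
    return {attr: rule.findall(page) for attr, rule in WRAPPER_RULES.items()}

page = '''
<h2 class="title">C, C++ Programming</h2>
<span class="author">Dr. Balagurusamy</span>
<span class="price">$12.50</span>
'''

record = apply_wrapper(page)
```

Every rule is tied to the layout of one specific site, which is why a wrapper written (or learned) for one site does not transfer to another.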
For instance, consider the Web page shown in Fig. 1, collected from a Web site in the book catalog domain. To learn a wrapper for automatically extracting information from this Web site, one can manually provide some training examples. For example, a user may label the text fragment “C, C++ Programming” as the book title and the fragment “Dr. Balagurusamy” as the corresponding author. A wrapper learning method can then automatically learn the wrapper based on the text patterns embedded in the training examples, as well as the text patterns related to the layout format embodied in the HTML document.
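The induction step above can be sketched with a toy delimiter-based learner in the spirit of classic LR wrappers: the text immediately to the left and right of a labeled example becomes the extraction pattern, which is then applied to other pages of the same site. The sample pages and the window size are illustrative assumptions, not the paper's actual learning algorithm.

```python
# Toy delimiter-based wrapper induction: learn (left, right) delimiter
# strings around one labeled example, then extract everything those
# delimiters enclose on other pages of the same site.
def learn_rule(page: str, example: str, window: int = 12):
    """Return (left, right) delimiter strings surrounding the labeled example."""
    i = page.index(example)
    return (page[max(0, i - window):i],
            page[i + len(example):i + len(example) + window])

def extract(page: str, rule):
    """Extract every fragment enclosed by the learned delimiters."""
    left, right = rule
    out, pos = [], 0
    while True:
        s = page.find(left, pos)
        if s == -1:
            return out
        s += len(left)
        e = page.find(right, s)
        if e == -1:
            return out
        out.append(page[s:e])
        pos = e + len(right)

# Learn from one labeled title on a training page...
train = '<b>C, C++ Programming</b> by <i>Dr. Balagurusamy</i>'
rule = learn_rule(train, 'C, C++ Programming', window=3)

# ...and apply the learned rule to another page of the same site.
test_page = ('<b>Java Basics</b> by <i>A. Author</i> '
             '<b>Python 101</b> by <i>B. Writer</i>')
titles = extract(test_page, rule)
```

Because the learned delimiters (here `<b>` and `</b`) encode the source site's layout, the rule generalizes across pages of that site but breaks on a site with a different format, which motivates the adaptation problem studied in this paper.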
The learned wrapper can be applied to other Web pages of the same Web site to extract information. Wrapper learning systems can thus significantly reduce the amount of human effort needed to construct wrappers. However, although many existing wrapper learning methods can effectively extract information from the same Web site and achieve very good performance, one restriction of a learned wrapper is that it cannot be applied to previously unseen Web sites, even in the same domain.
For example, the wrapper previously learned from the source Web site shown in Fig. 1 can be adapted to the new unseen site shown in Fig. 2. The adapted wrapper can then be applied to Web pages of this new site to extract data records. Consequently, wrapper adaptation can significantly reduce the human effort of preparing training examples for different sites. Another shortcoming of existing wrapper learning techniques is that the attributes extracted by a learned wrapper are limited to those defined in the training process.
As a result, they can only handle pre-specified attributes. For example, if the previously learned wrapper only contains extraction patterns for the attributes title, author, and price from the source Web site shown in Fig. 1, the adapted wrapper can at best extract these attributes from new unseen sites. However, a new unseen site may contain additional attributes that do not appear in the source Web site. For instance, the book records in Fig. 2 contain the attribute ISBN, which does not exist in Fig. 1, so the ISBN of the book records cannot be extracted. This observation leads to another objective of this paper: we investigate the problem of new attribute discovery, which aims at extracting unspecified attributes from new unseen sites. New attribute discovery can deliver more useful information to users.
II. Related Work
Previous work proposed a method that alleviates the problem of manually preparing training data by investigating wrapper adaptation. Rules are learned from a number of Web sites and then used for data extraction. One disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules.
Here a bootstrapping data repository, called the source repository, is assumed; it contains a set of objects belonging to the same domain. This approach assumes that attributes in the source repository match the attributes in the new Web site. However, exact matching is not always possible. The training stage consists of background knowledge acquisition, in which data is collected in a particular domain and a structural description of the data is learned. Based on the learned rules, data from the new site is then extracted and organized in a table format.
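Since exact matching between repository values and new-site fragments is not always possible, an approximate string similarity with a cutoff can be used instead. The following sketch uses the standard-library `SequenceMatcher` for this; the repository contents and the cutoff (0.7) are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Illustrative source repository: known attribute values from the source site.
source_repo = {
    "title":  ["C, C++ Programming"],
    "author": ["Dr. Balagurusamy"],
}

def match_fragment(fragment, repo, cutoff=0.7):
    """Return the attribute whose known values best match the fragment,
    or None when no value is similar enough (exact equality not required)."""
    best_attr, best_sim = None, cutoff
    for attr, values in repo.items():
        for v in values:
            sim = SequenceMatcher(None, fragment.lower(), v.lower()).ratio()
            if sim > best_sim:
                best_attr, best_sim = attr, sim
    return best_attr
```

A fragment such as "C/C++ Programming", which differs slightly from the stored value, can still be recognized as a title, while an unrelated fragment matches nothing.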
Each column of the table is labeled by matching the entries in the column against the patterns learned from the source site. This provides only a single attribute label for the entire column, which may consist of inconsistent or incorrectly extracted data.
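The single-label-per-column scheme can be sketched as a majority vote: each column is assigned the one attribute whose pattern matches the largest share of its entries, so a column containing mixed data still receives just one label. The attribute patterns below are illustrative assumptions.

```python
import re

# Illustrative per-attribute value patterns learned from the source site.
PATTERNS = {
    "price":  re.compile(r'^\$?\d+(\.\d{2})?$'),
    "author": re.compile(r'^(Dr\.|Mr\.|Ms\.)?\s*[A-Z][a-z]+'),
}

def label_column(column):
    """Assign the single attribute whose pattern matches the most entries;
    also report the share of entries that actually match it."""
    scores = {attr: sum(bool(p.match(v)) for v in column)
              for attr, p in PATTERNS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / len(column)

# A column with incorrectly extracted data: two prices plus a stray author.
col = ["$12.50", "$9.99", "Dr. Balagurusamy"]
best, purity = label_column(col)
```

Here the whole column is labeled "price" even though one third of its entries are not prices, which is exactly the weakness noted above.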
A generalized node of length r consists of r nodes in the HTML tag tree with the following two properties:
1) The nodes all have the same parent.
2) The nodes are adjacent.
A data region is a collection of two or more generalized nodes.
This method works as follows:
1) Step 1: Build a HTML tag tree of the page.
2) Step 2: Mining data regions in the page using the tag tree and string comparison.
3) Step 3: Identifying data records from each data region.
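The three steps above can be sketched as follows, using the standard-library HTMLParser to build the tag tree and comparing adjacent sibling subtrees by their tag sequences. The similarity threshold (0.8) and the sample page are illustrative assumptions, not the method's actual parameters.

```python
from html.parser import HTMLParser
from difflib import SequenceMatcher

class TagTree(HTMLParser):
    """Step 1: build an HTML tag tree of the page."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def tag_string(node):
    """Flatten a subtree into its sequence of tag names for comparison."""
    return [node["tag"]] + [t for c in node["children"] for t in tag_string(c)]

def data_regions(node, threshold=0.8):
    """Steps 2-3: find runs of two or more adjacent siblings (same parent)
    whose tag strings are similar; each run is a candidate data region."""
    regions, run = [], []
    kids = node["children"]
    for a, b in zip(kids, kids[1:]):
        sim = SequenceMatcher(None, tag_string(a), tag_string(b)).ratio()
        if sim >= threshold:
            run = run or [a]
            run.append(b)
        else:
            if len(run) >= 2:
                regions.append(run)
            run = []
    if len(run) >= 2:
        regions.append(run)
    for child in kids:
        regions += data_regions(child, threshold)
    return regions

html = ("<table>"
        "<tr><td>Book A</td><td>$10</td></tr>"
        "<tr><td>Book B</td><td>$12</td></tr>"
        "<tr><td>Book C</td><td>$15</td></tr>"
        "</table>")
parser = TagTree()
parser.feed(html)
regions = data_regions(parser.root)
```

On this sample page the three `<tr>` siblings share a parent, are adjacent, and have identical tag strings, so they form one data region of three candidate records. Note that the sketch, like the method itself, finds the records purely from layout regularity and says nothing about what each extracted field means.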
This method suffers from a major drawback: it cannot differentiate the type and the meaning of the information extracted. Hence, the extracted items require human effort to interpret their meaning.