
GRD Journals: Reducing Human Effort: Web Data Mining, Learning New Characteristics from Big Data

Author: Mr. M. Srinivasan

Department of Information Technology

Priyadarshini Engineering College, Vaniyambadi, Vellore, India

This paper was published in Volume 1, Issue 1 of the GRD Journal for Engineering (GRDJE).


Abstract: This paper presents an approach to reducing human effort in extracting precise information from previously unseen Web sites. Our approach automatically adapts the information extraction knowledge previously learned from a source Web site to a new unseen site while, at the same time, discovering previously unseen attributes. Two kinds of text-related evidence from the source Web site are considered. The first kind of evidence is obtained from the extraction patterns contained in the previously learned wrapper. The second kind of evidence is derived from the previously extracted or collected items. A generative model for the site-independent content information and the site-dependent layout format of the text fragments related to attribute values contained in a Web page is designed to handle the uncertainty involved. We have conducted extensive experiments on more than 50 real-world Web sites in more than five different domains to demonstrate the effectiveness of our framework.

Keywords: Big Data, DOM, Extraction Pattern, Wrapper Learning & Adaptation

I.       Introduction

Information extraction systems aim at automatically extracting precise text fragments from documents. They can also transform largely unstructured information into structured data for further intelligent processing.
A common information extraction technique for semi-structured documents such as Web pages is known as a wrapper. A wrapper normally consists of a set of extraction rules, which in the past were typically constructed manually by human experts. Recently, several wrapper learning approaches have been proposed for automatically learning wrappers from training examples.
For instance, consider the Web page shown in Fig. 1, collected from a Web site in the book catalog domain. To learn the wrapper for automatically extracting information from this Web site, one can manually provide some training examples. For example, a user may label the text fragment "C, C++ Programming" as the book title and the fragment "Dr. Balagurusamy" as the corresponding author. A wrapper learning method can then automatically learn the wrapper based on the text patterns embedded in the training examples, as well as the text patterns related to the layout format embodied in the HTML document.
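As a rough illustration (not the paper's actual learning algorithm), a learned wrapper can be thought of as a set of extraction rules keyed by attribute name. The regular expressions and the record layout below are hypothetical stand-ins for rules a learner might induce from labeled examples:

```python
import re

# Toy wrapper: a set of extraction rules tied to the source site's layout.
# This sketch assumes (hypothetically) that the site renders each record as
#   <b>TITLE</b> ... by <i>AUTHOR</i>
RULES = {
    "title": re.compile(r"<b>(.*?)</b>", re.S),
    "author": re.compile(r"by\s+<i>(.*?)</i>", re.S),
}

def apply_wrapper(html: str) -> dict:
    """Apply each extraction rule and return the matched attribute values."""
    return {attr: (m.group(1).strip() if (m := rx.search(html)) else None)
            for attr, rx in RULES.items()}

page = "<b>C, C++ Programming</b> by <i>Dr. Balagurusamy</i>"
print(apply_wrapper(page))
# {'title': 'C, C++ Programming', 'author': 'Dr. Balagurusamy'}
```

Because the rules encode the source site's layout, they break on a new site that formats records differently, which is exactly the adaptation problem the paper addresses.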
The learned wrapper can then be applied to other Web pages of the same Web site to extract information. Wrapper learning systems can significantly reduce the amount of human effort in constructing wrappers. Although many existing wrapper learning methods can effectively extract information from the same Web site and achieve very good performance, one restriction of a learned wrapper is that it cannot be applied to previously unseen Web sites, even in the same domain.
For example, the wrapper previously learned from the source Web site shown in Fig. 1 can be adapted to the new unseen site shown in Fig. 2; the adapted wrapper can then be applied to Web pages of this new site for extracting data records. Consequently, this can significantly reduce the human effort in preparing training examples for learning wrappers for different sites. Another shortcoming of existing wrapper learning techniques is that the attributes extracted by the learned wrapper are limited to those defined in the training process.
As a result, they can only handle pre-specified attributes. For example, if the previously learned wrapper only contains extraction patterns for the attributes title, author, and price from the source Web site shown in Fig. 1, the adapted wrapper can at best extract these attributes from new unseen sites. However, a new unseen site may contain additional attributes that do not appear in the source Web site. For instance, the book records in Fig. 2 contain the attribute ISBN, which does not exist in Fig. 1, so the ISBN of the book records cannot be extracted. This observation leads to another objective of this paper: we investigate the problem of new attribute discovery, which aims at extracting unspecified attributes from new unseen sites. New attribute discovery can effectively deliver more useful information to users.
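A minimal sketch of what new attribute discovery could look like: fragments in a record that no known attribute explains, but that recur with a regular format, become candidates for a new attribute. The ISBN-like pattern, the record, and the helper names below are illustrative assumptions, not the paper's actual method:

```python
import re

# Known attributes from the source-site wrapper.
KNOWN = {"title", "author", "price"}

# Hypothetical regular format: ISBN-13 written as three digits, a dash,
# then ten digits.
ISBN_RE = re.compile(r"\b\d{3}-\d{10}\b")

def discover_new_attribute(fragments, extracted):
    """Return fragments unexplained by known attributes that match the format."""
    leftover = [f for f in fragments if f not in extracted.values()]
    return [f for f in leftover if ISBN_RE.search(f)]

record = ["C, C++ Programming", "Dr. Balagurusamy", "978-0070681798", "$12.50"]
extracted = {"title": "C, C++ Programming",
             "author": "Dr. Balagurusamy",
             "price": "$12.50"}
print(discover_new_attribute(record, extracted))
# ['978-0070681798']
```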

II.       Related Work

Previous work proposed a method which alleviates the problem of manually preparing training data by investigating wrapper adaptation. Rules are learned from a number of Web sites, and these rules are then used for data extraction. One disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules.
Here a bootstrapping data repository, called the source repository, is assumed, which contains a set of objects belonging to the same domain. This approach assumes that attributes in the source repository match the attributes in the new Web site; however, exact matching is not always possible. The training stage consists of background knowledge acquisition, where data is collected in a particular domain and a structural description of the data is learned. Based on the learned rules, data from the new site is then extracted. The extracted data are then organized in a table format.
Each column of the table is labeled by matching the entries in the column against the patterns learned at the source site. This provides only a single attribute label for the entire column, which may contain inconsistent or incorrectly extracted data.
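The column-labeling step above can be sketched as follows; the overlap measure and the value sets are hypothetical simplifications of the pattern matching the method performs:

```python
# Sketch: each extracted column is matched against attribute value sets
# previously collected at the source site, and the whole column takes the
# label of the best-matching attribute (the single-label weakness noted above).

def label_column(column, source_values):
    """source_values maps attribute name -> set of values seen at the source site."""
    def overlap(vals, known):
        return sum(1 for v in vals if v in known) / len(vals)
    best = max(source_values, key=lambda attr: overlap(column, source_values[attr]))
    return best if overlap(column, source_values[best]) > 0 else None

source = {"author": {"Dr. Balagurusamy", "Herbert Schildt"},
          "title": {"C, C++ Programming", "Java: The Complete Reference"}}
col = ["Herbert Schildt", "Dr. Balagurusamy", "Unknown Writer"]
print(label_column(col, source))  # 'author'
```

Note that "Unknown Writer" is swept under the `author` label along with everything else in the column, which illustrates how inconsistent or incorrectly extracted data ends up mislabeled.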
A generalized node of length r consists of r nodes in the HTML tag tree with the following two properties:
       1)       The nodes all have the same parent.
       2)       The nodes are adjacent.
A data region is a collection of two or more generalized nodes.
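The two properties can be checked mechanically. The toy tree below is a hypothetical stand-in for a real HTML tag tree, just to make the definition concrete:

```python
# Minimal tree node; a real implementation would parse HTML into such a tree.
class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []
        if parent:
            parent.children.append(self)

def is_generalized_node(nodes):
    """True iff the nodes all share one parent (property 1)
    and are adjacent siblings (property 2)."""
    parent = nodes[0].parent
    if any(n.parent is not parent for n in nodes):
        return False  # property 1 violated: different parents
    idx = [parent.children.index(n) for n in nodes]
    return idx == list(range(idx[0], idx[0] + len(nodes)))  # property 2: adjacency

root = Node("table")
rows = [Node("tr", root) for _ in range(4)]
print(is_generalized_node(rows[0:2]))           # True
print(is_generalized_node([rows[0], rows[2]]))  # False: not adjacent
```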
This method works as follows:
      Step 1: Build an HTML tag tree of the page.
      Step 2: Mine data regions in the page using the tag tree and string comparison.
      Step 3: Identify data records from each data region.
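Step 2 can be sketched with a string-similarity grouping over adjacent sibling subtrees. The flattened tag strings and the 0.8 threshold below are assumptions for illustration, not the method's exact encoding or parameters:

```python
from difflib import SequenceMatcher

# Sketch of Step 2: runs of adjacent siblings whose (hypothetical) flattened
# tag-string encodings are highly similar are grouped into a data region.

def mine_data_regions(tag_strings, threshold=0.8):
    """Group adjacent siblings whose pairwise similarity >= threshold;
    a region needs at least two members (two or more generalized nodes)."""
    regions, run = [], [0]
    for i in range(1, len(tag_strings)):
        sim = SequenceMatcher(None, tag_strings[i - 1], tag_strings[i]).ratio()
        if sim >= threshold:
            run.append(i)
        else:
            if len(run) >= 2:
                regions.append(run)
            run = [i]
    if len(run) >= 2:
        regions.append(run)
    return regions

# Three structurally identical rows followed by an unrelated sibling.
sibs = ["tr td b td i", "tr td b td i", "tr td b td i", "div span"]
print(mine_data_regions(sibs))  # [[0, 1, 2]]
```

Step 3 would then segment each discovered region into individual data records, which is where the drawback discussed next arises: the method finds the records but not what each extracted field means.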
This method suffers from a major drawback: it cannot differentiate the type and the meaning of the extracted information. Hence, the extracted items require human effort to interpret their meaning.
