Using links to aid web classification
- Authors: Xie, Wei , Mammadov, Musa , Yearwood, John
- Date: 2007
- Type: Text , Conference paper
- Relation: Paper presented at 6th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2007, Melbourne, Victoria : 11th-13th July 2007 p. 981-986
- Description: In this paper, we present a new approach that uses link information to improve the accuracy and efficiency of web classification. Unlike other approaches, we use only the mappings between linked documents and their own class or classes. As a result, we only need to add a few features, called linked-class features, to the datasets. We apply SVM and BoosTexter for classification. We show that classification accuracy can be improved by using mixtures of ordinary word features and out-linked-class features. We analyze and discuss the reasons for this improvement.
- Description: 2003005438
Classification of HTML Documents
- Authors: Xie, Wei
- Date: 2006
- Type: Text , Thesis , PhD
- Description: Text Classification is the task of mapping a document into one or more classes based on the presence or absence of words (or features) in the document. It is being studied intensively, and different classification techniques and algorithms have been developed. This thesis focuses on the classification of online documents, which has become more critical with the development of the World Wide Web. The WWW vastly increases the availability of online documents in digital format and has highlighted the need to classify them. Against this background, we have noted the emergence of “automatic Web Classification”. Such work mainly concentrates on classifying HTML-like documents into classes or categories, not only by using the methods inherited from the traditional Text Classification process, but also by utilizing the extra information provided only by Web pages. Our work is based on the fact that Web documents contain not only ordinary features (words) but also extra information, such as meta-data and hyperlinks, that can be used to benefit the classification process. The aim of this research is to study various ways of using this extra information, in particular the hyperlink information provided by HTML documents (Web pages). The merit of the approach developed in this thesis is its simplicity compared with existing approaches. We present different approaches to using hyperlink information to improve the effectiveness of web classification. Unlike other work in this area, we use only the mappings between linked documents and their own class or classes. In this case, we only need to add a few features, called linked-class features, to the datasets and then apply classifiers to them. In the numerical experiments we adopted two well-known Text Classification algorithms, Support Vector Machines and BoosTexter. The results obtained show that classification accuracy can be improved by using mixtures of ordinary and linked-class features. Moreover, out-links usually work better than in-links in classification. We also analyse and discuss the reasons behind this improvement.
- Description: Master of Computing
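
The linked-class features described in the abstracts above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `OUT_CLASS=` feature prefix, the choice of counts as feature values, and all function names are assumptions made for illustration only. The sketch builds ordinary word features for a page and then adds one extra feature per class, counting how many out-linked pages carry that class label.

```python
from collections import Counter

# Hypothetical toy corpus (an assumption, not data from the paper or thesis):
# each page has text, a list of out-linked page ids, and known class labels.
PAGES = {
    "a": {"text": "neural network training", "out_links": ["b", "c"], "classes": ["ml"]},
    "b": {"text": "gradient descent optimizer", "out_links": [], "classes": ["ml"]},
    "c": {"text": "football league results", "out_links": ["b"], "classes": ["sport"]},
}
CLASSES = ["ml", "sport"]


def word_features(text):
    """Ordinary features: lower-cased word counts."""
    return Counter(text.lower().split())


def linked_class_features(page_id, pages, classes, prefix="OUT_CLASS="):
    """One feature per class: how many out-linked pages belong to that class.

    Only the mapping from linked pages to their classes is used;
    the text of the linked pages is never read.
    """
    counts = Counter()
    for target in pages[page_id]["out_links"]:
        # Links to pages outside the corpus, or without labels, add nothing.
        for label in pages.get(target, {}).get("classes") or []:
            counts[prefix + label] += 1
    return {prefix + c: counts[prefix + c] for c in classes}


def mixed_features(page_id, pages, classes):
    """Mixture of ordinary word features and out-linked-class features."""
    feats = dict(word_features(pages[page_id]["text"]))
    feats.update(linked_class_features(page_id, pages, classes))
    return feats


if __name__ == "__main__":
    for pid in PAGES:
        print(pid, mixed_features(pid, PAGES, CLASSES))
```

Feature dictionaries of this form could then be vectorized and passed to a classifier such as an SVM. An in-link variant would follow the same pattern, counting the classes of pages that link to the target page instead.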