Classification of HTML Documents
- Authors: Xie, Wei
- Date: 2006
- Type: Text, Thesis, PhD
- Description: Text classification is the task of mapping a document into one or more classes based on the presence or absence of words (or features) in the document. It is being studied intensively, and many classification techniques and algorithms have been developed. This thesis focuses on the classification of online documents, which has become more critical with the development of the World Wide Web. The WWW has vastly increased the availability of online documents in digital format and has highlighted the need to classify them. Against this background, we note the emergence of “automatic Web classification”. Work in this area mainly concentrates on classifying HTML-like documents into classes or categories, not only by using methods inherited from the traditional text classification process but also by utilizing the extra information provided only by Web pages. Our work is based on the fact that Web documents contain not only ordinary features (words) but also extra information, such as metadata and hyperlinks, that can be used to benefit the classification process. The aim of this research is to study various ways of using this extra information, in particular the hyperlink information provided by HTML documents (Web pages). The merit of the approach developed in this thesis is its simplicity compared with existing approaches. We present different approaches to using hyperlink information to improve the effectiveness of Web classification. Unlike other work in this area, we use only the mappings between linked documents and their own class or classes. In this case, we need only add a few features, called linked-class features, to the datasets and then apply classifiers to them. In the numerical experiments we adopted two well-known text classification algorithms, Support Vector Machines and BoosTexter. The results show that classification accuracy can be improved by using mixtures of ordinary and linked-class features. Moreover, out-links usually work better than in-links in classification. We also analyse and discuss the reasons behind this improvement.
- Description: Master of Computing
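The linked-class idea in the abstract above can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the `LINKED_CLASS=` key prefix, and the choice to count linked documents per class are all assumptions made for illustration. Each document's word counts are extended with one extra feature per class, recording how many linked documents are known to belong to that class; the resulting rows could then be passed to any classifier.

```python
from collections import Counter


def invert_links(links):
    """Turn an out-link map into an in-link map (hypothetical helper)."""
    inverted = {}
    for source, targets in links.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def add_linked_class_features(docs, links, labels, classes):
    """Return one feature dict per document: word counts plus linked-class counts.

    docs:    {doc_id: list of word tokens}
    links:   {doc_id: list of linked doc_ids} (out-links, or in-links if inverted)
    labels:  {doc_id: set of known class names}
    classes: list of all class names
    """
    features = {}
    for doc_id, tokens in docs.items():
        row = Counter(tokens)  # ordinary (word) features
        linked = Counter()     # how many linked documents carry each class
        for target in links.get(doc_id, []):
            for cls in labels.get(target, ()):
                linked[cls] += 1
        for cls in classes:
            # Assumed naming scheme; keeps class features distinct from words.
            row["LINKED_CLASS=" + cls] = linked[cls]
        features[doc_id] = dict(row)
    return features


# Toy data, invented for this example only.
docs = {"a": ["course", "exam"], "b": ["faculty"], "c": ["course"]}
out_links = {"a": ["b", "c"], "b": ["c"]}
labels = {"b": {"staff"}, "c": {"course"}}
classes = ["course", "staff"]

out_features = add_linked_class_features(docs, out_links, labels, classes)
in_features = add_linked_class_features(docs, invert_links(out_links), labels, classes)
```

Building the features once from `out_links` and once from `invert_links(out_links)` gives the two variants (out-links and in-links) that the abstract compares.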
The effectiveness of using static features in identifying scam genres
- Authors: Stabek, Amber
- Date: 2010
- Type: Text, Thesis, Masters
- Description: This thesis details a cybercrime classification framework stemming from a mixed methodological approach, which is both top-down and bottom-up and is designed to be multidisciplinary and adaptable across sectors.
- Description: Master by Research of Mathematical Sciences
- Description: Variation in scam classification is regularly identified as a primary cause of discrepancy in victim report data, resulting in unsuccessful scam identification and insufficient rates of interception by law enforcement, which in turn results in a low prosecution rate of scammers. Such discrepancies lead to complex concerns, such as the under-reporting of scam incidence and reduced rates of successful follow-up by investigative and enforcement agencies owing to difficulties in making correct referrals. Without a shared lexicon of scam labels and descriptions, communication between investigative agencies and cross-border cooperation are obstructed. Without a compatible understanding of the scam lexicon, timely progression in scam-case management leading to the identification, tracking and interception of scammer communications cannot be realised. Ambiguities that impede interpretation aid scammers by enabling their scams on cross-jurisdictional and multinational platforms. If the wide variety of known scam types could be condensed to recognisable and traceable instances, the business models that scammers use could be identified and future scamming events predicted, monitored and interrupted. Following a mixed methodology, this research aims to address some of these concerns. This is achieved by clustering scam descriptions and partitioning them into scam types, called scam genres. The result reveals homogeneous groups of scam cases and allows the effectiveness of using static features in identifying scam types to be assessed. In addition, the research identifies the most suitable model for reducing scam cases into the fewest clusters, with the fewest scam cases within each cluster, at an accuracy level of at least 95%. Through the use of hierarchical clustering, this research grouped publicly available scams into homogeneous clusters of scam genres. Two hundred and seventy-seven scams from 38 separate categories of scam classification were condensed into as few as 7 clusters of scam genre. Following a mixed methodological, grounded theoretical approach and using discriminant function analysis, 82 static features were derived from the 277 scam descriptions analysed. Of the 82 static features derived, it was concluded that only 68 significantly predicted scam type and explained 95% of the total variation found in scam case assignment. The most significant static features, determined to be crucial to any scamming campaign and useful in identifying the scam genre to which a scam case belongs, were: what the scam offered, the role of the victim, the goal of the scammer and the method of scam introduction. The results of this research provide empirical evidence of the inconsistent use of definitions across jurisdictions in scam descriptions, and will contribute to the development of a uniform lexicon of scamming terminology as well as become foundational to further research on the impact of scams for law enforcement, the public and private sector, the community and the individual.
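The hierarchical clustering step mentioned in the abstract above can be sketched roughly as follows. The abstract does not state which distance measure or linkage rule was used, so the Jaccard distance and average linkage here are assumptions, as are the function names and the toy feature labels. Each scam description is represented as a set of static features, and the two closest clusters are merged repeatedly until the requested number remains.

```python
def jaccard_distance(a, b):
    """Distance between two sets of static features (0.0 means identical)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, items):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(items[i], items[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(items, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    items is a list of feature sets; the result is a list of clusters,
    each a list of indices into items.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], items)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy scam descriptions with invented static features (illustration only).
scams = [
    {"offer:money", "victim:recipient", "intro:email"},
    {"offer:money", "victim:recipient", "intro:letter"},
    {"offer:romance", "victim:partner", "intro:website"},
    {"offer:romance", "victim:partner", "intro:email"},
]
genres = agglomerate(scams, 2)
```

On this toy input, the two money-offer descriptions fall into one cluster and the two romance-offer descriptions into the other. The pairwise search is quadratic in the number of clusters at every merge, which is acceptable for a few hundred descriptions but would need a library implementation at larger scale.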