Automatically generating classifier for phishing email prediction
- Authors: Ma, Liping , Torney, Rosemary , Watters, Paul , Brown, Simon
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at I-SPAN 2009 - The 10th International Symposium on Pervasive Systems, Algorithms, and Networks, Kaohsiung, Taiwan : 14th-16th December 2009 p. 779-783
- Description: Phishing is a form of online identity theft that employs both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials. Phishing email prediction has drawn considerable attention from researchers. According to current anti-phishing research, a classifier generated by a decision tree produces the most accurate predictions. However, there appears to be no open-source tool available to convert such a decision tree into an implementable classifier. The work presented in this paper builds a decision tree parser which automatically translates a decision tree into an implementable programming language so that the decision tree is useful in real-world applications. Experimental results show that the parser performs as well as the original decision tree. © 2009 IEEE.
Applications of machine learning for linguistic analysis of texts
- Authors: Torney, Rosemary , Yearwood, John , Vamplew, Peter , Kelarev, Andrei
- Date: 2012
- Type: Text , Book chapter
- Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 133-148
- Full Text: false
- Description: This chapter describes a novel multistage method for linguistic clustering of large collections of texts available on the Internet as a precursor to linguistic analysis of these texts. This method addresses the practicalities of applying clustering operations to a very large set of text documents by using a combination of unsupervised clustering and supervised classification. The method relies on creating a multitude of independent clusterings of a randomized sample selected from the International Corpus of Learner English. Several consensus functions and sophisticated algorithms are applied in two substages to combine these independent clusterings into one final consensus clustering, which is then used to train fast classifiers in order to enable them to perform the profiling of very large collections of text and web data. This approach makes it possible to apply advanced, highly accurate and sophisticated clustering techniques by combining them with fast supervised classification algorithms. For this multistage method to be effective, it is crucial to determine how well the supervised classification algorithms are going to perform at the final stage, when they are used to process large data sets available on the Internet. This performance may also serve as an indication of the quality of the combined consensus clustering obtained in the preceding stages. The authors' experimental results compare the performance of several classification algorithms incorporated in this multistage scheme and demonstrate that several of these classification algorithms achieve very high precision and recall and can be used in practical implementations of their method.
Using psycholinguistic features for profiling first language of authors
- Authors: Torney, Rosemary , Vamplew, Peter , Yearwood, John
- Date: 2012
- Type: Text , Journal article
- Relation: Journal of the American Society for Information Science and Technology Vol. 63, no. 6 (2012), p. 1256-1269
- Full Text: false
- Description: This study empirically evaluates the effectiveness of different feature types for the classification of the first language of an author. In particular, it examines the utility of psycholinguistic features, extracted by the Linguistic Inquiry and Word Count (LIWC) tool, that have not previously been applied to the task of author profiling. As LIWC is a tool that has been developed in the psycholinguistic field rather than the computational linguistics field, it was hypothesized that it would be effective, both as a single-type feature set because of its psycholinguistic basis, and in combination with other feature sets, because it should be sufficiently different to add insight rather than redundancy. It was found that LIWC features were competitive with previously used feature types in identifying the first language of an author, and that combined feature sets including LIWC features consistently showed better accuracy rates and average F measures than were achieved by the same feature sets without the LIWC features. As a secondary issue, this study also examined how effectively first language classification scaled up to a larger number of possible languages. It was found that the classification scheme scaled up effectively to the entire 16-language collection from the International Corpus of Learner English, when compared with results achieved on just 5 languages in previous research. © 2012 ASIS&T.
Application of psycholinguistic features to authorship profiling for first language, gender and age group
- Authors: Torney, Rosemary
- Date: 2014
- Type: Text , Thesis , PhD
- Description: Much of the fraud committed in cyberspace involves the misrepresentation of the demographic data of the perpetrator via the medium of seemingly anonymous text messages. One way to address this issue is to apply techniques from the field of authorship characterisation or profiling, which is the analysis of text to determine the demographic profile of the author. Most of the previous research into authorship characterisation has used counts and ratios of lexicographically based features that include words, parts of words and Parts Of Speech (POS) contained within the text. This study examines the effectiveness of classifying the first language, gender and age group of an author using a set of features developed in the psycholinguistic field (the Linguistic Inquiry and Word Count - LIWC), both as a single-type feature set and in combination with the lexicographically based features used in previous studies (function words, character bigrams and POS unigrams and bigrams). This study also searched for the smallest, most effective subset of each feature set that was practical, by ranking the features using three feature selection algorithms and systematically reducing the number used. In addition, the study explored the effective lower word limit for accurate classification by reducing the text size by regular increments. LIWC was found to be more effective than a similar number of any of the lexicographic feature types, and to add insight rather than noise when combined with these feature types. This held true for both the full and reduced text sizes for all three demographic classes examined. In addition, it was found that the size of feature sets could be greatly reduced while still maintaining effective levels of classification accuracy.
- Description: Doctor of Philosophy