Authorship attribution for Twitter in 140 characters or less
- Layton, Robert, Watters, Paul, Dazeley, Richard
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at - 2nd Cybercrime and Trustworthy Computing Workshop, CTC 2010 p. 1-8
- Full Text:
- Reviewed:
- Description: Authorship attribution is a growing field, moving from beginnings in linguistics to recent advances in text mining. Through this change came an increase in the capability of authorship attribution methods both in their accuracy and the ability to consider more difficult problems. Research into authorship attribution in the 19th century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this values had decreased to less than 500 words and in the early 21 st century it was considered possible to determine the authorship of a document in 250 words. The need for this ever decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer communications, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. This trend has also been shown in online crime, where many attacks such as phishing or bullying are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chats (IRCs) which have allowed members to hide behind screen names and connect anonymously. More recently, Twitter and other short message based web services have been used as a hosting ground for online crimes. This paper presents some evaluations of current techniques and identifies some new preprocessing methods that can be used to enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter1. We show that the SCAP methodology performs extremely well on twitter messages and even with restrictions on the types of information allowed, such as the recipient of directed messages, still perform significantly higher than chance. Further to this, we show that 120 tweets per user is an important threshold, at which point adding more tweets per user gives a small but non-significant increase in accuracy. © 2010 IEEE.
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at - 2nd Cybercrime and Trustworthy Computing Workshop, CTC 2010 p. 1-8
- Full Text:
- Reviewed:
- Description: Authorship attribution is a growing field, moving from beginnings in linguistics to recent advances in text mining. Through this change came an increase in the capability of authorship attribution methods both in their accuracy and the ability to consider more difficult problems. Research into authorship attribution in the 19th century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this values had decreased to less than 500 words and in the early 21 st century it was considered possible to determine the authorship of a document in 250 words. The need for this ever decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer communications, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. This trend has also been shown in online crime, where many attacks such as phishing or bullying are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chats (IRCs) which have allowed members to hide behind screen names and connect anonymously. More recently, Twitter and other short message based web services have been used as a hosting ground for online crimes. This paper presents some evaluations of current techniques and identifies some new preprocessing methods that can be used to enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter1. We show that the SCAP methodology performs extremely well on twitter messages and even with restrictions on the types of information allowed, such as the recipient of directed messages, still perform significantly higher than chance. Further to this, we show that 120 tweets per user is an important threshold, at which point adding more tweets per user gives a small but non-significant increase in accuracy. © 2010 IEEE.
Consensus clustering and supervised classification for profiling phishing emails in internet commerce security
- Dazeley, Richard, Yearwood, John, Kang, Byeongho, Kelarev, Andrei
- Authors: Dazeley, Richard , Yearwood, John , Kang, Byeongho , Kelarev, Andrei
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 11th International Workshop on Knowledge Management and Acquisition for Smart Systems and Services, PKAW 2010 Vol. 6232 LNAI, p. 235-246
- Full Text:
- Reviewed:
- Description: This article investigates internet commerce security applications of a novel combined method, which uses unsupervised consensus clustering algorithms in combination with supervised classification methods. First, a variety of independent clustering algorithms are applied to a randomized sample of data. Second, several consensus functions and sophisticated algorithms are used to combine these independent clusterings into one final consensus clustering. Third, the consensus clustering of the randomized sample is used as a training set to train several fast supervised classification algorithms. Finally, these fast classification algorithms are used to classify the whole large data set. One of the advantages of this approach is in its ability to facilitate the inclusion of contributions from domain experts in order to adjust the training set created by consensus clustering. We apply this approach to profiling phishing emails selected from a very large data set supplied by the industry partners of the Centre for Informatics and Applied Optimization. Our experiments compare the performance of several classification algorithms incorporated in this scheme. © 2010 Springer-Verlag Berlin Heidelberg.
- Authors: Dazeley, Richard , Yearwood, John , Kang, Byeongho , Kelarev, Andrei
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 11th International Workshop on Knowledge Management and Acquisition for Smart Systems and Services, PKAW 2010 Vol. 6232 LNAI, p. 235-246
- Full Text:
- Reviewed:
- Description: This article investigates internet commerce security applications of a novel combined method, which uses unsupervised consensus clustering algorithms in combination with supervised classification methods. First, a variety of independent clustering algorithms are applied to a randomized sample of data. Second, several consensus functions and sophisticated algorithms are used to combine these independent clusterings into one final consensus clustering. Third, the consensus clustering of the randomized sample is used as a training set to train several fast supervised classification algorithms. Finally, these fast classification algorithms are used to classify the whole large data set. One of the advantages of this approach is in its ability to facilitate the inclusion of contributions from domain experts in order to adjust the training set created by consensus clustering. We apply this approach to profiling phishing emails selected from a very large data set supplied by the industry partners of the Centre for Informatics and Applied Optimization. Our experiments compare the performance of several classification algorithms incorporated in this scheme. © 2010 Springer-Verlag Berlin Heidelberg.
Unsupervised authorship analysis of phishing webpages
- Layton, Robert, Watters, Paul, Dazeley, Richard
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2012
- Type: Text , Conference proceedings
- Full Text:
- Description: Authorship analysis on phishing websites enables the investigation of phishing attacks, beyond basic analysis. In authorship analysis, salient features from documents are used to determine properties about the author, such as which of a set of candidate authors wrote a given document. In unsupervised authorship analysis, the aim is to group documents such that all documents by one author are grouped together. Applying this to cyber-attacks shows the size and scope of attacks from specific groups. This in turn allows investigators to focus their attention on specific attacking groups rather than trying to profile multiple independent attackers. In this paper, we analyse phishing websites using the current state of the art unsupervised authorship analysis method, called NUANCE. The results indicate that the application produces clusters which correlate strongly to authorship, evaluated using expert knowledge and external information as well as showing an improvement over a previous approach with known flaws. © 2012 IEEE.
- Description: 2003010678
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2012
- Type: Text , Conference proceedings
- Full Text:
- Description: Authorship analysis on phishing websites enables the investigation of phishing attacks, beyond basic analysis. In authorship analysis, salient features from documents are used to determine properties about the author, such as which of a set of candidate authors wrote a given document. In unsupervised authorship analysis, the aim is to group documents such that all documents by one author are grouped together. Applying this to cyber-attacks shows the size and scope of attacks from specific groups. This in turn allows investigators to focus their attention on specific attacking groups rather than trying to profile multiple independent attackers. In this paper, we analyse phishing websites using the current state of the art unsupervised authorship analysis method, called NUANCE. The results indicate that the application produces clusters which correlate strongly to authorship, evaluated using expert knowledge and external information as well as showing an improvement over a previous approach with known flaws. © 2012 IEEE.
- Description: 2003010678
- «
- ‹
- 1
- ›
- »