Authorship attribution for Twitter in 140 characters or less
- Layton, Robert, Watters, Paul, Dazeley, Richard
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at - 2nd Cybercrime and Trustworthy Computing Workshop, CTC 2010 p. 1-8
- Full Text:
- Reviewed:
- Description: Authorship attribution is a growing field, moving from beginnings in linguistics to recent advances in text mining. Through this change came an increase in the capability of authorship attribution methods both in their accuracy and the ability to consider more difficult problems. Research into authorship attribution in the 19th century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this values had decreased to less than 500 words and in the early 21 st century it was considered possible to determine the authorship of a document in 250 words. The need for this ever decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer communications, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. This trend has also been shown in online crime, where many attacks such as phishing or bullying are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chats (IRCs) which have allowed members to hide behind screen names and connect anonymously. More recently, Twitter and other short message based web services have been used as a hosting ground for online crimes. This paper presents some evaluations of current techniques and identifies some new preprocessing methods that can be used to enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter1. We show that the SCAP methodology performs extremely well on twitter messages and even with restrictions on the types of information allowed, such as the recipient of directed messages, still perform significantly higher than chance. Further to this, we show that 120 tweets per user is an important threshold, at which point adding more tweets per user gives a small but non-significant increase in accuracy. © 2010 IEEE.
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at - 2nd Cybercrime and Trustworthy Computing Workshop, CTC 2010 p. 1-8
- Full Text:
- Reviewed:
- Description: Authorship attribution is a growing field, moving from beginnings in linguistics to recent advances in text mining. Through this change came an increase in the capability of authorship attribution methods both in their accuracy and the ability to consider more difficult problems. Research into authorship attribution in the 19th century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this values had decreased to less than 500 words and in the early 21 st century it was considered possible to determine the authorship of a document in 250 words. The need for this ever decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer communications, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. This trend has also been shown in online crime, where many attacks such as phishing or bullying are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chats (IRCs) which have allowed members to hide behind screen names and connect anonymously. More recently, Twitter and other short message based web services have been used as a hosting ground for online crimes. This paper presents some evaluations of current techniques and identifies some new preprocessing methods that can be used to enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter1. We show that the SCAP methodology performs extremely well on twitter messages and even with restrictions on the types of information allowed, such as the recipient of directed messages, still perform significantly higher than chance. Further to this, we show that 120 tweets per user is an important threshold, at which point adding more tweets per user gives a small but non-significant increase in accuracy. © 2010 IEEE.
Local n-grams for author identification: Notebook for PAN at CLEF 2013 C3 - CEUR Workshop Proceedings
- Layton, Robert, Watters, Paul, Dazeley, Richard
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2013
- Type: Text , Conference proceedings
- Full Text:
- Description: Our approach to the author identification task uses existing authorship attribution methods using local n-grams (LNG) and performs a weighted ensemble. This approach came in third for this year's competition, using a relatively simple scheme of weights by training set accuracy. LNG models create profiles, consisting of a list of character n-grams that best represent a particular author's writing. The use of a weighted ensemble improved upon the accuracy of the method without reducing the speed of the algorithm; the submitted solution was not only near the top of the leaderboard in terms of accuracy, but it was also one of the faster algorithms submitted.
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2013
- Type: Text , Conference proceedings
- Full Text:
- Description: Our approach to the author identification task uses existing authorship attribution methods using local n-grams (LNG) and performs a weighted ensemble. This approach came in third for this year's competition, using a relatively simple scheme of weights by training set accuracy. LNG models create profiles, consisting of a list of character n-grams that best represent a particular author's writing. The use of a weighted ensemble improved upon the accuracy of the method without reducing the speed of the algorithm; the submitted solution was not only near the top of the leaderboard in terms of accuracy, but it was also one of the faster algorithms submitted.
Recentred local profiles for authorship attribution
- Layton, Robert, Watters, Paul, Dazeley, Richard
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2012
- Type: Text , Journal article
- Relation: Natural Language Engineering Vol. 18, no. 3 (2012), p. 293-312
- Full Text:
- Reviewed:
- Description: Authorship attribution methods aim to determine the author of a document, by using information gathered from a set of documents with known authors. One method of performing this task is to create profiles containing distinctive features known to be used by each author. In this paper, a new method of creating an author or document profile is presented that detects features considered distinctive, compared to normal language usage. This recentreing approach creates more accurate profiles than previous methods, as demonstrated empirically using a known corpus of authorship problems. This method, named recentred local profiles, determines authorship accurately using a simple 'best matching author' approach to classification, compared to other methods in the literature. The proposed method is shown to be more stable than related methods as parameter values change. Using a weighted voting scheme, recentred local profiles is shown to outperform other methods in authorship attribution, with an overall accuracy of 69.9% on the ad-hoc authorship attribution competition corpus, representing a significant improvement over related methods. Copyright © Cambridge University Press 2011.
- Description: 2003010688
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2012
- Type: Text , Journal article
- Relation: Natural Language Engineering Vol. 18, no. 3 (2012), p. 293-312
- Full Text:
- Reviewed:
- Description: Authorship attribution methods aim to determine the author of a document, by using information gathered from a set of documents with known authors. One method of performing this task is to create profiles containing distinctive features known to be used by each author. In this paper, a new method of creating an author or document profile is presented that detects features considered distinctive, compared to normal language usage. This recentreing approach creates more accurate profiles than previous methods, as demonstrated empirically using a known corpus of authorship problems. This method, named recentred local profiles, determines authorship accurately using a simple 'best matching author' approach to classification, compared to other methods in the literature. The proposed method is shown to be more stable than related methods as parameter values change. Using a weighted voting scheme, recentred local profiles is shown to outperform other methods in authorship attribution, with an overall accuracy of 69.9% on the ad-hoc authorship attribution competition corpus, representing a significant improvement over related methods. Copyright © Cambridge University Press 2011.
- Description: 2003010688
- «
- ‹
- 1
- ›
- »