API design for machine learning software: experiences from the scikit-learn project
- Authors: Buitinck, Lars , Louppe, Gilles , Blondel, Mathieu , Pedregosa, Fabian , Mueller, Andreas , Grisel, Olivier , Niculae, Vlad , Prettenhofer, Peter , Gramfort, Alexandre , Grobler, Jaques , Layton, Robert , Vanderplas, Jake , Joly, Arnaud , Holt, Brian , Varoquaux, Gael
- Date: 2013
- Type: Text , Conference paper
- Relation: Workshop on Languages for Data Mining and Machine Learning (LML 2013)
- Full Text: false
- Reviewed:
- Description: Scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.
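The shared interface this abstract describes is scikit-learn's estimator API: hyperparameters are set in the constructor, learning happens in `fit`, and inference in `predict` or `transform`. A minimal sketch of that interface and its composability (the dataset and estimators chosen here are arbitrary examples, not from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Every estimator exposes the same interface: hyperparameters in
# __init__, learning in fit(), inference in predict()/transform().
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Composition: transformers and a final predictor chain into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)                 # learns scaler statistics and classifier weights
accuracy = model.score(X, y)    # uniform scoring API across estimators
print(round(accuracy, 2))
```

Because every step obeys the same contract, `make_pipeline` can chain any transformer with any predictor, which is the composability advantage the abstract highlights.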
Application of SVM in citation information extraction
- Authors: Liang, Jiguang , Layton, Robert , Wang, Wei
- Date: 2011
- Type: Text , Conference proceedings
- Full Text: false
- Description: Support Vector Machines (SVMs) are an effective binary classification algorithm. To enhance the utilization of text structural features for information extraction, which are greatly restricted by the Hidden Markov Model (HMM), this paper proposes a support vector machine multi-class classifier based on Markov properties to extract information from a citation database. The proposed model extracts symbol characteristics as features and composes a binary tree of the transition probabilities. Experiments show that the proposed method outperforms HMM and basic SVM methods. © 2011 IEEE.
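The paper's specific binary-tree construction over transition probabilities is not reproduced here; the following is only a generic multi-class SVM sketch for labelling citation tokens, with invented tokens, labels, and features for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy citation tokens labelled with the bibliographic field they belong to.
# Real systems would use richer symbol/structural features, as the paper
# describes; these tokens and labels are illustrative only.
tokens = ["Smith,", "J.", "(2009).", "Learning", "to", "parse.",
          "Journal", "of", "Examples,", "12(3),", "45-67."]
fields = ["author", "author", "year", "title", "title", "title",
          "venue", "venue", "venue", "volume", "pages"]

# Character n-grams capture punctuation cues such as parentheses and commas.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LinearSVC(),  # one-vs-rest multi-class; the paper instead uses a binary tree
)
model.fit(tokens, fields)
print(model.predict(["(2011)."]))
```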
Authorship analysis of aliases: Does topic influence accuracy?
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2013
- Type: Text , Journal article
- Relation: Natural Language Engineering Vol. Online first, no. (2013), p.
- Full Text:
- Reviewed:
- Description: Aliases play an important role in online environments by facilitating anonymity, but also can be used to hide the identity of cybercriminals. Previous studies have investigated this alias matching problem in an attempt to identify whether two aliases are shared by an author, which can assist with identifying users. Those studies create their training data by randomly splitting the documents associated with an alias into two sub-aliases. Models have been built that can regularly achieve over 90% accuracy for recovering the linkage between these ‘random sub-aliases’. In this paper, random sub-alias generation is shown to enable these high accuracies, and thus does not adequately model the real-world problem. In contrast, creating sub-aliases using topic-based splitting drastically reduces the accuracy of all authorship methods tested. We then present a methodology that can be performed on non-topic controlled datasets, to produce topic-based sub-aliases that are more difficult to match. Finally, we present an experimental comparison between many authorship methods to see which methods better match aliases under these conditions, finding that local n-gram methods perform better than others.
Authorship analysis of the zeus botnet source code
- Authors: Layton, Robert , Azab, Ahmad
- Date: 2014
- Type: Text , Conference proceedings
- Full Text: false
- Description: Authorship analysis has been used successfully to analyse the provenance of source code files in previous studies. The source code for Zeus, one of the most damaging and effective botnets to date, was leaked in 2011. In this research, we analyse the source code through the lens of authorship clustering, aiming to estimate how many people wrote this malware and what their roles were. The research provides insight into the structure that went into creating Zeus and its evolution over time. The work has potential to be used to link the malware with other malware written by the same authors, helping investigations, classification, deterrence and detection. © 2014 IEEE.
Authorship attribution for Twitter in 140 characters or less
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at 2nd Cybercrime and Trustworthy Computing Workshop, CTC 2010 p. 1-8
- Full Text:
- Reviewed:
- Description: Authorship attribution is a growing field, moving from beginnings in linguistics to recent advances in text mining. Through this change came an increase in the capability of authorship attribution methods, both in their accuracy and in the ability to consider more difficult problems. Research into authorship attribution in the 19th century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this value had decreased to less than 500 words, and in the early 21st century it was considered possible to determine the authorship of a document in 250 words. The need for this ever decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer communications, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. This trend has also been shown in online crime, where many attacks such as phishing or bullying are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chats (IRCs), which have allowed members to hide behind screen names and connect anonymously. More recently, Twitter and other short message based web services have been used as a hosting ground for online crimes. This paper presents some evaluations of current techniques and identifies some new preprocessing methods that can be used to enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter. We show that the SCAP methodology performs extremely well on Twitter messages and, even with restrictions on the types of information allowed, such as the recipient of directed messages, still performs significantly better than chance. Further to this, we show that 120 tweets per user is an important threshold, at which point adding more tweets per user gives a small but non-significant increase in accuracy. © 2010 IEEE.
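The SCAP methodology mentioned in this abstract profiles each author by their most frequent character n-grams and attributes a document to the author whose profile overlaps it most. A minimal sketch (the aliases, messages, and the parameter choices n=3 and L=500 are illustrative assumptions, not the paper's settings):

```python
from collections import Counter

def profile(texts, n=3, L=500):
    """SCAP profile: the L most frequent character n-grams of an author."""
    counts = Counter()
    for text in texts:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram for gram, _ in counts.most_common(L)}

def attribute(message, author_profiles, n=3, L=500):
    """Attribute a message to the author whose profile overlaps its own
    profile most (the Simplified Profile Intersection score of SCAP)."""
    target = profile([message], n=n, L=L)
    return max(author_profiles, key=lambda a: len(target & author_profiles[a]))

# Toy tweets per author; real use needs far more data, e.g. the ~120
# messages per user the paper identifies as an important threshold.
training = {
    "alice": ["loving the weather today!!", "coffee then work, as always!!"],
    "bob": ["Check the match results.", "Traffic was terrible again."],
}
profiles = {a: profile(msgs) for a, msgs in training.items()}
print(attribute("such great weather!!", profiles))
```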
Authorship attribution of IRC messages using inverse author frequency
- Authors: Layton, Robert , McCombie, Stephen , Watters, Paul
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: Internet Relay Chat (IRC) is a useful and relatively simple protocol for text based chat online, used in a variety of areas online such as for discussion and technical support. IRC is also used for cybercrime, with online rooms selling stolen credit card details, botnet access and malware. The reasons for the use of IRC in cybercrime include the widespread adoption and ease of use, but also focus around the anonymity granted by the protocol, allowing users to hide behind aliases that can be changed regularly. In this research, we apply authorship analysis techniques to be able to attribute chat messages to known aliases. A preliminary experiment shows that this application is very difficult, due to the short messages and repeated information. To improve the accuracy, we apply inverse-author-frequency (iaf) weighting, which gives higher weights to features used by fewer authors. This research is the first time that iaf has been applied to character n-gram models, previously being applied to word based models of authorship. We find that this improves the accuracy significantly for the RLP method and provides a platform for successful applications of authorship analysis in the future. Overall, the method achieves accuracies of over 55% in a very difficult application domain. © 2012 IEEE.
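The inverse-author-frequency (iaf) weighting described in this abstract is analogous to inverse document frequency, but counts authors rather than documents: features used by fewer authors receive higher weights. A minimal sketch over character n-grams (the aliases and chat lines are invented for illustration):

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def iaf_weights(author_docs, n=3):
    """Inverse-author-frequency weights, by analogy with idf: a feature's
    weight is log(num_authors / num_authors_using_it)."""
    num_authors = len(author_docs)
    author_freq = Counter()
    for docs in author_docs.values():
        grams = set()
        for doc in docs:
            grams.update(char_ngrams(doc, n))
        author_freq.update(grams)  # each author counted at most once per n-gram
    return {g: math.log(num_authors / af) for g, af in author_freq.items()}

# Toy chat lines per alias (illustrative data only).
corpus = {
    "alias1": ["selling cc dumps cheap", "fresh dumps today"],
    "alias2": ["anyone selling access?", "need botnet access"],
    "alias3": ["quiet channel today"],
}
weights = iaf_weights(corpus)
# An n-gram used by all aliases would get weight log(3/3) = 0; "ell"
# ("selling") is used by two of the three aliases, so it gets log(3/2).
print(weights["ell"])
```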
Automated unsupervised authorship analysis using evidence accumulation clustering
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2013
- Type: Text , Journal article
- Relation: Natural Language Engineering Vol. 19, no. 1 (2013), p. 95-120
- Full Text:
- Reviewed:
- Description: Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of clusters of authorship within a corpus by analysts. However, there is a need in many fields for more sophisticated unsupervised methods to automate the discovery, profiling and organisation of related information through clustering of documents by authorship. An automated and unsupervised methodology for clustering documents by authorship is proposed in this paper. The methodology is named NUANCE, for n-gram Unsupervised Automated Natural Cluster Ensemble. Testing indicates that the derived clusters have a strong correlation to the true authorship of unseen documents. © 2011 Cambridge University Press.
Automatically determining phishing campaigns using the USCAP methodology
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2010
- Type: Text , Conference paper
- Relation: Paper presented at General Members Meeting and eCrime Researchers Summit, eCrime 2010 p. 1-8
- Full Text:
- Reviewed:
- Description: Phishing fraudsters attempt to create an environment which looks and feels like a legitimate institution, while at the same time attempting to bypass filters and the suspicions of their targets. This is a difficult compromise for the phishers and presents a weakness in the process of conducting this fraud. In this research, a methodology is presented that looks at the differences that occur between phishing websites from an authorship analysis perspective and is able to determine different phishing campaigns undertaken by phishing groups. The methodology is named USCAP, for Unsupervised SCAP, which builds on the SCAP methodology from supervised authorship analysis and extends it to unsupervised learning problems. The phishing website source code is examined to generate a model that gives the size and scope of each of the recognized phishing campaigns. The USCAP methodology is the first to cluster phishing websites by campaign in an automatic and reliable way, compared to previous methods which relied on costly expert analysis of phishing websites. Evaluation of these clusters indicates that each cluster is strongly consistent, with high stability and reliability when analyzed using new information about the attacks, such as the dates that the attacks occurred on. The clusters found are indicative of different phishing campaigns, presenting a step towards an automated phishing authorship analysis methodology. © 2010 IEEE.
Automating Open Source Intelligence: Algorithms for OSINT
- Authors: Layton, Robert , Watters, Paul
- Date: 2015
- Type: Text , Book
- Full Text: false
- Reviewed:
- Description: Algorithms for Automating Open Source Intelligence (OSINT) presents information on the gathering of information and extraction of actionable intelligence from openly available sources, including news broadcasts, public repositories, and more recently, social media. As OSINT has applications in crime fighting, state-based intelligence, and social research, this book provides recent advances in text mining, web crawling, and other algorithms that have led to advances in methods that can largely automate this process.
Be careful who you trust: Issues with the public key infrastructure
- Authors: Black, Paul , Layton, Robert
- Type: Text , Conference paper
- Relation: 5th Cybercrime and Trustworthy Computing Conference, CTC 2014, Auckland, 24-25th November, 2014 published in Cybercrime and Trustworthy Computing Conference (CTC) p. 12-21
- Full Text: false
- Reviewed:
- Description: The modern digital internet economy and billions of dollars of trade are made possible by the internet security which is provided by operating system and web browser developers. This paper provides a survey of how this security is implemented through the use of digital certificates and the Public Key Infrastructure. Documented cases of the abuse of these digital certificates are given. It is shown that these problems arise from a combination of commercial pressures and a failure of the designers of internet security to consider the fundamental security principle of least privilege. Measures which are used to mitigate these problems are noted and new PKI architectural components which are designed to correct existing problems are examined.
Characterising and predicting cyber attacks using the Cyber Attacker Model Profile (CAMP)
- Authors: Watters, Paul , McCombie, Stephen , Layton, Robert , Pieprzyk, Josef
- Date: 2012
- Type: Text , Journal article
- Relation: Journal of Money Laundering Control Vol. 15, no. 4 (2012), p. 430-441
- Full Text: false
- Reviewed:
- Description: Purpose – Ethnographic studies of cyber attacks typically aim to explain a particular profile of attackers in qualitative terms. The purpose of this paper is to formalise some of the approaches to build a Cyber Attacker Model Profile (CAMP) that can be used to characterise and predict cyber attacks. Design/methodology/approach – The paper builds a model using social and economic independent or predictive variables from several eastern European countries and benchmarks indicators of cybercrime within the Australian financial services system. Findings – The paper found a very strong link between perceived corruption and GDP in two distinct groups of countries – corruption in Russia was closely linked to the GDP of Belarus, Moldova and Russia, while corruption in Lithuania was linked to GDP in Estonia, Latvia, Lithuania and Ukraine. At the same time corruption in Russia and Ukraine were also closely linked. These results support previous research that indicates a strong link between the legitimate economy and the black economy in many countries of Eastern Europe and the Baltic states. The results of the regression analysis suggest that a highly skilled workforce which is mobile and working in an environment of high perceived corruption in the target countries is related to increases in cybercrime even within Australia. It is important to note that the data used for the dependent and independent variables were gathered over a seven year time period, which included large economic shocks such as the global financial crisis. Originality/value – This is the first paper to use a modelling approach to directly show the relationship between various social, economic and demographic factors in the Baltic states and Eastern Europe, and the level of card skimming and card not present fraud in Australia. Acknowledgements: Paul A. Watters and Robert Layton are funded by IBM, Westpac, the State Government of Victoria and the Australian Federal Police.
Characterising network traffic for Skype forensics
- Authors: Azab, Ahmad , Watters, Paul , Layton, Robert
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: Voice over IP (VoIP) is increasingly replacing fixed line telephone systems globally due to lower cost, call quality improvements over digital lines and ease of availability. At the same time, criminals have also transitioned to using this environment, creating challenges for law enforcement, since interception of VoIP traffic is more difficult than in a traditional telephony environment. One key problem for proprietary VoIP applications like Skype is being able to reliably identify and characterize network traffic. In this paper, the latest Skype version and its components are analyzed in terms of network traffic behavior for the login, call establishment, call answering and status change phases. Network conditions tested included blocking different port numbers, inbound connections and outbound connections. The results provide a clearer view of the difficulties in characterizing Skype traffic in forensic contexts. We also found various changes from previous investigations into older versions of Skype. © 2012 IEEE.
Crime toolkits: The productisation of cybercrime
- Authors: Alazab, Ammar , Abawajy, Jemal , Hobbs, Michael , Layton, Robert , Khraisat, Ansam
- Date: 2013
- Type: Text , Conference paper
- Relation: Proceedings - 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2013 p. 1626-1632
- Full Text: false
- Reviewed:
Determining provenance in phishing websites using automated conceptual analysis
- Authors: Layton, Robert , Watters, Paul
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at 2009 eCrime Researchers Summit, eCRIME '09, Tacoma, Washington : 20th-21st October 2009 p. 1-7
- Full Text:
- Description: Phishing is a form of online fraud with drastic consequences for the victims and institutions being defrauded. A phishing attack tries to create a believable environment for the intended victim to enter their confidential data such that the attacker can use or sell this information later. In order to apprehend phishers, law enforcement agencies need automated systems capable of tracking the size and scope of phishing attacks, in order to more wisely use their resources shutting down the major players, rather than wasting resources stopping smaller operations. In order to develop these systems, phishing attacks need to be clustered by provenance in a way that adequately profiles these evolving attackers. The research presented in this paper looks at the viability of using automated conceptual analysis through cluster analysis techniques on phishing websites, with the aim of determining provenance of these phishing attacks. Conceptual analysis is performed on the source code of the websites, rather than the final text that is displayed to the user, eliminating problems with rendering obfuscation and increasing the distinctiveness brought about by differences in coding styles of the phishers. By using cluster analysis algorithms, distinguishing factors between groups of phishing websites can be obtained. The results indicate that it is difficult to separate websites by provenance without also separating by intent, by looking at the phishing websites alone. Instead, the methods discussed in this paper should form part of a larger system that uses more information about the phishing attacks.
Ethical considerations when using online datasets for research purposes
- Authors: Kopp, Christian , Layton, Robert , Gondal, Iqbal , Sillitoe, Jim
- Date: 2015
- Type: Text , Book chapter
- Relation: Automating Open Source Intelligence: Algorithms for OSINT p. 131-157
- Full Text: false
- Reviewed:
- Description: The Internet has become an important community communications platform, supporting a range of programs and virtual environments. While there are many ways in which people choose to develop personal interactions over the Internet, one of the most popular manifestations is the creation and maintenance of social relationships using social and dating websites. In this chapter, the collection and use of data from such sites is assessed from an ethical frame, and key concepts such as informed consent, information, comprehension, and voluntariness are outlined.
Evaluating authorship distance methods using the positive Silhouette coefficient
- Authors: Layton, Robert , Watters, Paul , Dazeley, Richard
- Date: 2013
- Type: Text , Journal article
- Relation: Natural Language Engineering Vol. 19, no. 4 (2013), p. 517-535
- Full Text:
- Reviewed:
- Description: Unsupervised Authorship Analysis (UAA) aims to cluster documents by authorship without knowing the authorship of any documents. An important factor in UAA is the method for calculating the distance between documents. This choice of the authorship distance method is considered more critical to the end result than the choice of cluster analysis algorithm. One method for measuring the correlation between a distance metric and a labelling (such as class values or clusters) is the Silhouette Coefficient (SC). The SC can be leveraged by measuring the correlation between the authorship distance method and the true authorship, evaluating the quality of the distance method. However, we show that the SC can be severely affected by outliers. To address this issue, we introduce the Positive Silhouette Coefficient (PSC), given as the proportion of instances with a positive SC value. This measure is not easily altered by outliers and is therefore more robust. A large number of authorship distance methods are then compared using the PSC, and the findings are presented. This research provides an insight into the efficacy of methods for UAA and presents a framework for testing authorship distance methods.
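The Positive Silhouette Coefficient defined in this abstract is the proportion of instances whose per-sample silhouette value is positive. A minimal sketch using scikit-learn's `silhouette_samples` (the dataset here is an arbitrary example, not authorship data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

def positive_silhouette(X, labels, metric="euclidean"):
    """Positive Silhouette Coefficient: the fraction of instances with a
    positive per-sample silhouette value, which is less sensitive to
    outliers than the mean silhouette."""
    values = silhouette_samples(X, labels, metric=metric)
    return float(np.mean(values > 0))

X, y = make_blobs(n_samples=100, centers=3, random_state=1)
print(positive_silhouette(X, y))  # well-separated blobs score close to 1.0
```

For authorship use, `metric` would be replaced by the authorship distance method under evaluation (via a precomputed distance matrix), with the true authorship as `labels`.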
Fake file detection in P2P networks by consensus and reputation
- Authors: Watters, Paul , Layton, Robert
- Date: 2011
- Type: Text , Conference proceedings
- Full Text: false
- Description: Previous research [1] has indicated that reputation scores can be used as the basis for trust computation in P2P networks. In this paper, we use reputation scores calculated from P2P search engine rating sites to determine whether a torrent is likely to be linked to a fake file (or not). Our results indicate clear separability between files which are fake and which are genuine, assuming the integrity of the "community" ratings provided by specific subcultural groups [2]. Suggestions for more sophisticated reputation-based scoring are also provided. © 2011 Crown.
How much material on BitTorrent is infringing content? A case study
- Authors: Watters, Paul , Layton, Robert , Dazeley, Richard
- Date: 2011
- Type: Text , Journal article
- Relation: Information Security Technical Report Vol. 16, no. 2 (2011), p. 79-87
- Full Text: false
- Reviewed:
- Description: BitTorrent is a widely used protocol for peer-to-peer (P2P) file sharing, including material which is often suspected to be infringing content. However, little systematic research has been undertaken to measure the true extent of illegal file sharing. In this paper, we propose a new methodology for measuring the extent of infringing content. Our initial results indicate that at least 89.9% of files shared contain infringing content, with a replication study on another sample finding 97%. We discuss the limitations of the approach in this case study, including sampling biases, and outline proposals to further verify the results. The implications of the work vis-à-vis the management of piracy at the network level are discussed. © 2011 Published by Elsevier Ltd. All rights reserved.
ICANN or ICANT: Is WHOIS an Enabler of Cybercrime?
- Authors: Watters, Paul , Herps, Aaron , Layton, Robert , McCombie, Stephen
- Date: 2013
- Type: Text , Conference paper
- Relation: Proceedings - 4th Cybercrime and Trustworthy Computing Workshop, CTC 2013 p. 44-49
- Full Text: false
- Reviewed:
- Description: WHOIS acts as a registry for organisations or individuals who 'own' or take responsibility for domains. For any registry to be functional, its integrity needs to be assured. Unfortunately, WHOIS data does not appear to meet basic integrity requirements in many cases, reducing the effectiveness of law enforcement and rightsholders in requesting takedowns for phishing kits, zombie hosts that are part of a botnet, or infringing content. In this paper, we illustrate the problem using a case study from trademark protection, where investigators attempt to trace fake goods being advertised on Facebook. The results indicate that ICANN needs to at least introduce minimum verification standards for WHOIS records vis-à-vis integrity, and optimally, develop a system for rapid takedowns in the event that a domain is being misused.
Identifying cyber predators through forensic authorship analysis of chat logs
- Authors: Amuchi, Faith , Al-Nemrat, Ameer , Alazab, Mamoun , Layton, Robert
- Date: 2012
- Type: Text , Conference proceedings
- Full Text: false
- Description: Online grooming is a growing phenomenon within online environments. One of the major problems encountered in qualitative internet research of chat communication is the issue of anonymity, which is being exploited and greatly enjoyed by chatters. An important question that has been asked in the literature is 'How can a researcher be sure to analyse the communication of children and adolescents and not the chat communication of adults who pretend to be under 18?'. Our reply to this question would be the field of Authorship Analysis, which offers a way to unmask the anonymity of cyber predators. Stylometry, as used in this chat log analysis, is a type of Authorship Analysis that is not based on an author's handwriting but includes contextual clues from the content of their writings. This research paper analyses the application of different authorship attribution techniques to chat logs from a forensic perspective. © 2012 IEEE.