Spam email categorization with nlp and using federated deep learning
- Authors: Ul Haq, Ikram , Black, Paul , Gondal, Iqbal , Kamruzzaman, Joarder , Watters, Paul , Kayes, A.
- Date: 2022
- Type: Text , Conference paper
- Relation: 18th International Conference on Advanced Data Mining and Applications, ADMA 2022, Brisbane, Australia, 28-30 November 2022, Advanced Data Mining and Applications, 18th International Conference, ADMA 2022 Vol. 13726 LNAI, p. 15-27
- Full Text: false
- Reviewed:
- Description: Emails are the most popular and efficient communication method that makes them vulnerable to misuse. Federated learning (FL) provides a decentralized machine learning (ML) model, where a central server coordinates clients that collaboratively train a shared ML model. This paper proposes Federated Phishing Filtering (FPF) technique based on federated learning, natural language processing, and deep learning. FL for intelligent algorithms fuses trained models of ML algorithms from multiple sites for collective learning. This approach improves ML performance by utilizing large collective training data sets across the corporate client base, resulting in higher phishing email detection accuracy. FPF techniques preserve email privacy using local feature extraction on client email servers. Thus, the contents of emails do not need to be transmitted across the network or stored on third-party servers. We have applied FL and Natural Language Processing (NLP) for email phishing detection. This technique provides four training modes that perform FL without sharing email content. Our research categorizes emails as benign, spam, and phishing. Empirical evaluations with publicly available datasets show that accuracy is improved by the use of our Federated Deep Learning model. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
AFES: An advanced forensic evidence system
- Authors: Black, Paul , Gondal, Iqbal , Brooks, Richard , Yu, Lu
- Date: 2021
- Type: Text , Conference proceedings
- Relation: 2021 IEEE 25th International Enterprise Distributed Object Computing Workshop (EDOCW), Gold Coast, Australia, 25-29th October, 2021 p. 67-74
- Full Text: false
- Reviewed:
- Description: News media often contain reports that raise doubt related to policing operations. We examine the question of how to improve policing integrity during the execution of search warrants and provide an outline for law enforcement search warrants and digital forensic analysis procedures. Existing techniques for improving the integrity of search warrants are reviewed, limitations are noted, and we propose an Advanced Forensic Evidence System (AFES) to address these limitations.AFES provides an immutable record and biometric authentication of the officers present during the execution of a search warrant, time and location, video recording, seizure record, contemporaneous notes, and photographs. AFES records digital evidence items, imaging details, evidence hashes, provides an access control system, and an immutable record of access to all stored items. AFES uses a permissioned distributed ledger prototype, called Scrybe, developed under NSF aegis, to ensure evidence seizure integrity. Scrybe is run as multiple blockchain instances at law enforcement, prosecution, judicial, and defence organisations to ensure that an immutable record is maintained.
Cross-compiler bipartite vulnerability search
- Authors: Black, Paul , Gondal, Iqbal
- Date: 2021
- Type: Text , Journal article
- Relation: Electronics (Switzerland) Vol. 10, no. 11 (2021), p.
- Full Text:
- Reviewed:
- Description: Open-source libraries are widely used in software development, and the functions from these libraries may contain security vulnerabilities that can provide gateways for attackers. This paper provides a function similarity technique to identify vulnerable functions in compiled programs and proposes a new technique called Cross-Compiler Bipartite Vulnerability Search (CCBVS). CCBVS uses a novel training process, and bipartite matching to filter SVM model false positives to improve the quality of similar function identification. This research uses debug symbols in programs compiled from open-source software products to generate the ground truth. This automatic extraction of ground truth allows experimentation with a wide range of programs. The results presented in the paper show that an SVM model trained on a wide variety of programs compiled for Windows and Linux, x86 and Intel 64 architectures can be used to predict function similarity and that the use of bipartite matching substantially improves the function similarity matching performance. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
Malware variant identification using incremental clustering
- Authors: Black, Paul , Gondal, Iqbal , Bagirov, Adil , Moniruzzaman, Md
- Date: 2021
- Type: Text , Journal article
- Relation: Electronics Vol. 10, no. 14 (2021), p.
- Relation: http://purl.org/au-research/grants/arc/DP190100580
- Full Text:
- Reviewed:
Reanimating historic malware samples
- Authors: Black, Paul , Gondal, Iqbal , Vamplew, Peter , Lakhotia, Arun
- Date: 2021
- Type: Text , Book chapter
- Relation: Malware Analysis Using Artificial Intelligence and Deep Learning p. 345-360
- Full Text: false
- Reviewed:
- Description: Many types of malicious software are controlled from an attacker’s command and control (C2) servers. Anti-virus organizations seek to defeat malware attacks by requesting removal of C2 server Domain Name Server (DNS) records. As a result, the life span of most malware samples is relatively short. Large datasets of historical malware samples are available for countermeasures research. However, due to the age of these malware samples, their C2 servers are no longer available. To cope with high volumes of malware production, malware analysis is increasingly performed using machine learning techniques. Dynamic analysis is commonly used for feature extraction. However, due to the absence of their C2 servers, after initialization, malware samples may exit or loop attempting to establish C2 server connections and, as a result, no longer exhibit their original capabilities. Therefore, partial execution of historical malware samples in a sandbox results in features that differ from those that would be extracted in-the-wild, thus invalidating the results of any machine learning research based on these features. One approach to extracting accurate features is to build an emulated C2 server to provide an environment that allows control of the full capabilities of the malware in an isolated environment. To illustrate the benefits of building C2 server emulators, this chapter provides examples of techniques for the creation of C2 server emulators for three malware families (Zeus, CryptoWall, and CryptoLocker) using manual reverse engineering techniques and a review of semi-automated techniques for the construction of C2 server emulators.
API based discrimination of ransomware and benign cryptographic programs
- Authors: Black, Paul , Sohail, Ammar , Gondal, Iqbal , Kamruzzaman, Joarder , Vamplew, Peter , Watters, Paul
- Date: 2020
- Type: Text , Conference paper
- Relation: 27th International Conference on Neural Information Processing, ICONIP 2020, Bangkok, 18 to 22 November 2020, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 12533 LNCS, p. 177-188
- Full Text: false
- Reviewed:
- Description: Ransomware is a widespread class of malware that encrypts files in a victim’s computer and extorts victims into paying a fee to regain access to their data. Previous research has proposed methods for ransomware detection using machine learning techniques. However, this research has not examined the precision of ransomware detection. While existing techniques show an overall high accuracy in detecting novel ransomware samples, previous research does not investigate the discrimination of novel ransomware from benign cryptographic programs. This is a critical, practical limitation of current research; machine learning based techniques would be limited in their practical benefit if they generated too many false positives (at best) or deleted/quarantined critical data (at worst). We examine the ability of machine learning techniques based on Application Programming Interface (API) profile features to discriminate novel ransomware from benign-cryptographic programs. This research provides a ransomware detection technique that provides improved detection accuracy and precision compared to other API profile based ransomware detection techniques while using significantly simpler features than previous dynamic ransomware detection research. © 2020, Springer Nature Switzerland AG.
Function similarity using family context
- Authors: Black, Paul , Gondal, Iqbal , Vamplew, Peter , Lakhotia, Arun
- Date: 2020
- Type: Text , Journal article
- Relation: Electronics Vol. 9, no. 7 (Jul 2020), p. 20
- Full Text:
- Reviewed:
- Description: Finding changed and similar functions between a pair of binaries is an important problem in malware attribution and for the identification of new malware capabilities. This paper presents a new technique called Function Similarity using Family Context (FSFC) for this problem. FSFC trains a Support Vector Machine (SVM) model using pairs of similar functions from two program variants. This method improves upon previous research called Cross Version Contextual Function Similarity (CVCFS) e epresenting a function using features extracted not just from the function itself, but also, from other functions with which it has a caller and callee relationship. We present the results of an initial experiment that shows that the use of additional features from the context of a function significantly decreases the false positive rate, obviating the need for a separate pass for cleaning false positives. The more surprising and unexpected finding is that the SVM model produced by FSFC can abstract function similarity features from one pair of program variants to find similar functions in an unrelated pair of program variants. If validated by a larger study, this new property leads to the possibility of creating generic similar function classifiers that can be packaged and distributed in reverse engineering tools such as IDA Pro and Ghidra.
- Description: This research was performed in the Internet Commerce Security Lab (ICSL), which is a joint venture with research partners Westpac, IBM, and Federation University Australia.
Identifying cross-version function similarity using contextual features
- Authors: Black, Paul , Gondal, Iqbal , Vamplew, Peter , Lakhotia, Arun
- Date: 2020
- Type: Text , Conference paper
- Relation: 19th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2020 p. 810-818
- Full Text: false
- Reviewed:
- Description: The identification of similar functions in malware assists analysis by supporting the exclusion of functions that have been previously analysed, allows the identification of new variants, supports authorship attribution, and the analysis of malware phylogeny. A function's context is a set comprising the function itself and all the program functions that may be executed when this function is called. Contextual features consist of data that is extracted from the functions contained in the function context. This paper presents a novel technique called Cross Version Contextual Function Similarity (CVCFS) to identify function pairs in two programs using features based on both individual functions and function context. The CVCFS technique uses Support Vector Machine (SVM) machine learning of function similarity features to pre-filter function pairs and then applies an edit distance technique using function semantics to reduce false positives. A case study is provided where individual and contextual features are extracted from three versions of Zeus malware. The SVM pre-filtering, followed by the use of an edit distance technique to filter false positives, gives a function pair identification accuracy of 85 percent. © 2020 IEEE.
Techniques for the reverse engineering of banking malware
- Authors: Black, Paul
- Date: 2020
- Type: Text , Thesis , PhD
- Full Text:
- Description: Malware attacks are a significant and frequently reported problem, adversely affecting the productivity of organisations and governments worldwide. The well-documented consequences of malware attacks include financial loss, data loss, reputation damage, infrastructure damage, theft of intellectual property, compromise of commercial negotiations, and national security risks. Mitiga-tion activities involve a significant amount of manual analysis. Therefore, there is a need for automated techniques for malware analysis to identify malicious behaviours. Research into automated techniques for malware analysis covers a wide range of activities. This thesis consists of a series of studies: an anal-ysis of banking malware families and their common behaviours, an emulated command and control environment for dynamic malware analysis, a technique to identify similar malware functions, and a technique for the detection of ransomware. An analysis of the nature of banking malware, its major malware families, behaviours, variants, and inter-relationships are provided in this thesis. In doing this, this research takes a broad view of malware analysis, starting with the implementation of the malicious behaviours through to detailed analysis using machine learning. The broad approach taken in this thesis differs from some other studies that approach malware research in a more abstract sense. A disadvantage of approaching malware research without domain knowledge, is that important methodology questions may not be considered. Large datasets of historical malware samples are available for countermea-sures research. However, due to the age of these samples, the original malware infrastructure is no longer available, often restricting malware operations to initialisation functions only. To address this absence, an emulated command and control environment is provided. This emulated environment provides full control of the malware, enabling the capabilities of the original in-the-wild operation, while enabling feature extraction for research purposes. A major focus of this thesis has been the development of a machine learn-ing function similarity method with a novel feature encoding that increases feature strength. This research develops techniques to demonstrate that the machine learning model trained on similarity features from one program can find similar functions in another, unrelated program. This finding can lead to the development of generic similar function classifiers that can be packaged and distributed in reverse engineering tools such as IDA Pro and Ghidra. Further, this research examines the use of API call features for the identi-fication of ransomware and shows that a failure to consider malware analysis domain knowledge can lead to weaknesses in experimental design. In this case, we show that existing research has difficulty in discriminating between ransomware and benign cryptographic software. This thesis by publication, has developed techniques to advance the disci-pline of malware reverse engineering, in order to minimize harm due to cyber-attacks on critical infrastructure, government institutions, and industry.
- Description: Doctor of Philosophy
Evolved similarity techniques in malware analysis
- Authors: Black, Paul , Gondal, Iqbal , Vamplew, Peter , Lakhotia, Arun
- Date: 2019
- Type: Text , Conference proceedings
- Relation: 2019 18th IEEE International Conference On Trust, Security And Privacy; published in In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), 5-8th Aug, 2019 p. 404-410
- Full Text: false
- Reviewed:
- Description: Malware authors are known to reuse existing code, this development process results in software evolution and a sequence of versions of a malware family containing functions that show a divergence from the initial version. This paper proposes the term evolved similarity to account for this gradual divergence of similarity across the version history of a malware family. While existing techniques are able to match functions in different versions of malware, these techniques work best when the version changes are relatively small. This paper introduces the concept of evolved similarity and presents automated Evolved Similarity Techniques (EST). EST differs from existing malware function similarity techniques by focusing on the identification of significantly modified functions in adjacent malware versions and may also be used to identify function similarity in malware samples that differ by several versions. The challenge in identifying evolved malware function pairs lies in identifying features that are relatively invariant across evolved code. The research in this paper makes use of the function call graph to establish these features and then demonstrates the use of these techniques using Zeus malware.
Mining malware secrets
- Authors: Lakhotia, Arun , Black, Paul
- Date: 2017
- Type: Text , Conference proceedings , Conference paper
- Relation: 2017 12th International Conference On Malicious and Unwanted Software, MALWARE 2017; Fajardo, United States; 11th-14th October 2017 Vol. 2018, p. 11-18
- Full Text: false
- Reviewed:
- Description: Malware analysts, besides being tasked to create signatures, are also called upon to generate indicators of compromise, to disrupt botnets, to attribute an attack to an actor, and to understand the adversary's intent. This requires extracting from malware a variety of secrets, aka threat intelligence. After studying a few samples from a malware family and locating where its secrets are embedded, analysts create rules that may be used to automatically extract threat intelligence from malware variants in the future. Rules to extract secrets from malware are today written as regular expressions over bytecodes, such as using Yara. These rules are easily invalidated by polymorphic variants or evolutionary versions. Keeping the rules updated is a maintenance challenge for malware analysts. Instead of using bytecode, we present the use of code semantics to create rules to extract malware secrets. The semantics of code captures the effect of instructions on the registers and memory. Rules written using the structure of the symbolic content of registers and memory, instead of bytecode, are more resilient to code transformation and evolutionary changes, and are thus less brittle and easier to maintain.
Anti-analysis trends in banking malware
- Authors: Black, Paul , Opacki, Joseph
- Date: 2016
- Type: Text , Conference proceedings
- Relation: 2016 11th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 18-21 October 2016 p. 1-7
- Full Text: false
- Reviewed:
- Description: Banking Malware, has become a popular and ever more prevalent mechanism to monetise malware development. Since the development of the Zeus malware kit in 2007, the frequency and complexity of banking malware has been increasing. Developing a good understanding of the operation of a malware family is a first step in the reverse engineering required to create tools to extract the malware configuration, which is used in the remediation of malware infrastructure. This reverse engineering process in recent years has become increasingly challenging. This manuscript provides a brief summary of the reverse engineering of banking malware families over a two year period and emphasises the anti-analysis techniques employed by the authors of six families of banking malware. The manuscript presents this analysis, and examines trends in the development of these anti-analysis techniques.
Be careful who you trust: Issues with the public key infrastructure
- Authors: Black, Paul , Layton, Robert
- Type: Text , Conference paper
- Relation: 5th Cybercrime and Trustworthy Computing Conference, CTC 2014, Auckland, 24-25th November, 2014 published in Cybercrime and Trustworthy Computing Conference (CTC) p. 12-21
- Full Text: false
- Reviewed:
- Description: The modern digital internet economy and billions of dollars of trade are made possible by the internet security which is provided by operating system and web browser developers. This paper provides a survey of how this security is implemented through the use of digital certificates and the Public Key Infrastructure. Documented cases of the abuse of these digital certificates are given. It is shown that these problems arise from a combination of commercial pressures and a failure of the designers of internet security to consider the fundamental security principal of least privilege. Measures which are used to mitigate these problems are noted and new PKI architectural components which are designed to correct existing problems are examined.