Spam email categorization with nlp and using federated deep learning
- Authors: Ul Haq, Ikram , Black, Paul , Gondal, Iqbal , Kamruzzaman, Joarder , Watters, Paul , Kayes, A.
- Date: 2022
- Type: Text , Conference paper
- Relation: 18th International Conference on Advanced Data Mining and Applications, ADMA 2022, Brisbane, Australia, 28-30 November 2022, Advanced Data Mining and Applications, 18th International Conference, ADMA 2022 Vol. 13726 LNAI, p. 15-27
- Full Text: false
- Reviewed:
- Description: Emails are the most popular and efficient communication method that makes them vulnerable to misuse. Federated learning (FL) provides a decentralized machine learning (ML) model, where a central server coordinates clients that collaboratively train a shared ML model. This paper proposes Federated Phishing Filtering (FPF) technique based on federated learning, natural language processing, and deep learning. FL for intelligent algorithms fuses trained models of ML algorithms from multiple sites for collective learning. This approach improves ML performance by utilizing large collective training data sets across the corporate client base, resulting in higher phishing email detection accuracy. FPF techniques preserve email privacy using local feature extraction on client email servers. Thus, the contents of emails do not need to be transmitted across the network or stored on third-party servers. We have applied FL and Natural Language Processing (NLP) for email phishing detection. This technique provides four training modes that perform FL without sharing email content. Our research categorizes emails as benign, spam, and phishing. Empirical evaluations with publicly available datasets show that accuracy is improved by the use of our Federated Deep Learning model. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Rapid health data repository allocation using predictive machine learning
- Authors: Uddin, Ashraf , Stranieri, Andrew , Gondal, Iqbal , Balasubramanian, Venki
- Date: 2020
- Type: Text , Journal article
- Relation: Health Informatics Journal Vol. 26, no. 4 (2020), p. 3009-3036
- Full Text:
- Reviewed:
- Description: Health-related data is stored in a number of repositories that are managed and controlled by different entities. For instance, Electronic Health Records are usually administered by governments. Electronic Medical Records are typically controlled by health care providers, whereas Personal Health Records are managed directly by patients. Recently, Blockchain-based health record systems largely regulated by technology have emerged as another type of repository. Repositories for storing health data differ from one another based on cost, level of security and quality of performance. Not only has the type of repositories increased in recent years, but the quantum of health data to be stored has increased. For instance, the advent of wearable sensors that capture physiological signs has resulted in an exponential growth in digital health data. The increase in the types of repository and amount of data has driven a need for intelligent processes to select appropriate repositories as data is collected. However, the storage allocation decision is complex and nuanced. The challenges are exacerbated when health data are continuously streamed, as is the case with wearable sensors. Although patients are not always solely responsible for determining which repository should be used, they typically have some input into this decision. Patients can be expected to have idiosyncratic preferences regarding storage decisions depending on their unique contexts. In this paper, we propose a predictive model for the storage of health data that can meet patient needs and make storage decisions rapidly, in real-time, even with data streaming from wearable sensors. The model is built with a machine learning classifier that learns the mapping between characteristics of health data and features of storage repositories from a training set generated synthetically from correlations evident from small samples of experts. Results from the evaluation demonstrate the viability of the machine learning technique used. © The Author(s) 2020.
Generative malware outbreak detection
- Authors: Park, Sean , Gondal, Iqbal , Kamruzzaman, Joarder , Oliver, Jon
- Date: 2019
- Type: Text , Conference proceedings , Conference paper
- Relation: 2019 IEEE International Conference on Industrial Technology, ICIT 2019 Vol. 2019-February, p. 1149-1154
- Full Text: false
- Reviewed:
- Description: Recently several deep learning approaches have been attempted to detect malware binaries using convolutional neural networks and stacked deep autoencoders. Although they have shown respectable performance on a large corpus of dataset, practical defense systems require precise detection during the malware outbreaks where only a handful of samples are available. This paper demonstrates the effectiveness of the latent representations obtained through the adversarial autoencoder for malware outbreak detection. Using instruction sequence distribution mapped to a semantic latent vector, the model provides a highly effective neural signature that helps detecting variants of a previously identified malware within a campaign mutated with minor functional upgrade, function shuffling, or slightly modified obfuscations. The method demonstrates how adversarial autoencoder can turn a multiclass classification task into a clustering problem when the sample set size is limited and the distribution is biased. The model performance is evaluated on OS X malware dataset against traditional machine learning models. © 2019 IEEE.
- Description: E1
Instruction cognitive one-shot malware outbreak detection
- Authors: Park, Sean , Gondal, Iqbal , Kamruzzaman, Joarder , Oliver, Jon
- Date: 2019
- Type: Text , Conference paper
- Relation: 26th International Conference on Neural Information Processing [ICONIP 2019] December 12-15 2019, Proceedings, Part IV Vol. 1142, p. 769-778
- Full Text: false
- Reviewed:
- Description: New malware outbreaks cannot provide thousands of training samples which are required to counter malware campaigns. In some cases, there could be just one sample. So, the defense system at the firing line must be able to quickly detect many automatically generated variants using a single malware instance observed from the initial outbreak by tatically inspecting the binary executables. As previous research works show, statistical features such as term frequency-inverse document frequency and n-gram are significantly vulnerable to attacks by mutation through reinforcement learning. Recent studies focus on raw binary executable as a base feature which contains instructions describing the core logic of the sample. However, many approaches using image-matching neural networks are insufficient due to the malware mutation technique that generates a large number of samples with high entropy data. Deriving instruction cognitive representation that disambiguates legitimate instructions from the context is necessary for accurate detection over raw binary executables. In this paper, we present a novel method of detecting semantically similar malware variants within a campaign using a single raw binary malware executable. We utilize Discrete Fourier Transform of instruction cognitive representation extracted from self-attention transformer network. The experiments were conducted with in-the-wild malware samples from ransomware and banking Trojan campaigns. The proposed method outperforms several state of the art binary classification models.
- Description: E1
One-shot malware outbreak detection using spatio-temporal isomorphic dynamic features
- Authors: Park, Sean , Gondal, Iqbal , Kamruzzaman, Joarder , Zhang, Leo
- Date: 2019
- Type: Text , Conference proceedings , Conference paper
- Relation: 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering, TrustCom/BigDataSE 2019 p. 751-756
- Full Text: false
- Reviewed:
- Description: Fingerprinting the malware by its behavioural signature has been an attractive approach for malware detection due to the homogeneity of dynamic execution patterns across different variants of similar families. Although previous researches show reasonably good performance in dynamic detection using machine learning techniques on a large corpus of training set, decisions must be undertaken based upon a scarce number of observable samples in many practical defence scenarios. This paper demonstrates the effectiveness of generative adversarial autoencoder for dynamic malware detection under outbreak situations where in most cases a single sample is available for training the machine learning algorithm to detect similar samples that are in the wild. © 2019 IEEE.
- Description: E1