Meaning-sensitive text data augmentation with intelligent masking
- Authors: Kasthuriarachchy, Buddhika; Chetty, Madhu; Shatte, Adrian; Walls, Darren
- Date: 2023
- Type: Text; Journal article
- Relation: ACM Transactions on Intelligent Systems and Technology Vol. 14, no. 6 (2023), p.
- Full Text: false
- Reviewed:
- Description: With the recent popularity of applying large-scale deep neural network-based models to natural language processing (NLP), interest in developing methods for text data augmentation is at its peak, since the limited size of training data tends to significantly affect the accuracy of these models. To this end, we propose a novel text data augmentation technique called Intelligent Masking with Optimal Substitutions Text Data Augmentation (IMOSA). IMOSA, developed for labelled sentences, identifies the most favourable sentences and locates the appropriate word combinations within a sentence to replace, generating synthetic sentences whose meaning stays close to the original while significantly increasing the diversity of the dataset. We demonstrate that the proposed technique notably improves the performance of classifiers based on attention-based transformer models through extensive experiments on five text classification tasks performed under a low-data regime in a context-aware NLP setting. The analysis clearly shows that IMOSA effectively generates more sentences from favourable original examples and completely ignores undesirable ones. Furthermore, the experiments confirm IMOSA's ability to add diversity to the augmented dataset by applying multiple distinct masking patterns to the same original sentence, which adds considerable variety to the training dataset. IMOSA consistently outperforms the two key masked language model-based text data augmentation techniques and demonstrates robust performance on challenging NLP tasks. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
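The mask-and-substitute idea at the core of this style of augmentation can be sketched in a few lines. This is a toy illustration only, not the paper's method: IMOSA's sentence selection and semantic scoring of substitutions are replaced here by a caller-supplied `substitutions` table, a hypothetical stand-in for masked-language-model predictions filtered for closeness in meaning.

```python
import itertools

def augment_by_masking(sentence, substitutions, max_masks=2):
    """Generate synthetic sentences by replacing selected words.

    substitutions: dict mapping a word to candidate replacements.
    In IMOSA these candidates would come from a masked language model
    and be scored for semantic closeness; a lookup table stands in here.
    """
    tokens = sentence.split()
    # positions whose word has at least one known substitute
    maskable = [i for i, t in enumerate(tokens) if t in substitutions]
    variants = []
    # try every combination of up to max_masks positions
    for k in range(1, max_masks + 1):
        for positions in itertools.combinations(maskable, k):
            options = (substitutions[tokens[i]] for i in positions)
            for choices in itertools.product(*options):
                new = tokens[:]
                for i, word in zip(positions, choices):
                    new[i] = word
                variants.append(" ".join(new))
    return variants
```

Because the label of the original sentence is assumed to survive meaning-preserving substitutions, each variant can be paired with the original label to enlarge a training set.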
Cost effective annotation framework using zero-shot text classification
- Authors: Kasthuriarachchy, Buddhika; Chetty, Madhu; Shatte, Adrian; Walls, Darren
- Date: 2021
- Type: Text; Conference paper
- Relation: 2021 International Joint Conference on Neural Networks, IJCNN 2021 Vol. 2021-July
- Full Text: false
- Reviewed:
- Description: Manual, high-quality annotation of social media data has enabled companies and researchers to develop improved implementations using natural language processing. However, human text annotation is expensive and time-consuming. Crowd-sourcing platforms such as Amazon's Mechanical Turk (MTurk) can be leveraged to create large training corpora for text classification tasks using social media data. Nevertheless, the quality of annotations can vary significantly, depending on the interpretations and motivations of the annotators completing the tasks. Further, the cost of labelling data through MTurk increases if target messages are short and contain a significant amount of noise (e.g. promotional messages on Twitter). In this work, we propose a new annotation framework for creating high-quality human-annotated datasets for text classification from social media data. We present a zero-shot text classification-based pre-annotation technique that reduces the adverse effects arising from the highly skewed distribution of data across target classes. The proposed framework significantly reduces cost and time while maintaining the quality of the annotations. Being generic, it can be applied to annotating text data from any discipline. Our experiment annotating Twitter data with the proposed framework shows a cost reduction of 80% with no compromise in quality. © 2021 IEEE.
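The routing step of such a pre-annotation framework can be sketched as follows. This is a hedged toy version: `score_fn` is a hypothetical stand-in for a zero-shot classifier (e.g. an NLI-based model scoring each label), and the confidence threshold is illustrative, not taken from the paper.

```python
def pre_annotate(messages, score_fn, labels, threshold=0.7):
    """Split messages into auto-labelled and manual-review queues.

    score_fn(message, label) -> float in [0, 1], a stand-in for a
    zero-shot classifier's confidence that `label` applies.
    Confident predictions are pre-annotated automatically; the rest
    go to human annotators, shrinking the paid-annotation workload.
    """
    auto, manual = [], []
    for msg in messages:
        scores = {lab: score_fn(msg, lab) for lab in labels}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            auto.append((msg, best))  # accept the zero-shot label
        else:
            manual.append(msg)  # too uncertain: send to a human
    return auto, manual
```

Filtering out high-confidence noise classes (such as promotional messages) before paying for human labels is what drives the cost reduction reported in the abstract.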
From general language understanding to noisy text comprehension
- Authors: Kasthuriarachchy, Buddhika; Chetty, Madhu; Shatte, Adrian; Walls, Darren
- Date: 2021
- Type: Text; Journal article
- Relation: Applied Sciences (Switzerland) Vol. 11, no. 17 (2021), p.
- Full Text:
- Reviewed:
- Description: Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. To this end, we propose a new generic methodology that derives a diverse set of sentence vectors by combining and extracting various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets, in order to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks for studying noisy text comprehension. Experiments are carried out on classification accuracy by deriving sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers of the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence vector length is less important for capturing linguistic information, and that the proposed sentence vectors for noisy texts outperform the existing state-of-the-art sentence vectors. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
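Deriving a sentence vector from selected hidden layers can be sketched generically. This is a minimal illustration under assumptions: mean-pooling over tokens and concatenation across layers are one common recipe, not necessarily the exact combination the paper proposes, and the chosen layer indices are arbitrary examples of "initial and middle" layers.

```python
def sentence_vector(layer_states, layers=(1, 6)):
    """Combine chosen hidden layers into one sentence vector.

    layer_states: one entry per layer; each entry is a list of
    token vectors (lists of floats), as a model like BERT would
    produce. Each chosen layer is mean-pooled over its tokens,
    and the pooled vectors are concatenated.
    """
    vec = []
    for layer in layers:
        tokens = layer_states[layer]
        dim = len(tokens[0])
        # mean over tokens, dimension by dimension
        vec.extend(sum(tok[d] for tok in tokens) / len(tokens)
                   for d in range(dim))
    return vec
```

With a real encoder, `layer_states` would come from the model's per-layer hidden states; probing classifiers can then be trained on these vectors to test which layers carry which linguistic attributes.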
Pre-trained language models with limited data for intent classification
- Authors: Kasthuriarachchy, Buddhika; Chetty, Madhu; Karmakar, Gour; Walls, Darren
- Date: 2020
- Type: Text; Conference proceedings; Conference paper
- Relation: 2020 International Joint Conference on Neural Networks, IJCNN 2020
- Full Text: false
- Reviewed:
- Description: Intent analysis is attracting the attention of both industry and academia due to its commercial and noncommercial significance. The rapidly growing volume of unstructured data on micro-blogging platforms, such as Twitter and Facebook, is amongst the important sources for intent analysis. However, social media data are often noisy and diverse, making the task very challenging. Further, intent analysis frequently suffers from a lack of sufficient data, because labeled datasets are often manually annotated. Recently, BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language representation model, has attracted attention for accurate language modelling. In this paper, we investigate the suitability of BERT for intent analysis. We study fine-tuning of the BERT model through inductive transfer learning and investigate methods to overcome the challenges of limited data availability by proposing a novel semantic data augmentation approach. This technique generates synthetic sentences while preserving label compatibility using the semantic meaning of the sentences, to improve intent classification accuracy. Thus, based on these considerations for fine-tuning and data augmentation, a systematic and novel step-by-step methodology is presented for applying the linguistic model BERT to intent classification with limited data available. Our results show that the pre-trained language model can be used effectively with noisy social media data to achieve state-of-the-art accuracy in intent analysis under a low labeled-data regime. Moreover, our results also confirm that the proposed text augmentation technique is effective in eliminating noisy synthetic sentences, thereby achieving further performance improvements. © 2020 IEEE.
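The label-preserving filtering step described above (eliminating noisy synthetic sentences) can be sketched with a simple similarity gate. This is a toy under assumptions: bag-of-words cosine similarity stands in for the paper's semantic-meaning comparison, and the threshold is illustrative.

```python
import math
from collections import Counter

def _cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_augmented(original, candidates, threshold=0.5):
    """Keep only synthetic sentences close enough to the original.

    Candidates that drift too far from the source sentence are
    discarded, on the assumption that they may no longer carry the
    original intent label. A real system would compare embeddings
    rather than bag-of-words counts.
    """
    ref = Counter(original.split())
    return [c for c in candidates
            if _cosine(ref, Counter(c.split())) >= threshold]
```

Only the surviving candidates would be added, with the original intent label, to the fine-tuning set.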