- Title
- From general language understanding to noisy text comprehension
- Creator
- Kasthuriarachchy, Buddhika; Chetty, Madhu; Shatte, Adrian; Walls, Darren
- Date
- 2021
- Type
- Text; Journal article
- Identifier
- http://researchonline.federation.edu.au/vital/access/HandleResolver/1959.17/179717
- Identifier
- vital:15654
- Identifier
- https://doi.org/10.3390/app11177814
- Identifier
- ISSN: 2076-3417
- Abstract
- Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. To this end, we propose a new generic methodology to derive a diverse set of sentence vectors that combine and extract various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks to study noisy text comprehension. Experiments are carried out for classification accuracy by deriving the sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers from the BERT model. We show that the initial and middle layers of BERT have better capability for capturing the key linguistic characteristics of noisy texts than its later layers. With complex predictive models, we further show that the sentence vector length matters less for capturing linguistic information, and the proposed sentence vectors for noisy texts perform better than the existing state-of-the-art sentence vectors. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
- Publisher
- MDPI
- Relation
- Applied Sciences (Switzerland) Vol. 11, no. 17 (2021), p.
- Rights
- All metadata describing materials held in, or linked to, the repository is freely available under a CC0 licence
- Rights
- https://creativecommons.org/licenses/by/4.0/
- Rights
- Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
- Rights
- Open Access
- Subject
- 0102 Applied Mathematics; 0204 Condensed Matter Physics; Language understanding; Noisy text; Probing tasks; Sentence representation
- Full Text
- Reviewed
- Funder
- This research is supported by Global Hosts Pty Ltd. trading as SportsHosts, a Melbourne-based company.
File | Description | Size | Format
---|---|---|---
SOURCE1 | Published version | 462 KB | Adobe Acrobat PDF