Applications of machine learning for linguistic analysis of texts
- Authors: Torney, Rosemary , Yearwood, John , Vamplew, Peter , Kelarev, Andrei
- Date: 2012
- Type: Text , Book chapter
- Relation: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques p. 133-148
- Full Text: false
- Reviewed:
- Description: This chapter describes a novel multistage method for linguistic clustering of large collections of texts available on the Internet as a precursor to linguistic analysis of these texts. This method addresses the practicalities of applying clustering operations to a very large set of text documents by using a combination of unsupervised clustering and supervised classification. The method relies on creating a multitude of independent clusterings of a randomized sample selected from the International Corpus of Learner English. Several consensus functions and sophisticated algorithms are applied in two substages to combine these independent clusterings into one final consensus clustering, which is then used to train fast classifiers in order to enable them to perform the profiling of very large collections of text and web data. This approach makes it possible to apply advanced highly accurate and sophisticated clustering techniques by combining them with fast supervised classification algorithms. For the effectiveness of this multistage method it is crucial to determine how well the supervised classification algorithms are going to perform at the final stage, when they are used to process large data sets available on the Internet. This performance may also serve as an indication of the quality of the combined consensus clustering obtained in the preceding stages. The authors' experimental results compare the performance of several classification algorithms incorporated in this multistage scheme and demonstrate that several of these classification algorithms achieve very high precision and recall and can be used in practical implementations of their method.
Using psycholinguistic features for profiling first language of authors
- Authors: Torney, Rosemary , Vamplew, Peter , Yearwood, John
- Date: 2012
- Type: Text , Journal article
- Relation: Journal of the American Society for Information Science and Technology Vol. 63, no. 6 (2012), p. 1256-1269
- Full Text: false
- Reviewed:
- Description: This study empirically evaluates the effectiveness of different feature types for the classification of the first language of an author. In particular, it examines the utility of psycholinguistic features, extracted by the Linguistic Inquiry and Word Count (LIWC) tool, that have not previously been applied to the task of author profiling. As LIWC is a tool that has been developed in the psycholinguistic field rather than the computational linguistics field, it was hypothesized that it would be effective, both as a single type feature set because of its psycholinguistic basis, and in combination with other feature sets, because it should be sufficiently different to add insight rather than redundancy. It was found that LIWC features were competitive with previously used feature types in identifying the first language of an author, and that combined feature sets including LIWC features consistently showed better accuracy rates and average F measures than were achieved by the same feature sets without the LIWC features. As a secondary issue, this study also examined how effectively first language classification scaled up to a larger number of possible languages. It was found that the classification scheme scaled up effectively to the entire 16 language collection from the International Corpus of Learner English, when compared with results achieved on just 5 languages in previous research. © 2012 ASIS&T.
Optimization and matrix constructions for classification of data
- Authors: Kelarev, Andrei , Yearwood, John , Vamplew, Peter , Abawajy, Jemal , Chowdhury, Morshed
- Date: 2011
- Type: Journal article
- Relation: New Zealand Journal of Mathematics Vol. 41, no. 2011 (2011), p. 65-73
- Full Text:
- Reviewed:
- Description: Max-plus algebras and more general semirings have many useful applications and have been actively investigated. On the other hand, structural matrix rings are also well known and have been considered by many authors. The main theorem of this article completely describes all optimal ideals in the more general structural matrix semirings. Originally, our investigation of these ideals was motivated by applications in data mining for the design of multiple classification systems combining several individual classifiers.
Reinforcement learning approach to AIBO robot's decision making process in Robosoccer's goal keeper problem
- Authors: Mukherjee, Subhasis , Yearwood, John , Vamplew, Peter , Huda, Shamsul
- Date: 2011
- Type: Text , Conference proceedings
- Full Text: false
- Description: Robocup is a popular test bed for AI programs around the world. Robosoccer is one of the two major parts of Robocup, in which AIBO entertainment robots take part in the middle sized soccer event. The three key challenges that robots need to face in this event are manoeuvrability, image recognition and decision making skills. This paper focuses on one decision making problem in Robosoccer: the goal keeper problem. We investigate whether reinforcement learning (RL), as a form of semi-supervised learning, can effectively contribute to the goal keeper's decision making process when the penalty shot and two-attacker problems are considered. Currently, the decision making process in Robosoccer is carried out using a rule-based system. RL is also used for quadruped locomotion and navigation in Robosoccer using AIBO. In this paper, we propose a reinforcement learning based approach that uses a dynamic state-action mapping using back propagation of reward and space quantized Q-learning (SQQL) for the choice of high level functions in order to save the goal. The novelty of our approach is that the agent learns while playing and can take independent decisions, which overcomes the limitations of a rule-based system due to fixed and limited predefined decision rules. Performance of the proposed method has been verified against the benchmark data set made with Upenn'03 code logic. It was found that the efficiency of our SQQL approach in goalkeeping was better than that of the rule-based approach. The SQQL develops a semi-supervised learning process over the rule-based system's input-output mapping process, given in the Upenn'03 code. © 2011 IEEE.
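For readers unfamiliar with the Q-learning family that SQQL belongs to, the following is a minimal sketch of the standard tabular Q-learning update, not the SQQL algorithm of the paper. The state names, action names, reward value and hyperparameters are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical high-level goalkeeper actions (not taken from the paper).
ACTIONS = ["stay", "dive_left", "dive_right"]


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update for one observed transition."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])


def choose_action(q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


# Q-table with all values initialised to zero.
q = defaultdict(float)

# One made-up transition: the ball came from the left, the keeper dived left
# and saved it (reward 1.0).
q_update(q, "ball_left", "dive_left", reward=1.0, next_state="saved")

# With exploration switched off, the learned value now favours diving left.
greedy = choose_action(q, "ball_left", epsilon=0.0)
```

Unlike a fixed rule table, the values in `q` change with every observed transition, which is the general property the abstract contrasts with rule-based decision making.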
Automated opinion detection : Implications of the level of agreement between human raters
- Authors: Osman, Deanna , Yearwood, John , Vamplew, Peter
- Date: 2010
- Type: Text , Journal article
- Relation: Information Processing and Management Vol. 46, no. 3 (2010), p. 331-342
- Full Text: false
- Reviewed:
- Description: The ability to agree with the TREC Blog06 opinion assessments was measured for seven human assessors and compared with the submitted results of the Blog06 participants. The assessors achieved a fair level of agreement between their assessments, although the range between the assessors was large. It is recommended that multiple assessors are used to assess opinion data, or a pre-test of assessors is completed to remove the most dissenting assessors from a pool of assessors prior to the assessment process. The possibility of inconsistent assessments in a corpus also raises concerns about training data for an automated opinion detection system (AODS), so a further recommendation is that AODS training data be assembled from a variety of sources. This paper establishes an aspirational value for an AODS by determining the level of agreement achievable by human assessors when assessing the existence of an opinion on a given topic. Knowing the level of agreement amongst humans is important because it sets an upper bound on the expected performance of AODS. While the AODSs surveyed achieved satisfactory results, none achieved a result close to the upper bound. © 2009 Elsevier Ltd. All rights reserved.
A polynomial ring construction for the classification of data
- Authors: Kelarev, Andrei , Yearwood, John , Vamplew, Peter
- Date: 2009
- Type: Text , Journal article
- Relation: Bulletin of the Australian Mathematical Society Vol. 79, no. 2 (2009), p. 213-225
- Full Text:
- Reviewed:
- Description: Drensky and Lakatos (Lecture Notes in Computer Science, 357 (Springer, Berlin, 1989), pp. 181-188) have established a convenient property of certain ideals in polynomial quotient rings, which can now be used to determine error-correcting capabilities of combined multiple classifiers following a standard approach explained in the well-known monograph by Witten and Frank (Data Mining: Practical Machine Learning Tools and Techniques (Elsevier, Amsterdam, 2005)). We strengthen and generalise the result of Drensky and Lakatos by demonstrating that the corresponding nice property remains valid in a much larger variety of constructions and applies to more general types of ideals. Examples show that our theorems do not extend to larger classes of ring constructions and cannot be simplified or generalised.
Applying clustering and ensemble clustering approaches to phishing profiling
- Authors: Webb, Dean , Yearwood, John , Vamplew, Peter , Ma, Liping , Ofoghi, Bahadorreza , Kelarev, Andrei
- Date: 2009
- Type: Text , Conference paper
- Relation: Paper presented at Eighth Australasian Data Mining Conference, AusDM 2009, University of Melbourne, Melbourne, Victoria : 1st–4th December 2009
- Full Text:
- Description: 2003007911
MRF model based unsupervised color textured image segmentation using multidimensional spatially variant finite mixture model
- Authors: Islam, Mofakharul , Vamplew, Peter , Yearwood, John
- Date: 2009
- Type: Text , Book chapter
- Relation: Technological developments in Education and Automation p. 375-380
- Full Text: false
- Reviewed:
- Description: We investigate and propose a novel approach to implement an unsupervised color image segmentation model that segments a color image meaningfully and partitions it into its constituent parts automatically. The aim is to devise a robust unsupervised segmentation approach that can segment a color textured image more accurately. Here, the color and texture information of each individual pixel, along with the spatial relationships within its neighborhood, has been considered to produce more accurate segmentation. In this particular work, the problem we want to investigate is to implement a robust unsupervised Multidimensional Spatially Variant Finite Mixture Model (MSVFMM) based color image segmentation approach using Cluster Ensembles and an MRF model, along with Daubechies wavelet transforms to increase the content sensitivity of the segmentation model and so obtain better segmentation accuracy. Here, Cluster Ensemble has been utilized as a robust automatic tool for finding the number of components in an image. The main idea behind this work is introducing a Bayesian inference based approach to estimate the Maximum a Posteriori (MAP) to identify the different objects/components in a color image. Markov Random Field (MRF) plays a crucial role in capturing the relationships among the neighboring pixels. An Expectation Maximization (EM) model fitting MAP algorithm segments the image utilizing the pixel’s color and texture features and the captured neighborhood relationships among them. The algorithm simultaneously calculates the model parameters and segments the pixels iteratively in an interleaved manner. Finally, it converges to a solution where the model parameters and pixel labels are stabilized within a specified criterion. We have compared our results with another recent segmentation approach [10], which is similar in nature. The experimental results reveal that the proposed approach is capable of producing more accurate and faithful segmentation and can be employed in different practical image content understanding applications.
Unsupervised segmentation of Industrial Images using Markov Random Field Model
- Authors: Islam, Mofakharul , Yearwood, John , Vamplew, Peter
- Date: 2009
- Type: Text , Book chapter
- Relation: Technological Developments in Education and Automation p. 369-374
- Full Text: false
- Reviewed:
- Description: We propose a novel approach to investigate and implement unsupervised image content understanding and segmentation of color industrial images like medical imaging, forensic imaging, security and surveillance imaging, biotechnical imaging, biometrics, mineral and mining imaging, material science imaging, and many more. In this particular work, our focus will be on medical images only. The aim is to develop a computer aided diagnosis (CAD) system based on a newly developed Multidimensional Spatially Variant Finite Mixture Model (MSVFMM) using Markov Random Fields (MRF) Model. Unsupervised means automatic discovery of classes or clusters in images rather than generating the class or cluster descriptions from training image sets. The aim of this work is to produce precise segmentation of color medical images on the basis of subtle color and texture variation. Finer segmentation of images has tremendous potential in medical imaging where subtle information related to color and texture is required to analyze the image accurately. In this particular work, we have used CIE-Luv and Daubechies wavelet transforms as color and texture descriptors respectively. Using the combined effect of a CIE-Luv color model and Daubechies transforms, we can segment color medical images precisely in a meaningful manner. The evaluation of the results is done through comparison of the segmentation quality with another similar alternative approach and it is found that the proposed approach is capable of producing more faithful segmentation.
Weblogs for market research : Finding more relevant opinion documents using system fusion
- Authors: Osman, Deanna , Yearwood, John , Vamplew, Peter
- Date: 2009
- Type: Text , Journal article
- Relation: Online Information Review Vol. 33, no. 5 (2009), p. 873-888
- Full Text: false
- Reviewed:
- Description: Purpose - The purpose of this paper is to examine the usefulness of fusion as a means of improving the precision of automated opinion detection. Design/methodology/approach - Five system fusion methods are proposed and tested using runs submitted by the Text REtrieval Conference (TREC) Blog06 participants as input. The methods include a voting method, an inverse rank method (IRM), a linear-normalised score method and two weighted methods that use a weighted IRM score to rank the document. Findings - Mean average precision (MAP) is used as an indicator of the performance of the runs in this study. The best system fusion method achieves a 55.5 percent higher MAP result compared with the highest MAP result of any individual run submitted by the Blog06 participants. This equates to an increase in detection of 2,398 relevant opinion documents (21 percent). Practical implications - System fusion can be used to improve upon the results achieved by existing individual opinion detection systems. On the other hand, multiple opinion detection approaches can be combined into one system and fusion used to combine the results to build in diversity. Diversity within fusion inputs can increase the improvements achieved by fusion methods. The improved output from a diverse opinion detection system will then contain a higher number of relevant documents and reduce the incidence of high-ranking non-relevant documents and low-ranking relevant documents. Originality/value - The fusion methods proposed in this study demonstrate that simple fusion of opinion detection systems can improve performance.
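The abstract names a voting method and an inverse rank method (IRM) without giving their formulas. The sketch below shows one common way such fusion can be written, assuming that voting counts the runs that retrieved a document and that IRM sums 1/rank across runs. Both scoring rules, the function names and the example runs are assumptions for illustration, not the authors' exact definitions.

```python
from collections import defaultdict


def vote_fusion(runs):
    """Rank documents by how many runs retrieved them (assumed voting rule)."""
    votes = defaultdict(int)
    for run in runs:
        for doc_id in set(run):
            votes[doc_id] += 1
    # Highest vote count first; ties broken by document id for determinism.
    return sorted(votes, key=lambda d: (-votes[d], d))


def inverse_rank_fusion(runs):
    """Rank documents by the sum of 1/rank over all runs (assumed IRM rule)."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / rank
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three made-up ranked lists, best document first.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]

by_votes = vote_fusion(runs)
by_inverse_rank = inverse_rank_fusion(runs)
```

In this toy input `d1` and `d2` both appear in all three runs, so voting cannot separate them, while the inverse rank score places `d2` first because it is ranked higher on average.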
On the limitations of scalarisation for multi-objective reinforcement learning of Pareto fronts
- Authors: Vamplew, Peter , Yearwood, John , Dazeley, Richard , Berry, Adam
- Date: 2008
- Type: Text , Conference paper
- Relation: Paper presented at 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand : 1st-5th December 2008 Vol. 5360, p. 372-378
- Full Text: false
- Description: Multiobjective reinforcement learning (MORL) extends RL to problems with multiple conflicting objectives. This paper argues for designing MORL systems to produce a set of solutions approximating the Pareto front, and shows that the common MORL technique of scalarisation has fundamental limitations when used to find Pareto-optimal policies. The work is supported by the presentation of three new MORL benchmarks with known Pareto fronts.
- Description: 2003006504
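One well-known limitation of linear (weighted-sum) scalarisation is that it cannot select Pareto-optimal solutions lying in a concave region of the front. The sketch below illustrates that general fact with three made-up policy returns; whether this is exactly the argument developed in the paper is an assumption, and the numbers are not taken from its benchmarks.

```python
# Hypothetical expected returns (objective 1, objective 2) for three policies.
# Both objectives are maximised. The values are invented for illustration.
POLICIES = {"A": (0.0, 1.0), "B": (0.4, 0.4), "C": (1.0, 0.0)}


def dominates(p, q):
    """True if p is at least as good as q on every objective and differs from q."""
    return p != q and all(pi >= qi for pi, qi in zip(p, q))


def pareto_front(policies):
    """Names of the policies that no other policy dominates."""
    return {
        name
        for name, ret in policies.items()
        if not any(dominates(other, ret) for other in policies.values())
    }


def linear_scalarisation_winners(policies, steps=100):
    """Policies selected by maximising w*r1 + (1-w)*r2 for w on a grid over [0, 1]."""
    winners = set()
    for i in range(steps + 1):
        w = i / steps
        best = max(
            policies,
            key=lambda n: w * policies[n][0] + (1 - w) * policies[n][1],
        )
        winners.add(best)
    return winners


front = pareto_front(POLICIES)
winners = linear_scalarisation_winners(POLICIES)
missed = front - winners
```

All three policies are non-dominated, yet policy `B` always scores 0.4 under the weighted sum while the better of `A` and `C` scores at least 0.5, so no weight ever selects `B`.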
Unsupervised color textured image segmentation using cluster ensembles and MRF model
- Authors: Islam, Mofakharul , Yearwood, John , Vamplew, Peter
- Date: 2008
- Type: Text , Book chapter
- Relation: Advances in computer and information sciences and engineering p. 323-328
- Full Text: false
- Reviewed:
- Description: We propose a novel, robust unsupervised approach to color image content understanding that segments a color image into its constituent parts automatically. The aim of this work is to produce precise segmentation of color images using color and texture information along with neighborhood relationships among image pixels, which will provide more accuracy in segmentation. Here, unsupervised means automatic discovery of classes or clusters in images rather than generating the class or cluster descriptions from training image sets. As a whole, in this particular work, the problem we want to investigate is to implement a robust unsupervised SVFM model based color medical image segmentation tool using Cluster Ensembles and an MRF model, along with wavelet transforms for increasing the content sensitivity of the segmentation model. In addition, Cluster Ensemble has been utilized for introducing a robust technique for finding the number of components in an image automatically. The experimental results reveal that the proposed tool is able to find the accurate number of objects or components in a color image and is capable of producing more accurate and faithful segmentation. A statistical model based approach has been developed to estimate the Maximum a posteriori (MAP) to identify the different objects/components in a color image. The approach utilizes a Markov Random Field model to capture the relationships among the neighboring pixels and integrates that information into the Expectation Maximization (EM) model fitting MAP algorithm. The algorithm simultaneously calculates the model parameters and segments the pixels iteratively in an interleaved manner. Finally, it converges to a solution where the model parameters and pixel labels are stabilized within a specified criterion. We have compared our results with another well-known segmentation approach.
Weblogs for market research : Improving opinion detection using system fusion
- Authors: Osman, Deanna , Yearwood, John , Vamplew, Peter
- Date: 2008
- Type: Text , Conference paper
- Relation: Paper presented at International Conference on Service Systems and Service Management, 2008, Melbourne, Victoria : 30th June - 2nd July 2008 p. 1-6
- Full Text:
- Description: Searching for opinions on a specific product or service within blogs is a new frontier for market researchers. This research investigates the use of system fusion methods to improve mean average precision (MAP) results achieved by the Text REtrieval Conference (TREC) Blog06 participants and reports the improved MAP results. It is hypothesized that diversity of the inputs is vital to maximising the MAP improvements. This is shown in the improvement in MAP values achieved by some of the participants' ranked lists. The growth in the number of blog authors who write valuable opinions about their life experiences has led to an unsolicited resource of opinions on products, politics and services. In 2006, TREC collected blogs and set a task of detecting opinions on given topics to their participants, reporting the results using MAP.
- Description: 2003007757
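Since the results above are reported as mean average precision (MAP), the following sketch shows the standard definition: average precision for one topic is the mean of the precision values at each rank where a relevant document appears, divided by the number of relevant documents, and MAP is the mean over topics. The function names, topic ids and document ids are hypothetical, and the official TREC evaluation tooling may differ in details such as how unjudged documents are handled.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k  # precision at rank k
    return precision_sum / len(relevant)


def mean_average_precision(ranked_by_topic, relevant_by_topic):
    """Mean of the per-topic average precision values."""
    aps = [
        average_precision(ranked_by_topic[topic], relevant_by_topic[topic])
        for topic in ranked_by_topic
    ]
    return sum(aps) / len(aps) if aps else 0.0


# Made-up rankings and relevance judgements for two topics.
ranked_by_topic = {
    "t1": ["d1", "d2", "d3", "d4"],
    "t2": ["d5", "d6"],
}
relevant_by_topic = {
    "t1": {"d1", "d3"},
    "t2": {"d6"},
}

ap_t1 = average_precision(ranked_by_topic["t1"], relevant_by_topic["t1"])
map_score = mean_average_precision(ranked_by_topic, relevant_by_topic)
```

For topic `t1` the relevant documents sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; for `t2` the single relevant document is at rank 2, giving 1/2.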
Using corpus analysis to inform research into opinion detection in blogs
- Authors: Osman, Deanna , Yearwood, John , Vamplew, Peter
- Date: 2007
- Type: Text , Conference paper
- Relation: Paper presented at Sixth Australasian Data Mining Conference, AusDM 2007, Gold Coast, Queensland : 3rd-4th December 2007 p. 65-75
- Full Text:
- Description: Opinion detection research relies on labeled documents for training data, either by assumptions based on the document's origin or by using human assessors to categorise the documents. In recent years, blogs have become a source for opinion identification research (TREC Blog06). This study analyses the part-of-speech proportion and the words used within various corpora, determining key differences and similarities useful when preparing for opinion identification research. The resulting comparisons between the characteristics of the various corpora are detailed and discussed. In particular, opinion-bearing and non-opinion Blog06 documents were found to display a high level of similarity, indicating that blog documents assessed at the document level cannot be used as training data in opinion identification research.
- Description: 2003004892