Optimization based clustering and classification algorithms in analysis of microarray gene expression data sets
- Authors: Mardaneh, Karim
- Date: 2007
- Type: Text , Thesis , PhD
- Full Text:
- Description: Doctor of Philosophy
- Description: Bioinformatics and computational biology are relatively new areas that draw on techniques from computer science, informatics, biochemistry, applied mathematics and other fields to solve biological problems. In recent years the development of new molecular genetics technologies, such as DNA microarrays, has led to the simultaneous measurement of expression levels of thousands and even tens of thousands of genes. Microarray gene expression technology has facilitated the study of genomic structure and the investigation of biological systems. The numerical output of this technology takes the form of microarray gene expression data sets. These data sets contain a very large number of genes and a relatively small number of samples, and their precise analysis requires robust and suitable computer software. Because of this, only a few existing algorithms are applicable to them, so more efficient and computationally less expensive methods are required for solving clustering, gene selection and classification problems on gene expression data sets. The aim of this thesis is to develop new algorithms for solving clustering, gene selection and data classification problems on gene expression data sets. Clustering in gene expression data sets is a challenging problem. The increasing use of DNA microarray-based tumour gene expression profiles for cancer diagnosis requires more efficient methods to solve clustering problems of these profiles. Different algorithms for clustering of genes have been proposed; however, few algorithms can be applied to the clustering of samples. The k-means algorithm is among the very few clustering algorithms applicable to microarray gene expression data sets; however, it is not efficient for solving clustering problems when the number of genes is in the thousands, and it is very sensitive to the choice of a starting point.
Additionally, when the number of clusters is relatively large, this algorithm gives local minima which can differ significantly from the global solution. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm. One of them is the global k-means algorithm; however, this algorithm is not efficient when data are sparse. In this thesis we developed a new version of the global k-means algorithm, the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. In a microarray gene expression data set, in many cases only a small fraction of genes are informative, whereas most are non-informative and contribute noise. Therefore the development of gene selection algorithms that allow us to remove as many non-informative genes as possible is very important. In this thesis we developed a new overlapping gene selection algorithm. This algorithm is based on calculating overlaps of different genes. It considerably reduces the number of genes and is efficient in finding a subset of informative genes. Over the last decade different approaches have been proposed to solve supervised data classification problems in gene expression data sets. In this thesis we developed a new approach which is based on the so-called max-min separability and is compared with the other approaches. The max-min separability algorithm is equivalent to piecewise linear separability. An incremental algorithm is presented to compute piecewise linear functions separating two sets. This algorithm is applied along with a special gene selection algorithm. In this thesis, all new algorithms have been tested on 10 publicly available gene expression data sets, and our numerical results demonstrate the efficiency of the new algorithms that were developed in the framework of this research.
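The abstract above contrasts plain k-means, which depends on its starting point, with the global k-means algorithm. As a rough illustration only, the following sketch shows the generic incremental idea usually associated with global k-means: start from the centroid of all points, then add one centre at a time by trying every data point as the new centre's starting position and keeping the best result. It is not the modified algorithm developed in the thesis; the function names and the toy data are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, centers, iters=100):
    """Plain Lloyd iterations from the given starting centres."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # An empty group keeps its previous centre.
        new = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


def global_kmeans(points, k):
    """Incremental scheme: solve for 1 centre, then 2, ..., then k.

    At each step every data point is tried as the start of the new centre
    (hypothetical helper for illustration, quadratic in the number of points).
    """
    centers, cost = kmeans(points, [mean(points)])
    for _ in range(1, k):
        best = None
        for p in points:
            candidate = kmeans(points, centers + [p])
            if best is None or candidate[1] < best[1]:
                best = candidate
        centers, cost = best
    return centers, cost


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(global_kmeans(samples, 2))
```

Because each step reuses the centres found for the previous number of clusters, the result does not depend on a random start, at the price of many more k-means runs.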
A novel canonical dual computational approach for prion AGAAAAGA amyloid fibril molecular modeling
- Authors: Zhang, Jiapu , Gao, David , Yearwood, John
- Date: 2011
- Type: Text , Journal article
- Relation: Journal of Theoretical Biology Vol. 284, no. 1 (2011), p. 149-157
- Full Text: false
- Reviewed:
- Description: Many experimental studies have shown that the prion AGAAAAGA palindrome hydrophobic region (113-120) has amyloid fibril forming properties and plays an important role in prion diseases. However, due to the unstable, noncrystalline and insoluble nature of the amyloid fibril, structural information on the AGAAAAGA region (113-120) has to date been very limited. This region falls just within the N-terminal unstructured region PrP (1-123) of prion proteins. Traditional experimental methods, X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, cannot be used to obtain its structural information. Against this background, this paper introduces a novel approach based on canonical dual theory to address the 3D atomic-resolution structure of prion AGAAAAGA amyloid fibrils. The canonical dual computational approach introduced in this paper is intended for the molecular modeling of prion AGAAAAGA amyloid fibrils, and the optimal atomic-resolution structures of prion AGAAAAGA amyloid fibrils presented here are useful in the drive to find treatments for prion diseases in the field of medicinal chemistry. Overall, this paper presents an important method and provides useful information for treatments of prion diseases. © 2011.
Towards large scale genetic network modeling
- Authors: Khan, Rubaiya Rahtin , Chetty, Madhu
- Date: 2015
- Type: Text , Conference proceedings
- Full Text: false
- Description: Reverse Engineering Gene Regulatory Networks (GRNs) is an important and challenging problem of Systems Biology. For its superiority in both structure and parameter learning, the S-system model framework is often chosen for GRN reconstruction. The biggest challenge in reconstructing GRNs is that the data contain a large number of genes and only a small number of samples. This "curse of dimensionality", along with the large number of model parameters to be learnt, makes it extremely difficult to reverse engineer even a small network. For a medium or large network, the complexity becomes enormous. In this paper, we propose a method for managing large scale GRN modeling. As a first step, we propose an Affinity Propagation Based Clustering to identify appropriate clusters by grouping the genes based on their time expression profiles. In the second step, the largest cluster, consisting of the majority of the relevant genes, is considered in full detail to act as the core of the network, while the remaining, less significant clusters are each represented by a single representative gene to obtain a reduced order GRN. In the third step, we initialize the model parameters of the genes of the largest cluster with the near-optimal values obtained in the second step and proceed to optimize the entire network. The initial investigations are carried out using a previously reported 20-gene synthetic network. The superiority of performance is evaluated not only using the standard metrics, namely, sensitivity, specificity, precision and F-score, but also by average mean error and by comparing the time responses with those of the actual network parameters. The results obtained are promising. © 2015 IEEE.
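The standard metrics named in the abstract above (sensitivity, specificity, precision and F-score) can be computed by comparing the set of inferred regulatory edges with the set of true edges. The sketch below is only an illustration under stated assumptions: edges are directed (regulator, target) pairs, every ordered pair of genes including self-regulation counts as a possible edge, and the function name `edge_metrics` and the toy network are hypothetical, not taken from the paper.

```python
from itertools import product


def edge_metrics(genes, true_edges, predicted_edges):
    """Compare predicted directed edges with the true ones.

    Assumption: every ordered gene pair (self-loops included) is a
    candidate edge, so true negatives are the pairs absent from both sets.
    """
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    tp = fp = fn = tn = 0
    for pair in product(genes, repeat=2):
        actual, predicted = pair in true_edges, pair in predicted_edges
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero for degenerate networks.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # also called recall
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f_score = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


if __name__ == "__main__":
    genes = ["g1", "g2", "g3"]
    truth = {("g1", "g2"), ("g2", "g3")}
    inferred = {("g1", "g2"), ("g3", "g1")}
    print(edge_metrics(genes, truth, inferred))
```

Since true regulatory networks are sparse, specificity tends to be high for almost any method, which is one reason precision and F-score are usually reported alongside it.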
Best practice data life cycle approaches for the life sciences
- Authors: Griffin, Philippa , Khadake, Jyoti , LeMay, Kate , Lewis, Suzanna , Orchard, Sandra , Pask, Andrew , Pope, Bernard , Roessner, Ute , Russell, Keith , Seemann, Torsten , Treloar, Andrew , Tyagi, Sonika , Christiansen, Jeffrey , Dayalan, Saravanan , Gladman, Simon , Hangartner, Sandra , Hayden, Helen , Ho, William , Keeble-Gagnère, Gabriel , Korhonen, Pasi , Neish, Peter , Prestes, Priscilla , Richardson, Mark , Watson-Haigh, Nathan , Wyres, Kelly , Young, Neil , Schneider, Maria
- Date: 2018
- Type: Text , Journal article
- Relation: F1000 Research Vol. 6, no. (2018), p. 1-28
- Full Text:
- Reviewed:
- Description: Throughout history, the life sciences have been revolutionised by technological advances; in our era this is manifested by advances in instrumentation for data generation, and consequently researchers now routinely handle large amounts of heterogeneous data in digital formats. The simultaneous transitions towards biology as a data science and towards a 'life cycle' view of research data pose new challenges. Researchers face a bewildering landscape of data management requirements, recommendations and regulations, without necessarily being able to access data management training or possessing a clear understanding of practical approaches that can assist in data management in their particular research domain. Here we provide an overview of best practice data life cycle approaches for researchers in the life sciences/bioinformatics space with a particular focus on 'omics' datasets and computer-based data processing and analysis. We discuss the different stages of the data life cycle and provide practical suggestions for useful tools and resources to improve data management practices. © 2018 Griffin PC et al.
A review of analytical techniques and their application in disease diagnosis in breathomics and salivaomics research
- Authors: Beale, David , Jones, Oliver , Karpe, Avinash , Dayalan, Saravanan , Oh, Ding , Kouremenos, Konstantinos , Ahmed, Warish , Palombo, Enzo
- Date: 2017
- Type: Text , Journal article
- Relation: International Journal of Molecular Sciences Vol. 18, no. 1 (2017), p. 1-26
- Full Text:
- Reviewed:
- Description: The application of metabolomics to biological samples has been a key focus in systems biology research, which is aimed at the development of rapid diagnostic methods and the creation of personalized medicine. More recently, there has been a strong focus on applying this approach to non-invasively acquired samples, such as saliva and exhaled breath. The analysis of these biological samples, in conjunction with other sample types and traditional diagnostic tests, has resulted in faster and more reliable characterization of a range of health disorders and diseases. As the sampling process involved in collecting exhaled breath and saliva is non-intrusive as well as comparatively low-cost and uses a series of widely accepted methods, it provides researchers with easy access to the metabolites secreted by the human body. Owing to its accuracy and rapid nature, metabolomic analysis of saliva and breath (known as salivaomics and breathomics, respectively) is a rapidly growing field and has shown potential to be effective in detecting and diagnosing the early stages of numerous diseases and infections in preclinical studies. This review discusses the various collection and analysis methods currently applied to two of the least used non-invasive sample types in metabolomics, specifically their application in salivaomics and breathomics research. Some of the salient research completed in this field to date is also assessed and discussed in order to provide a basis to advocate their use and possible future scientific directions. © 2016 by the authors; licensee MDPI, Basel, Switzerland.
On the complexity and completeness of robust biclustering algorithm (ROBA)
- Authors: Ibrahim, Yousef , Noman, Nasimul , Iba, Hitoshi
- Date: 2010
- Type: Text , Conference proceedings
- Relation: 4th International Conference on Bioinformatics and Biomedical Engineering, iCBBE 2010; Chengdu; China; 18th- 20th June 2010 published in 2010 4th International Conference on Bioinformatics and Biomedical Engineering, iCBBE 2010
- Full Text: false
- Reviewed:
- Description: A biclustering algorithm named ROBA has been used in a number of recent works. We present a time and space efficient implementation of ROBA that reduces the time and space complexity by an order of L, where L is the number of distinct values present in the data. Our implementation runs almost 11 times faster than the existing implementation on the Yeast gene expression dataset. We also improve ROBA and then use it to present an iterative algorithm that can find all perfect biclusters with constant values, constant values on rows and constant values on columns. Though our algorithm may take exponential time in the worst case, we use some subtle observations to reduce computational time and space. Experimental results reveal that our algorithm runs in reasonable time on the Yeast gene expression dataset and finds almost 10 times more biclusters than ROBA. ©2010 IEEE.
Analysis of Classifiers for Prediction of Type II Diabetes Mellitus
- Authors: Barhate, Rahul , Kulkarni, Pradnya
- Date: 2018
- Type: Text , Conference proceedings , Conference paper
- Relation: 4th International Conference on Computing, Communication Control and Automation, ICCUBEA 2018
- Full Text:
- Reviewed:
- Description: Diabetes mellitus is a chronic disease and a health challenge worldwide. According to the International Diabetes Federation, 451 million people across the globe have diabetes, with this number anticipated to rise to 693 million people by 2045. It has been shown that 80% of the complications arising from type II diabetes can be prevented or delayed by early identification of the people who are at risk. Diabetes is difficult to diagnose in the early stages as its symptoms grow subtly and gradually. In a majority of cases, patients remain undiagnosed until they are admitted for a heart attack or begin to lose their sight. This paper analyzes different classification algorithms based on a patient's health history to help doctors identify the presence of diabetes and to promote early diagnosis and treatment. The experiments were conducted on the Pima Indian Diabetes data set. The classifiers used include K Nearest Neighbors, Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Support Vector Machine and Neural Network. Results demonstrate that Random Forests performed well on the data set, giving an accuracy of 79.7%. © 2018 IEEE.
A heuristic gene regulatory networks model for cardiac function and pathology
- Authors: Zarnegar, Armita , Vamplew, Peter , Stranieri, Andrew , Jelinek, Herbert
- Date: 2016
- Type: Text , Conference proceedings
- Relation: 2016 Computing in Cardiology Conference (CinC); Vancouver; 11-14th Sept, 2016
- Full Text: false
- Reviewed:
- Description: Genome-wide association studies (GWAS) and next-generation sequencing (NGS) have led to an increase in information about the human genome and cardiovascular disease. Understanding the role of genes in cardiac function and pathology requires modeling gene interactions and identification of regulatory genes as part of a gene regulatory network (GRN). Feature selection and data reduction are not sufficient and require domain knowledge to deal with large data. We propose three novel innovations in constructing a GRN based on heuristics: a 2D Visualised Co-regulation function, post-processing to identify gene-gene interactions, and a threshold algorithm applied to identify the hub genes that provide the backbone of the GRN. The 2D Visualised Co-regulation function performed significantly better than Pearson's correlation for measuring pairwise associations (t=3.46, df=5, p=0.018). The F-measure improved from 0.11 to 0.12. The hub network provided a 60% improvement over that reported in the literature. The performance of the hub network was then also compared against ARACNe and performed significantly better (p=0.024). We conclude that a heuristics approach in developing GRNs has potential to improve our understanding of gene regulation and interaction in diverse biological function and disease.
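The abstract above uses Pearson's correlation as the baseline measure of pairwise association and applies a threshold to identify hub genes. The following sketch illustrates only that baseline idea, not the authors' 2D Visualised Co-regulation function or their threshold algorithm: two genes are linked when the absolute Pearson correlation of their expression profiles reaches a cut-off, and a gene counts as a hub when it has enough links. The names `pearson` and `hub_genes`, the cut-off values and the toy profiles are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0


def hub_genes(profiles, r_min=0.9, min_degree=2):
    """Return genes linked to at least `min_degree` others.

    Assumption: a link exists when |r| >= r_min; both cut-offs are
    illustrative values, not taken from the paper.
    """
    degree = {gene: 0 for gene in profiles}
    for a, b in combinations(profiles, 2):
        if abs(pearson(profiles[a], profiles[b])) >= r_min:
            degree[a] += 1
            degree[b] += 1
    return sorted(g for g, d in degree.items() if d >= min_degree)


if __name__ == "__main__":
    toy = {
        "A": [1, 2, 3, 4],
        "B": [2, 4, 6, 8],
        "C": [4, 3, 2, 1],
        "D": [1, -1, 1, -1],
    }
    print(hub_genes(toy))
```

Using the absolute value treats strong negative correlation (possible repression) as an association as well; a sketch that only cared about co-activation would drop the `abs`.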
Machine learning in mental health: a scoping review of methods and applications
- Authors: Shatte, Adrian , Hutchinson, Delyse , Teague, Samantha
- Date: 2019
- Type: Text , Journal article
- Relation: Psychological Medicine Vol. 49, no. 9 (2019), p. 1426-1448
- Full Text: false
- Reviewed:
- Description: This paper aims to synthesise the literature on machine learning (ML) and big data applications for mental health, highlighting current research and applications in practice. We employed a scoping review methodology to rapidly map the field of ML in mental health. Eight health and information technology research databases were searched for papers covering this domain. Articles were assessed by two reviewers, and data were extracted on the article's mental health application, ML technique, data type, and study results. Articles were then synthesised via narrative review. Three hundred papers focusing on the application of ML to mental health were identified. Four main application domains emerged in the literature: (i) detection and diagnosis; (ii) prognosis, treatment and support; (iii) public health; and (iv) research and clinical administration. The most common mental health conditions addressed included depression, schizophrenia, and Alzheimer's disease. ML techniques used included support vector machines, decision trees, neural networks, latent Dirichlet allocation, and clustering. Overall, the application of ML to mental health has demonstrated a range of benefits across the areas of diagnosis, treatment and support, research, and clinical administration. With the majority of studies identified focusing on the detection and diagnosis of mental health conditions, it is evident that there is significant room for the application of ML to other areas of psychology and mental health. The challenges of using ML techniques are discussed, as well as opportunities to improve and advance the field.