Integrating Gene Ontology Based Grouping and Ranking into the Machine Learning Algorithm for Gene Expression Data Analysis

Yousef M., Sayici A., Bakir-Gungor B.

32nd International Conference on Database and Expert Systems Applications (DEXA), ELECTR NETWORK, 27 - 30 September 2021, vol.1479, pp.205-214

  • Publication Type: Conference Paper / Full Text
  • Volume: 1479
  • Doi Number: 10.1007/978-3-030-87101-7_20
  • Page Numbers: pp.205-214
  • Abdullah Gül University Affiliated: Yes


Recent advances in high-throughput technologies have resulted in the production of large gene expression datasets for several phenotypes. By comparing gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, or drug A vs. drug B, one can identify biomarkers. As opposed to traditional gene selection approaches, integrative gene selection approaches incorporate domain knowledge from external biological resources during gene selection, which improves interpretability and predictive performance. In this respect, Gene Ontology provides cellular component, molecular function, and biological process terms for the products of each gene. In this study, we present a Gene Ontology based feature selection approach for gene expression data analysis. In our approach, we use the ontology information as grouping (term) information and embed it into a machine learning algorithm to select the most significant ontology groups (terms). Those groups are then used to build the machine learning model that performs the classification task. The output of the tool is a significant ontology group for the two-class classification task applied to the gene expression data. This knowledge allows the researcher to perform more advanced gene expression analyses. We tested our approach on 8 different gene expression datasets. In our experiments, we observed that the tool successfully found significant ontology terms that could be used to build a classification model. We believe that our tool will help geneticists identify affected genes in transcriptomic data, and that this information could enable the design of platforms to assist diagnosis, assess patients' prognoses, and create patient treatment plans.
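The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of grouping genes by ontology term and ranking the groups by how well they separate two classes. Everything in it is an assumption made for illustration: the toy expression values, the placeholder term labels "GO:A" and "GO:B", the function names, and the choice of a leave-one-out nearest-centroid classifier as the scoring step. It is not the authors' tool or algorithm.

```python
from statistics import mean


def class_centroids(samples, labels, genes):
    """Mean expression of each gene in `genes`, computed per class."""
    centroids = {}
    for cls in sorted(set(labels)):
        rows = [s for s, l in zip(samples, labels) if l == cls]
        centroids[cls] = [mean(r[g] for r in rows) for g in genes]
    return centroids


def loo_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier
    that only sees the genes belonging to one group (term)."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        centroids = class_centroids(train_x, train_y, genes)
        # Squared Euclidean distance from the held-out sample to each centroid.
        distances = {
            cls: sum((x[g] - c) ** 2 for g, c in zip(genes, cent))
            for cls, cent in centroids.items()
        }
        predicted = min(sorted(distances), key=distances.get)
        correct += predicted == y
    return correct / len(samples)


def rank_groups(samples, labels, groups):
    """Score every group and return (term, score) pairs, best first."""
    scores = {term: loo_accuracy(samples, labels, genes)
              for term, genes in groups.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (hypothetical): six samples, four genes, two classes.
samples = [
    {"g1": 5.0, "g2": 5.2, "g3": 1.0, "g4": 2.0},
    {"g1": 5.5, "g2": 4.8, "g3": 2.0, "g4": 1.0},
    {"g1": 4.9, "g2": 5.1, "g3": 1.5, "g4": 2.5},
    {"g1": 1.0, "g2": 1.2, "g3": 1.2, "g4": 2.2},
    {"g1": 0.8, "g2": 1.5, "g3": 1.9, "g4": 1.1},
    {"g1": 1.3, "g2": 0.9, "g3": 1.4, "g4": 2.4},
]
labels = ["disease"] * 3 + ["control"] * 3

# Placeholder term labels, not real Gene Ontology identifiers.
groups = {"GO:A": ["g1", "g2"], "GO:B": ["g3", "g4"]}

ranked = rank_groups(samples, labels, groups)
for term, score in ranked:
    print(term, round(score, 2))
```

In this sketch each group is scored independently, and the top-ranked group's genes would then be the features passed to the final classifier. A real analysis would need actual gene-to-term annotations and a more robust scoring and validation scheme than leave-one-out on a handful of samples.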