Grey Box Models for Pathogenic Classification of Genomic Variants

Research Context : With the emergence of new high-throughput screening techniques, targeted, whole-exome and whole-genome screening are becoming standard diagnostic practice in clinical settings for identifying the variants behind a genetic disease. However, developing bioinformatics solutions for the pathogenic classification of these variants remains a major challenge, which makes the process laborious for geneticists and clinicians.

In the Machine Learning context, this challenge can be treated as a multi-class classification problem. A variant needs to be classified into one of five categories: Class I - Non-Pathogenic; Class II - VUS1 (unlikely pathogenic); Class III - VUS2 (unclear); Class IV - VUS3 (likely pathogenic); Class V - Pathogenic. However, the information currently available about variants across populations is limited, especially the amount of labelled pathogenic classification data. For this reason, a "Grey box model" architecture has been proposed in the BRiDGEIris Project. In a "Grey box model", interpretable white box models are placed after the black box models, so that the results of the accurate black box model can guide the data-hungry white box models. The idea is that the black box model, which classifies unseen data comparatively reliably, generates auxiliary labelled data that is then used to further train the white box model. The combined model is thus expected to improve accuracy while preserving the interpretability of the system under investigation.
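The grey box idea described above can be illustrated with a minimal, hedged sketch. Everything in it is an assumption made for illustration and not the BRiDGEIris implementation: a k-nearest-neighbour vote stands in for the black box, a single-threshold rule (decision stump) stands in for the white box, the five classes are collapsed to two (0 = non-pathogenic, 1 = pathogenic), and the two numeric features and all data points are made up.

```python
from collections import Counter

def knn_predict(train, x, k=3):
    """Stand-in 'black box': majority vote among the k nearest labelled points."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def fit_stump(data):
    """Stand-in 'white box': the rule 'feature j > t => class 1' with fewest errors."""
    best = None
    for j in range(len(data[0][0])):
        for t in sorted({x[j] for x, _ in data}):
            errors = sum((x[j] > t) != bool(y) for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, j, t)
    _, j, t = best
    return j, t  # human-readable rule: feature index and threshold

def stump_predict(stump, x):
    j, t = stump
    return int(x[j] > t)

def fit_grey_box(labelled, unlabelled, k=3):
    """Black box pseudo-labels the unlabelled points; white box trains on both sets."""
    pseudo = [(x, knn_predict(labelled, x, k)) for x in unlabelled]
    return fit_stump(labelled + pseudo)

# Hypothetical toy data: (feature vector, label); only six variants carry labels.
labelled = [
    ((0.1, 0.4), 0), ((0.2, 0.6), 0), ((0.3, 0.5), 0),
    ((0.7, 0.5), 1), ((0.8, 0.4), 1), ((0.9, 0.6), 1),
]
# Unlabelled variants that the black box will annotate.
unlabelled = [(v, 0.5) for v in
              (0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)]

white_stump = fit_stump(labelled)               # white box alone
grey_stump = fit_grey_box(labelled, unlabelled)  # white box guided by black box
print("white box only:", white_stump)
print("grey box:      ", grey_stump)
```

On this toy data the stump trained on the six labelled points alone splits feature 0 at 0.3, while the grey box stump, which also sees the pseudo-labelled points, splits it at 0.45. The final model is still a single readable rule, which is the interpretability the architecture aims to preserve. Whether pseudo-labels help on real variant data depends on how reliable the black box actually is.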

Research Setting : This work will focus on the implementation and evaluation of ensemble methods, as proposed in the BRiDGEIris Project, that can be used to analyze comorbid patterns based on a patient's exome variants together with the associated phenomic and clinical data for a particular genetic disease. The overall aim is to screen candidate ensemble predictive modelling techniques and identify an optimal one for the "pathogenic classification" of genomic variants (single nucleotide variants and short insertions/deletions).

You will be working on:

1. Developing a knowledge base of variants based on classification results obtained from GeVaCT for patients with Cardiac Arrhythmia Syndromes.

2. Implementing and evaluating Grey Box Models by combining different sets of black box and white box models.

3. Developing an interactive platform for clinicians, where they can select a combination of algorithms from the sets of black box and white box models for analyzing variant classes.

Promoter : Prof. Dr. Ann Nowé

Contact Person : For further queries or more information, contact Dr. Dipankar Sengupta or visit him at 10G.730.