Modeling the undiagnosed HIV epidemic



The human immunodeficiency virus (HIV) has a devastating effect on human health. HIV treatment is an effective tool for controlling virus replication and thereby prevents patients from getting sick. Equally important, it significantly decreases the patient's infectiousness, and with it the likelihood that the individual causes new infections. Therapy is therefore an important preventive measure, and it is imperative that newly infected individuals are diagnosed as soon as possible.

Research question

To optimize diagnostic strategies, it is important to improve our understanding of the undiagnosed proportion of HIV patients and of the time it takes for HIV-positive individuals to get diagnosed. A new technique to estimate these epidemic properties was recently presented by van Sighem et al. [1]. Their method extends an epidemiological compartment model that describes HIV progression by representing the different disease stages. The model is fitted to surveillance data, more specifically to a marker for disease progression (i.e. CD4 cell count). Although this surveillance data is only available for diagnosed patients, the fitted model can be used to infer statistics on the undiagnosed HIV epidemic over time.

Figure: The epidemiological compartment model presented by van Sighem et al. [1]

You will implement this model and fit it to Portuguese surveillance data. You will then compare the model's output to current estimates of the diagnosed and undiagnosed HIV epidemic.
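To give a feel for what implementing such a compartment model involves, the sketch below simulates a heavily simplified chain of undiagnosed disease stages. It is an illustration only and not the model of [1]: the number of stages, all rates, the constant incidence, and the names `simulate`, `progression` and `diagnosis` are hypothetical choices, and mortality is ignored. Newly infected individuals enter the first stage, move to the next stage at a progression rate, and leave the undiagnosed population at a stage-specific diagnosis rate.

```python
def simulate(incidence, progression, diagnosis, years, dt=0.01):
    """Euler integration of a linear chain of undiagnosed compartments.

    incidence:   function t -> new infections per year (hypothetical input)
    progression: yearly rate of moving from stage i to stage i + 1
                 (one entry fewer than `diagnosis`; the last stage has no successor)
    diagnosis:   yearly rate of being diagnosed while in stage i
    Returns (undiagnosed per stage, cumulative diagnoses per stage).
    """
    n = len(diagnosis)
    undiagnosed = [0.0] * n
    diagnosed = [0.0] * n
    for step in range(int(round(years / dt))):
        inflow = incidence(step * dt)  # infections entering the first stage
        updated = []
        for i in range(n):
            leaving_by_progression = progression[i] * undiagnosed[i] if i < n - 1 else 0.0
            leaving_by_diagnosis = diagnosis[i] * undiagnosed[i]
            updated.append(
                undiagnosed[i]
                + dt * (inflow - leaving_by_progression - leaving_by_diagnosis)
            )
            diagnosed[i] += dt * leaving_by_diagnosis
            inflow = leaving_by_progression  # feeds the next stage
        undiagnosed = updated
    return undiagnosed, diagnosed


# Hypothetical parameters, chosen only to make the example run.
undiagnosed, diagnosed = simulate(
    incidence=lambda t: 100.0,          # constant 100 infections per year
    progression=[0.3, 0.3, 0.3],        # four stages, three transitions
    diagnosis=[0.1, 0.15, 0.25, 0.6],   # diagnosis becomes likelier in later stages
    years=10,
)
total = sum(undiagnosed) + sum(diagnosed)
undiagnosed_fraction = sum(undiagnosed) / total
print("undiagnosed fraction:", undiagnosed_fraction)
print("diagnoses by stage at diagnosis:", diagnosed)
```

In a fitting procedure, the simulated diagnoses per stage would be compared with the observed number of diagnoses per CD4 category, and the incidence and diagnosis rates adjusted until the two agree. The undiagnosed compartments are never observed directly; they follow from the fitted rates.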

While the model presented in [1] is promising, it has some limitations that we believe might be mitigated by the use of Bayesian skyline plots [2]. Bayesian skyline plots make it possible to infer trends in population size over time from genetic sequence data. Genetic sequences are commonly available in the context of HIV, since a virus sequence is determined for each HIV patient as part of the standard of care.

Figure: A Bayesian skyline plot derived from an alignment of Egyptian HCV sequences (Drummond et al. [2])
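The intuition behind skyline plots can be illustrated with the much simpler classic skyline estimator, which the Bayesian skyline plot builds upon. This is not the Bayesian method of [2] and not what BEAST computes; it is a sketch that assumes a single, known genealogy with all sequences sampled at the same time. Under the coalescent, the expected waiting time while k lineages remain is the population size divided by k(k-1)/2, so each waiting time yields one population size estimate. The function name `classic_skyline` and the example intervals are hypothetical.

```python
def classic_skyline(interval_lengths):
    """Classic skyline estimate of population size per coalescent interval.

    interval_lengths[j] is the waiting time during which n - j lineages
    remain, ordered from the tips (n lineages) towards the root (2 lineages).
    Estimates are in the same time units as the intervals, i.e. they are
    population sizes scaled by generation time.
    """
    n = len(interval_lengths) + 1  # number of sampled sequences
    estimates = []
    for j, waiting_time in enumerate(interval_lengths):
        k = n - j  # lineages present during this interval
        estimates.append(waiting_time * k * (k - 1) / 2)
    return estimates


# Hypothetical intervals for 5 sequences, set to their expected values
# under a constant scaled population size of 100.
intervals = [100 / 10, 100 / 6, 100 / 3, 100 / 1]
estimates = classic_skyline(intervals)
print(estimates)
```

Because each estimate rests on a single waiting time, the classic skyline is very noisy on real genealogies. The Bayesian skyline plot addresses this by grouping intervals and by averaging over the uncertainty in the genealogy itself.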

You will infer a skyline plot from the virus sequences in the Portuguese clinical database using the BEAST software [3]. You will compare the results of the van Sighem and BEAST methods, and formulate a conclusion discussing the strengths and limitations of both.


[1] van Sighem, A., Nakagawa, F., De Angelis, D., Quinten, C., Bezemer, D., de Coul, E. O., … Phillips, A. (2015). Estimating HIV Incidence, Time to Diagnosis, and the Undiagnosed HIV Epidemic Using Routine Surveillance Data. Epidemiology, 26(5), 653–660.

[2] Drummond, A. J., Rambaut, A., Shapiro, B., & Pybus, O. G. (2005). Bayesian coalescent inference of past population dynamics from molecular sequences. Molecular Biology and Evolution, 22(5), 1185–1192.

[3] Drummond, A. J., & Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1), 214. 

Research context

This research project offers challenges in both computational biology and computer science. If your research efforts are successful, the results will be used to improve our understanding of HIV diagnostic strategies and to inform epidemiological models. Your research will thereby contribute to the state of the art of HIV research and assist in the formulation of new diagnostic strategies.

We believe that, if this research project is executed successfully, the results will be suitable for presentation at a peer-reviewed conference or publication in a peer-reviewed journal.


Prior knowledge in the field of machine learning is required (e.g. the course Machine Learning or Statistical Machine Learning). Prior knowledge in dealing with genetic data is recommended; we therefore strongly suggest following the master course "Algorithms in Computational Biology and Bioinformatics". The project does not require any prior knowledge of epidemiology, but an interest in the subject is highly appreciated.


prof. dr. Ann Nowé 


Contact Pieter Libin for more information, by e-mail or by visiting him at 10.G.730A.