Projects
Pyrite – Unlocking Docking
Bachelor thesis, completing the bachelor of Pharmaceutical Sciences at VU Amsterdam.
Grade: 9.5
I am further pursuing the Pyrite project within the group of Daan Geerke together with David Poole.
Thesis abstract:
Pyrite is a new, open, extensible, and accessible Python-based framework for molecular docking that aims to allow straightforward testing and development of existing and novel docking methods. We introduce (i) a grid-based binding pocket detection and scoring algorithm that guides searching to plausible binding sites, (ii) a crowding and niching strategy that enforces pose diversity, and (iii) seamless integration with existing biochemical tools and optimization suites such as RDKit, SciPy, and pygmo. Benchmarking on the 2024 PDBbind demo set shows that using the Pocket function accelerates searching 2-6× compared to the Vina and Piecewise Linear Potential functions while maintaining, or in blind docking even surpassing, their accuracy. An example reference workflow (pocket + crowding for search with ant colony optimization, Piecewise Linear Potential for refinement) places 37 ± 6 % of poses within 2 Å RMSD for autoboxed docking, and 16 ± 3 % in whole-protein mode. Error analysis reveals that most failures stem from the search phase rather than from scoring: overly stringent pocket-depth filters and insufficient niching cause the accuracy loss during searching. Both problems are addressed in Pyrite's development roadmap.
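To make the accuracy figures above concrete, here is a minimal sketch of how a "fraction of poses within 2 Å RMSD" metric can be computed. It is an illustration only and not Pyrite's code: the function names `rmsd` and `success_rate` and the toy coordinates are assumptions made for this example, and the RMSD is taken without alignment or symmetry correction.

```python
import math


def rmsd(pose, reference):
    """RMSD between two equally long lists of (x, y, z) atom coordinates.

    No alignment or symmetry correction is applied (a simplifying assumption).
    """
    if len(pose) != len(reference) or not pose:
        raise ValueError("pose and reference must be non-empty and equally long")
    squared = [
        sum((a - b) ** 2 for a, b in zip(atom_p, atom_r))
        for atom_p, atom_r in zip(pose, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def success_rate(poses, references, threshold=2.0):
    """Fraction of predicted poses whose RMSD to their reference is <= threshold (in Å)."""
    hits = [rmsd(p, r) <= threshold for p, r in zip(poses, references)]
    return sum(hits) / len(hits)


def shift_x(coords, dx):
    """Helper for the toy data: translate every atom along x by dx."""
    return [(x + dx, y, z) for x, y, z in coords]


# Toy data: one three-atom "ligand" and two hypothetical predicted poses.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
good_pose = shift_x(reference, 0.5)  # 0.5 Å away from the reference
bad_pose = shift_x(reference, 3.0)   # 3.0 Å away from the reference

print(success_rate([good_pose, bad_pose], [reference, reference]))  # 0.5
```

In a real benchmark the poses would come from the docking run and the references from the crystal structures, and symmetry-equivalent atoms would usually need to be accounted for.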
Error Analysis – SARS-CoV-2 variant detection
Bachelor thesis, completing the bachelor of Computer Science and Engineering at TU Delft.
Grade: 9.0
Research published as part of Baaijens, Jasmijn A., et al., ‘Lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques’ (https://doi.org/10.1186/s13059-022-02805-9).
Thesis abstract:
Since the start of the SARS-CoV-2 pandemic, the monitoring of SARS-CoV-2 by way of viral RNA sequencing of wastewater has proven to be an efficient and effective way of estimating COVID-19 cases in population groups. A recently developed pipeline also enables us to estimate SARS-CoV-2 variant abundance using viral samples from wastewater. This is done by repurposing an RNA-seq quantification algorithm to quantify reads belonging to variants from DNA-sequencing data. However, the impact of sequencing errors and contaminating viruses on this process is unknown. Here I show that, in simulated data, the credibility of the prediction results depends on the error rate of the sequencing machines used. I also show that contaminating the simulated dataset with certain human coronaviruses has a significant effect on prediction accuracy. However, most viruses currently found in wastewater have no effect. Furthermore, adding a reference genome for these human coronaviruses to the reference set removes any impact. The results demonstrate that it is important to assess the credibility of the pipeline on a case-by-case basis and to tailor the testing setup and reference set to this assessment.
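As an illustration of what varying the sequencing error rate in simulated data can look like, the sketch below injects random substitution errors into a read at a chosen per-base rate. It is not the simulator used in the thesis: the function name `add_substitution_errors`, the uniform substitution model, and the example read are all assumptions made for this example.

```python
import random

BASES = "ACGT"


def add_substitution_errors(read, error_rate, rng):
    """Return a copy of `read` in which each A/C/G/T base is replaced by a
    different base with probability `error_rate` (uniform substitution model)."""
    out = []
    for base in read:
        if base in BASES and rng.random() < error_rate:
            # Pick one of the three other bases, so an error always changes the base.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(42)           # fixed seed so the simulation is reproducible
read = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC"  # arbitrary example read

for rate in (0.0, 0.01, 0.1):
    noisy = add_substitution_errors(read, rate, rng)
    mismatches = sum(a != b for a, b in zip(read, noisy))
    print(rate, mismatches)
```

Real sequencing errors also include insertions and deletions and depend on position and platform, so dedicated read simulators model considerably more than this.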