Sumner completed a bachelor of science in neuroscience, mathematics and computer science and now pursues a PhD at the German Center for Neurodegenerative Diseases and European Neuroscience Institute. Experienced on both sides of the bench, he utilizes his domain knowledge to improve the performance of various biological analyses. Sumner enjoys the study of neural networks (physical, theoretical and computational) as well as graph theory.
Project: Exon Prediction
Goal of the project:
Because exons are poorly annotated, they are rarely studied in their own right; analyses usually fall back on the genes that contain them. However, as protein-coding regions, exons can carry mutations that give rise to disease, which makes them essential to understand in pathology. Recently, over 500 novel, mutually exclusively spliced exons were detected, suggesting that the human genome might harbor tens of thousands of yet undiscovered exons, many of which may carry disease-causing mutations. I aim to use deep learning to significantly extend the set of known human exons. Exon candidates can then be validated experimentally.
Summary of work:
Exonic and intronic regions were extracted from the human genome using data provided by the UCSC Genome Browser and the BEDTools suite v2.6. A convolutional neural network augmented with a long short-term memory (LSTM) layer was constructed and trained on exonic sequences between 200 and 500 nucleotides (NTs) long, each padded to a length of 600 NTs.
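The length filtering and padding step described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the one-hot encoding, the A/C/G/T channel order, the right-side zero padding and the helper names (`keep_by_length`, `encode_and_pad`) are all assumptions made for the example.

```python
# Sketch only: encoding scheme, padding side and function names are assumed,
# not taken from the project itself.
ALPHABET = "ACGT"                                # assumed channel order
MIN_LEN, MAX_LEN, PADDED_LEN = 200, 500, 600     # lengths stated in the summary


def keep_by_length(seqs, lo=MIN_LEN, hi=MAX_LEN):
    """Keep only sequences whose length lies in [lo, hi] nucleotides."""
    return [s for s in seqs if lo <= len(s) <= hi]


def one_hot(base):
    """Return a 4-element one-hot vector; unknown bases (e.g. N) become all zeros."""
    return [1 if base == b else 0 for b in ALPHABET]


def encode_and_pad(seq, padded_len=PADDED_LEN):
    """One-hot encode a sequence and right-pad it with all-zero rows."""
    if len(seq) > padded_len:
        raise ValueError("sequence is longer than the padded length")
    rows = [one_hot(base) for base in seq.upper()]
    rows.extend([0, 0, 0, 0] for _ in range(padded_len - len(seq)))
    return rows


if __name__ == "__main__":
    candidates = ["ACGT" * 50, "ACGT" * 10]      # 200 NTs and 40 NTs
    kept = keep_by_length(candidates)            # only the 200-NT sequence passes
    matrix = encode_and_pad(kept[0])
    print(len(kept), len(matrix), len(matrix[0]))
```

Each retained sequence becomes a 600 x 4 matrix, a fixed-size input of the kind a convolutional layer expects. Whether padding is applied on the left, the right or both sides is a design choice that the summary does not specify.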
Results and future work:
Following training, the network reached ~87.1% accuracy on the test set and ~90.0% accuracy on the training set; the small gap of about 2.9 percentage points suggests that the trained network does not substantially overfit. While ~87.1% accuracy is respectable for raw DNA sequences, the network can likely be improved by also supplying it with known genomic features. Future work will incorporate a feature vector into the model and aim to validate novel, predicted exons.