
Wolfram Summer School

Alumni

Karthik Gangavarapu

Science and Technology

Class of 2017

Bio

Karthik is a PhD student in the Andersen Lab and the Su Lab at the Scripps Research Institute. His current work involves epidemiological analysis of infectious disease outbreaks and applying computational methods to detect pathogens from genomic sequencing data in an unbiased manner. He did his undergraduate studies at the Birla Institute of Technology and Science, Pilani, India. As an undergraduate, he cofounded Tune Patrol to help indie music artists monetize their music on the internet. He also took part in the Google Summer of Code program, during which he worked with the Su Lab at the Scripps Research Institute.

His most recent project analyzed the introduction of the Zika virus into Florida, USA, during the 2015–2016 Zika virus pandemic. He is interested in applying the NKS methodology in his research to develop epidemiological models for monitoring infectious disease outbreaks.

Computational Essay

Metropolis-Hastings Algorithm

Project: Predicting Pathogenicity from Viral Genomic Sequences

Goal of the project:

The goal of the project is to train artificial neural networks to identify pathogenic viral species from genomic sequences. Viral genomic sequences will be extracted from the NCBI Reference Sequence Database (RefSeq), and the pathogen information will be extracted from the Human Disease Ontology. These labeled viral genomes will then be used to train and test a convolutional neural network.

Summary of work:

FASTA files from RefSeq were downloaded and labeled using the pathogen information obtained from a locally hosted Neo4j instance of the Human Disease Ontology. The labeled data was then uploaded to the Wolfram Data Repository. Each viral sequence was split into chunks of 1,000 bp, and each chunk was labeled as pathogenic or non-pathogenic. These labeled chunks were then used to train and test a convolutional neural network.
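The chunking and labeling step can be sketched roughly as follows. This is a minimal Python illustration and not the project's actual code: the function names (`parse_fasta`, `chunk_sequence`, `label_chunks`), the use of a plain set of pathogenic record IDs in place of the Neo4j lookup, and the choice to drop a trailing chunk shorter than 1,000 bp are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Set, Tuple

CHUNK_SIZE = 1000  # chunk length in base pairs, as stated in the summary


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {record_id: sequence}."""
    parts: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # The record ID is taken to be the first token of the header line.
            current = header[0] if header else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line.upper())
    return {rid: "".join(seq) for rid, seq in parts.items()}


def chunk_sequence(seq: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield consecutive, non-overlapping chunks of exactly `size` bases.

    Assumption: a trailing remainder shorter than `size` is dropped.
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


def label_chunks(
    records: Dict[str, str],
    pathogenic_ids: Set[str],  # hypothetical stand-in for the ontology lookup
    size: int = CHUNK_SIZE,
) -> List[Tuple[str, int]]:
    """Return (chunk, label) pairs; label 1 = pathogenic, 0 = non-pathogenic."""
    labeled = []
    for rid, seq in records.items():
        label = 1 if rid in pathogenic_ids else 0
        for chunk in chunk_sequence(seq, size):
            labeled.append((chunk, label))
    return labeled
```

Every chunk inherits the label of the genome it came from, which matches the description above but also means a single genome contributes many correlated examples; splitting training and test sets by genome rather than by chunk would be one way to account for that.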

Results and future work:

Experiments were conducted with multiple configurations of convolutional neural networks. The best-performing network among these attained an accuracy of 76% on a test set. Going forward, the training and test sets would need to be larger. Hybrid network configurations appear to give promising results. As an alternative to splitting the sequences into 1,000 bp chunks, each sequence could be converted into a sparse genome-feature matrix at the preprocessing stage.
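The text does not specify what the genome-feature matrix would contain. One possible reading, shown here purely as an assumption, is a matrix of k-mer counts with one row per genome and one column per observed k-mer. The Python sketch below stores each row as a dictionary of nonzero entries; the names `kmer_counts` and `sparse_feature_matrix` and the default `k = 3` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def sparse_feature_matrix(
    seqs: List[str], k: int = 3
) -> Tuple[List[Dict[int, int]], Dict[str, int]]:
    """Build a sparse genome-by-k-mer count matrix.

    Returns (rows, vocab): rows[i] maps column index -> count for genome i,
    and vocab maps each k-mer to its column index.
    """
    vocab: Dict[str, int] = {}
    rows: List[Dict[int, int]] = []
    for seq in seqs:
        row: Dict[int, int] = {}
        # Sort k-mers so column assignment is deterministic.
        for kmer, n in sorted(kmer_counts(seq, k).items()):
            col = vocab.setdefault(kmer, len(vocab))
            row[col] = n
        rows.append(row)
    return rows, vocab
```

A representation like this gives every genome a fixed-width feature vector regardless of its length, which is the main practical difference from chunking.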