Hope, Joshua (2020) Image Representations of DNA allow Classification by Convolutional Neural Networks. MSc by research thesis, University of York.
Abstract
In metagenomic analyses, the rapid and accurate identification of DNA sequences is important, but it is confounded by the existence of novel species not contained in databases. Many methods exist to identify sequences, and the increasing amount of sequencing data from high-throughput technologies makes the use of new deep learning methods more viable. To address this, Convolutional Neural Networks (CNNs) were used to classify DNA sequences of archaea, which are important in anaerobic digestion. CNNs were trained on two different image representations of DNA sequences, Chaos Game Representation (CGR) and Reshape, using sequences from three phyla of archaea together with randomly generated sequences. These were compared against simpler machine learning models trained on the 4-mer and 7-mer frequencies of the same sequences. The simpler models performed better than CNNs trained on either image representation, and Reshape was the poorest representation. However, shuffling sequences whilst preserving 4-mer counts showed that the Reshape model had learnt 4-mers as an important feature. The Reshape model was also able to perform equally well without depending on 4-mers, indicating that certain training regimes may uncover novel features. The errors of these models were also random or in weak disagreement, suggesting that ensemble methods would be viable and could help to identify problematic sequences.
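As a rough illustration of two of the inputs named in the abstract, the sketch below computes k-mer frequencies and a binned Chaos Game Representation for a single sequence. It is not the thesis's implementation: the function names `kmer_frequencies` and `cgr_image`, the corner assignment for A, C, G and T, the grid size, and the handling of ambiguous bases are all assumptions made for this example. The Reshape representation is not sketched because the abstract does not describe how it is built.

```python
from collections import Counter

def kmer_frequencies(seq, k=4):
    """Relative frequency of each overlapping k-mer made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

# Assumed corner layout; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_image(seq, size=16):
    """Count CGR points per cell of a size x size grid (a simple image)."""
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5  # start at the centre of the unit square
    for base in seq.upper():
        if base not in CORNERS:
            continue  # assumption: ambiguous bases such as N are skipped
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

A k-mer frequency dictionary of this kind could feed a simple classifier, while the grid returned by `cgr_image` could be treated as a single-channel image for a CNN.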
Metadata
| Supervisors: | James, Chong |
|---|---|
| Keywords: | CNNs, Deep Learning, Metagenomics |
| Awarding institution: | University of York |
| Academic Units: | The University of York > Biology (York) |
| Depositing User: | Mr Joshua Hope |
| Date Deposited: | 28 Jun 2021 09:42 |
| Last Modified: | 28 Jun 2021 09:42 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:28875 |
Download
Examined Thesis (PDF)
Filename: Hope_202003443_CorrectedThesisClean.pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License