Clustering Large Raw DNA Sequencing Datasets by Species of Origin using Signature Features of Genomic Sequence Composition

Hodges, Tobias (2012) Clustering Large Raw DNA Sequencing Datasets by Species of Origin using Signature Features of Genomic Sequence Composition. PhD thesis, University of York.

Available under License Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 UK: England & Wales.



The establishment of high-throughput, massively parallel DNA sequencing technology has broadened the scope of metagenomics. The size and complexity of the datasets produced in such studies present considerable challenges. The aim of this project was to investigate the potential for genomic signature features to be applied to raw high-throughput sequencing reads generated from multi-species samples. Grouping reads according to the genome from which they originate could enable the study of previously unknown or poorly understood pathogens, and improve the assembly of genome sequences from these reads. Genomic signatures were compared to find the best feature, or combination of features, for grouping reads by species of origin. A range of datasets was developed to provide an effective basis for such analysis. The performance of a number of clustering methods was also compared. The accuracy of grouping that could be achieved was evaluated, and the effect of such a grouping on the performance of sequence assembly was assessed. It was found that perfect species-specific grouping of raw sequencing data was outside the scope of the approaches assessed here, but that the enrichment of groups for reads from particular species was achievable. The single greatest obstacle to effective grouping was thought to be the short length of reads produced by current sequencing platforms. The individual assembly of grouped reads was found to produce results similar to those from assembling the dataset as a whole, but with a reduction in the time required. The future of DNA sequencing is bright, with technology advancing at a startling pace, providing improvements in read length, dataset size and experimental run-time. It is hoped that these advancements will prove beneficial to the approaches investigated here, which are likely to remain useful as the size and complexity of datasets increase.
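
As an illustration of the general idea of grouping reads by a sequence-composition signature, the following is a minimal Python sketch. It is not the method used in the thesis: the abstract does not say which signature features or clustering methods were compared. The sketch assumes a normalised k-mer frequency vector as the signature and a small k-means-style loop with deterministic farthest-point seeding as the clustering step. The function names `kmer_signature`, `euclidean` and `cluster_reads`, and the toy reads, are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_signature(read, k=4):
    """Normalised k-mer frequency vector of a read (assumed signature feature)."""
    read = read.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    # Only k-mers made purely of A, C, G, T are counted; others (e.g. with N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_reads(reads, n_clusters=2, k=4, n_iter=10):
    """Group reads by signature with a simple k-means-style loop (illustrative only)."""
    sigs = [kmer_signature(r, k) for r in reads]
    # Deterministic seeding: start from the first read, then repeatedly add
    # the signature that lies farthest from all centroids chosen so far.
    centroids = [sigs[0]]
    while len(centroids) < n_clusters:
        centroids.append(
            max(sigs, key=lambda s: min(euclidean(s, c) for c in centroids))
        )
    labels = [0] * len(sigs)
    for _ in range(n_iter):
        # Assign each read to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda j: euclidean(s, centroids[j]))
            for s in sigs
        ]
        # Recompute each centroid as the mean signature of its members.
        for j in range(n_clusters):
            members = [s for s, lab in zip(sigs, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy reads (made up for illustration): two AT-rich and two GC-rich.
toy_reads = [
    "ATATATATTAATATAT",
    "TATAATATATTTAATA",
    "GCGCGGCCGCGCCGGC",
    "CGCGGCGCCGGCGCGC",
]
toy_labels = cluster_reads(toy_reads, n_clusters=2, k=2)
```

On these toy reads the two AT-rich sequences receive one label and the two GC-rich sequences the other. Real sequencing reads from related species would be far harder to separate, which is consistent with the abstract's observation that short read length limits how well reads can be grouped.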

Item Type: Thesis (PhD)
Academic Units: The University of York > Biology (York)
Identification Number/EthosID: uk.bl.ethos.564176
Depositing User: Dr Tobias Hodges
Date Deposited: 10 Jan 2013 14:54
Last Modified: 08 Sep 2016 13:01
URI: http://etheses.whiterose.ac.uk/id/eprint/3202
