Nafisah, Ibrahim Ali H (2015) Statistical analysis of genomic binding sites using high-throughput ChIP-seq data. PhD thesis, University of Leeds.
Abstract
This thesis focuses on the statistical analysis of Chromatin immunoprecipitation
sequencing (ChIP-Seq) data produced by Next Generation Sequencing (NGS). ChIP-Seq
is a method to investigate interactions between protein and DNA. Specifically, the method
aims to identify the binding sites of a particular protein of interest, such as a transcription
factor, in the genome. In the context of cancer research, this information is important
to check whether, for example, a particular transcription factor can be considered as a
therapeutic target.
The sequence data produced by a ChIP-Seq experiment are in the form of mapped short
sequences, which are called reads. The reads are counted at each single genomic position,
and the read counts are the data to be analysed. There are many problems related to the
analysis of ChIP-Seq data, and in this research we focus on three of them.
First, in the analysis of ChIP-Seq data, the genome is not analysed in its entirety; instead
the intensity of read counts is estimated locally. Estimating the intensity of read counts
usually involves dividing the genome into small regions (windows). If the window size
is too small, noise (low read counts) dominates and many windows are empty.
If the window size is too large, each window aggregates many small read
counts, which smooths out important features. Researchers therefore need a principled
way to choose an appropriate window size. To address this problem,
we developed an approach that optimises the window size based on
histogram construction. The methodology is
published in [46].
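The window-size trade-off described above can be illustrated with a minimal sketch. It is not the optimisation method of the thesis. It only counts simulated reads in non-overlapping windows of several sizes and reports how many windows are empty. The function names, the toy genome length, the read numbers and the 0-based read positions are all assumptions made for illustration.

```python
import random


def window_counts(read_positions, genome_length, window_size):
    """Count reads in consecutive, non-overlapping windows (0-based positions)."""
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_positions:
        counts[pos // window_size] += 1
    return counts


def empty_fraction(counts):
    """Fraction of windows that contain no reads."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Toy data (hypothetical): uniform background reads plus one enriched region.
rng = random.Random(0)
genome_length = 100_000
background = [rng.randrange(genome_length) for _ in range(500)]
peak = [
    min(genome_length - 1, max(0, int(rng.gauss(25_000, 200))))
    for _ in range(300)
]
reads = background + peak

for w in (10, 100, 1_000, 10_000):
    counts = window_counts(reads, genome_length, w)
    print(
        f"window={w:>6}  windows={len(counts):>6}  "
        f"empty={empty_fraction(counts):.3f}  max_count={max(counts)}"
    )
```

With the smallest window almost every window is empty, while with the largest window the enriched region is merged with the surrounding background into a single count.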
Second, different ChIP-Seq studies can target different transcription factors and therefore
reach different conclusions, which is expected. However, they all produce ChIP-Seq datasets,
and many of them are performed on the same genome, for example the human genome.
Is there, then, a pattern in the distribution of the read counts? If so, is the pattern
common to all ChIP-Seq data? Answering these questions can help in better understanding
the biology behind the experiment. We try to answer them by investigating
RUNX1/ETO ChIP-Seq data. We try to develop a statistical model that is able to describe
the data. We employ some observed features in ChIP-Seq data to improve the performance
of the model. Although we obtained a model that is able to describe the RUNX1/ETO
data, the model does not provide a good statistical fit to the data.
Third, it is biologically important to know what changes (if any) occur at the binding sites
under some biological conditions, for example in knock-out experiments. Changes in the
binding sites can be either in the location of the sites or in the characteristics of the sites
(for example, the density of the read counts), or sometimes both. Current approaches for
differential binding site analysis suffer from two major drawbacks. First, their underlying
models are unclear because of dependencies between the methods used, for example between
peak finding and testing methods. Second, they lack accurate control of the type-I error.
Hence there is a need for approaches that address these drawbacks. To this end, we developed three
statistical tests that are able to detect significantly differential regions between two ChIP-Seq
datasets. The tests are evaluated and compared to some current methodologies by
using simulated and real ChIP-Seq datasets. The proposed tests exhibit greater power and
accuracy than current methodologies.
Metadata
Supervisors: | GUSNANTO, A and TAYLOR, C and WESTHEAD, D |
---|---|
Awarding institution: | University of Leeds |
Academic Units: | The University of Leeds > Faculty of Maths and Physical Sciences (Leeds) > School of Mathematics (Leeds) > Statistics (Leeds) |
Identification Number/EthosID: | uk.bl.ethos.682274 |
Depositing User: | MR I A H NAFISAH |
Date Deposited: | 13 Apr 2016 08:49 |
Last Modified: | 26 Apr 2016 15:45 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:12475 |
Download
Final eThesis - complete (pdf)
Filename: IBRAHIM_NAFISAH.pdf
Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License