I'm still fairly new to bioinformatics, so I've been trying to establish my own "process" for finding, understanding and working with DNA samples, particularly for obtaining raw reads for alignment and possibly searching for SNPs afterwards.
I used ENCODE's browsing features to reach the following experiment: https://www.encodeproject.org/experiments/ENCSR000DPV/
I've tried to break down the information displayed there into the key, important parts for me. I've listed some of that beneath the "Understanding" header below. Could you please fill in any gaps or correct me where my understanding is lacking or simply wrong?
Also, beneath the "Questions" header, I've requested information on some specific things that I'm struggling with. If you could provide answers to any of those questions, I'd be very appreciative.
Thank you in advance.
Understanding
- It's "ChIP-seq", not "RNA-seq", so for alignment I don't need to use TopHat - Bowtie 2 on its own should work fine.
- It's mapped against the hg19 reference genome, so if I wish to obtain similar mapping results I should use that genome too (or rather, should I just always use it, since it's the current standard?).
- As per this Biostars post, a link to download the hg19 reference in use by the 1000 genomes project (which is obviously the good stuff) can be found here: NCBI FTP.
- It seems that the reads from the experiment aren't related to any particular chromosome (since I can't find anything mentioning chromosomes), so for alignment I should just download the .fasta for the entire hg19 genome (currently human_g1k_v37.fasta.gz) instead of for any particular chromosome.
- The full hg19 reference appears to have bunches of lines simply reading "NNNNNNNNNNNN...". I assume this is to act as a separation (or "gap") between chromosomes. This will not be detrimental to alignment.
- This experiment does NOT use paired-end reads. I don't need to worry about --split-files or anything like that.
- There are 2 "biological samples". This means that ChIP-seq was performed twice on similar cells to produce two sets of reads for reliability - they weren't just programmatically replicated. I can choose to use either one, or both of them in my work.
- The raw sequence reads that I'd want to use for alignment are the .fastq files with an "Output type" of "reads" in the table of linked files. They're not already aligned, right?
- The experiment has a "Control". This is due to the presence of another attribute: "Antibody". I guess this means an antibody was used on the subject and so the DNA with the antibody is being compared to the control - that without the antibody.
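To check my assumption about the "NNNN..." lines rather than guess, here is a minimal Python sketch of how the runs of N could be tallied per sequence. It assumes the reference is an ordinary FASTA file (optionally gzipped); the function names n_run_lengths and summarize are my own, and I haven't timed this on the full genome, where it may be slow.

```python
import gzip
import re
from collections import defaultdict


def n_run_lengths(lines):
    """Return {sequence_name: [length of each run of N, in file order]}.

    `lines` is any iterable of FASTA text lines. Runs that continue
    across line breaks are counted as a single run.
    """
    runs = defaultdict(list)
    name = None
    current = 0  # length of the N run currently being extended
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new sequence starts: close any run still open.
            if current:
                runs[name].append(current)
                current = 0
            name = line[1:].split()[0]
            runs[name]  # make sure sequences without N still appear
            continue
        # Split the line into alternating N and non-N chunks.
        for match in re.finditer(r"N+|[^N]+", line.upper()):
            chunk = match.group()
            if chunk[0] == "N":
                current += len(chunk)
            elif current:
                runs[name].append(current)
                current = 0
    if current:
        runs[name].append(current)
    return dict(runs)


def summarize(path):
    """Print the number of N runs and total N bases for each sequence."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for seq, lengths in n_run_lengths(handle).items():
            print(seq, len(lengths), sum(lengths))
```

Calling summarize("human_g1k_v37.fasta.gz") should then show whether the N runs only sit between sequences or also occur inside them.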
Questions
- One of the attributes for the experiment, "Target", has a value "CTCF". Does this mean the reads, and this experiment in general, are focussed on a particular gene that's labelled "CTCF"? If not a gene, what is "CTCF"?
- The size of the hg19 reference I downloaded, once extracted, is 2.9 GB. However, the size of the reads from the experiment is far less than that. Since reads also include overlaps, this means that only a small portion of the human genome is covered by these reads. How can I tell which portion of the genome is covered, possibly so I can download only that section/those chromosome(s) of the genome instead of the whole thing? Is this based on the "Target" attribute?
- What, if any, are the relationships between the "Target" and the "Antibody"? Was the antibody somehow applied specifically to the target?
- It is my understanding that with the ENCODE project, the provided .fastq reads are very "raw" - i.e. submitted almost directly from the sequencing machines once they've finished, without much modification (see https://www.encodeproject.org/help/file-formats/). Is it recommended that I trim or perform any other operations on these reads before I use them for alignment? What operations are usually necessary?
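To put numbers on the size comparison and the trimming question, here is a minimal Python sketch that counts reads and bases in a FASTQ file, averages the base qualities, and converts the base count into an approximate fold coverage. It makes several assumptions that I haven't confirmed for this experiment: records are exactly four lines each, qualities use the Phred+33 encoding, and the genome is roughly 2.9 billion bases. The function names are my own.

```python
import gzip


def fastq_stats(lines, phred_offset=33):
    """Summarize FASTQ text lines, assuming strict 4-line records.

    phred_offset=33 is an assumption (Phred+33 encoding); older data
    may use an offset of 64 instead.
    """
    reads = 0
    bases = 0
    quality_sum = 0
    for index, line in enumerate(lines):
        line = line.rstrip("\r\n")
        field = index % 4  # 0: header, 1: sequence, 2: '+', 3: qualities
        if field == 1:
            reads += 1
            bases += len(line)
        elif field == 3:
            quality_sum += sum(ord(char) - phred_offset for char in line)
    mean_quality = quality_sum / bases if bases else 0.0
    return {"reads": reads, "bases": bases, "mean_quality": mean_quality}


def fold_coverage(total_bases, genome_size=2.9e9):
    """Approximate coverage; genome_size is an assumed round figure."""
    return total_bases / genome_size


def stats_for_file(path):
    """Run fastq_stats on a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fastq_stats(handle)
```

For example, stats = stats_for_file(path) on one of the downloaded .fastq files, followed by fold_coverage(stats["bases"]), would give the average number of times each base of the genome could be covered if the reads were spread evenly.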