Hello everybody,
I am fairly new to bioinformatics, so I have some basic concept questions that I hope you guys can help me to understand.
I hope I'll manage to express myself properly, so you know what my questions are about.
Let's say I download a reference genome, e.g. the new GRCh38. What I get is essentially a huge string (plus headers > 1 ….) of bases and N.
If I understand it correctly, this is the 5’ to 3’ nucleotide sequence of the respective chromosome. So here is my first question: given that the DNA we are sequencing is double-stranded, which single strand is represented in the reference genome? Adding to that question: do different reference sequences, e.g. hg38 and GRCh38, use the same single strand?
If so, how was it determined which strand is the “reference strand”?
Now to my next question. Imagine we are doing paired-end DNA-seq and prepare our library so that the first read should align to the forward strand and the second read should align to the reverse strand (if I am not mistaken, this is represented as F1R2).
When aligning the reads to my reference sequence, the first read of the pair should be present in the reference sequence exactly, base for base (given no SNP/INDEL). Theoretically, I could search through my GRCh38.fa file (e.g. with Ctrl+F) and find the exact match (again, given no SNP/INDEL/soft-clipping). Is that correct?
For read 2 of the pair it gets trickier. If I understand it correctly, the reverse complement of the sequenced read 2 should be present in the reference genome.
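To make the read 1 / read 2 relationship concrete, here is a minimal Python sketch. The sequences are made-up toy strings, not real reads, and the helper name `reverse_complement` is just an illustrative choice.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Toy "reference" fragment, written 5' to 3' on the forward strand.
reference_fragment = "GATTACAGGCTTCCAAGT"

# Read 1 comes from the forward strand, so it is a plain substring.
read1 = reference_fragment[:6]

# Read 2 is sequenced 5' to 3' on the opposite strand, so what the
# sequencer reports is the reverse complement of the reference text.
read2 = reverse_complement(reference_fragment[-6:])

print(read1 in reference_fragment)                      # True
print(read2 in reference_fragment)                      # False
print(reverse_complement(read2) in reference_fragment)  # True
```

In this toy case read 2 as sequenced is not found by a plain text search, but its reverse complement is.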
If this is correct, does that mean that when using an aligner like bwa mem, read 2 will get converted to the reverse complement on the fly and then aligned?
If this is the case, why don’t we have to tell bwa mem about our library layout (e.g. specify F1R2)? Can the program infer the layout by trying to align a subset of the reads and looking at the alignment rates?
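One way to see which orientation an aligner chose is the FLAG field of its SAM output, where bit 0x10 marks a read aligned to the reverse strand (its sequence is then stored reverse-complemented, i.e. as it appears on the forward reference strand). A minimal sketch for decoding the relevant bits follows; the function name `describe_flag` is hypothetical, and the values 99 and 147 are only illustrative of a typical forward/reverse pair, not taken from any real alignment here.

```python
def describe_flag(flag: int) -> dict:
    """Decode the pairing and strand bits of a SAM FLAG value."""
    return {
        "paired": bool(flag & 0x1),
        "proper_pair": bool(flag & 0x2),
        "reverse_strand": bool(flag & 0x10),
        "mate_reverse_strand": bool(flag & 0x20),
        "first_in_pair": bool(flag & 0x40),
        "second_in_pair": bool(flag & 0x80),
    }


# Illustrative flags for a pair where read 1 maps forward and read 2 reverse.
for flag in (99, 147):
    bits = describe_flag(flag)
    which = "read 1" if bits["first_in_pair"] else "read 2"
    strand = "reverse" if bits["reverse_strand"] else "forward"
    print(flag, which, strand)
```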
If anybody could help me answer even some of these questions, I’d be extremely thankful!
Cheers!
Sequencing always happens in the 5' --> 3' direction. The strand present in the reference is simply the strand that was assembled from the data. There is only one reference sequence build for the major model organisms (human/mouse), and it is produced by the Genome Reference Consortium. GENCODE uses that sequence to annotate official gene features. Places like Ensembl and UCSC may provide additional annotation, but the underlying reference sequence is the same everywhere, so hg38 and GRCh38 represent the same strand. Only major genome build releases change chromosome coordinates. Minor (patch) releases don't.
Since DNA is anti-parallel and you fragment it before making libraries, each strand has an equal probability of getting sequenced. So you could do what you are proposing, as long as you search for both strands in the 5' --> 3' direction, i.e. for the read and for its reverse complement. Aligners automatically reverse-complement your reads as needed when aligning to the reference.
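As a rough sketch of that manual search, the Python below looks for a read and its reverse complement in a FASTA file. It joins the wrapped sequence lines first, because a read that spans a line break would be missed by a plain text search in an editor, and it upper-cases the sequence in case the file uses lowercase soft-masking. The function names and the toy FASTA record are assumptions for illustration; for a real GRCh38.fa you would pass an open file handle instead, and this is not how an aligner works internally.

```python
import io

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(handle):
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def find_read(handle, read):
    """Return (name, 0-based start, strand) for every exact match."""
    read = read.upper()
    queries = [("+", read), ("-", reverse_complement(read))]
    hits = []
    for name, seq in read_fasta(handle):
        for strand, query in queries:
            pos = seq.find(query)
            while pos != -1:
                hits.append((name, pos, strand))
                pos = seq.find(query, pos + 1)
    return hits


# Toy FASTA with a wrapped sequence (made-up data).
toy_fasta = ">toy_chr\nGATTACAGGC\nTTCCAAGT\n"

# This read spans the line break, so a plain Ctrl+F would miss it.
print(find_read(io.StringIO(toy_fasta), "AGGCTT"))
# This read only matches as its reverse complement ("-" strand).
print(find_read(io.StringIO(toy_fasta), "ACTTGG"))
```

Loading whole chromosomes into memory like this is fine for a quick check of a few reads, but it only finds exact matches.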