I am very new to this and need to clear up some confusion in my analysis. Two related points:
1) I have a FASTQ file with 151-base reads, and its pair also has 151-base reads. Should I assume that the 151 bases include the adapters, so the actual read length is less than 151? I ask because of this link: "What is the difference between a Read and a Fragment in RNA-seq?" Is that because the FASTQ file contains the read plus the adapters?
2) The reads overlap nicely, but is the overlap too long? The reads overlap by 100 bases. Based on my understanding, this is not good. Am I correct?
You should run fastqc and see if it indicates that there's adapter contamination. If the fragment sizes were >151 bases (this is quite likely) then you'll have no adapters on the 3' end.
Whether the reads overlap or not is irrelevant for most cases. If you want to do assembly then often people like to merge overlapping reads (this helps with error correction). If you're doing differential expression then you just map without any merging because the aligners can handle that without problems. If your original fragments were 100 bases long and your reads are 151 bases, then you're going to want to remove the adapters on each end (or use local alignment and just be done with it).
Thanks so much. I am OK with your first point. I mostly got the second point too, but I am confused by "if your original fragments were 100 bases long". I mean that the reads overlap by 100 bases, so I don't get what you mean.
In short, there are two cases. In the first, the original fragment of DNA loaded onto the sequencer is at least as long as the read length, so the reads contain no adapter sequence. In the second case, the original fragment is shorter than the read length, and you get:
   ==========>### read1
###<==========    read2
where # is adapter sequence. You can have a 100 base overlap in either case, but in one you have adapter contamination due to short fragment lengths and in the other you don't.
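The arithmetic behind the two cases can be sketched in Python. This is only an illustration: the function name `pair_geometry` is made up here, and it assumes both reads have the same length and that each read starts exactly at one end of the insert.

```python
def pair_geometry(insert_len, read_len):
    """Return (overlap_bases, adapter_bases_per_read) for one read pair.

    Assumes both reads are read_len long and start at opposite ends
    of an insert that is insert_len bases long.
    """
    if insert_len >= read_len:
        # No read-through: the reads overlap only if together they
        # are longer than the insert.
        overlap = max(0, 2 * read_len - insert_len)
        adapter = 0
    else:
        # Read-through: each read covers the whole insert and then
        # continues into the adapter.
        overlap = insert_len
        adapter = read_len - insert_len
    return overlap, adapter


print(pair_geometry(202, 151))  # (100, 0): 100 base overlap, no adapter
print(pair_geometry(100, 151))  # (100, 51): 100 base overlap, 51 adapter bases per read
```

Both calls give a 100 base overlap with 151 base reads, but only the 100 base insert leaves adapter sequence in the reads.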
1) No, you cannot assume this. Adapters may or may not be present on each individual read. Normally you would use another program to check the quality of the FASTQ file, and its output tells you whether adapter sequence is present and in roughly what proportion of the reads. I recommend a program called FastQC.
2) I am assuming that when you say the "reads overlap" you mean the two reads of a pair. Please define A) overlapping, B) how you determined how much the reads overlap, and C) why overlapping is important to you.
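For point 1, a rough idea of what an adapter check does can be sketched in Python. This is not how FastQC is implemented. It only counts exact matches, and it assumes the library uses Illumina TruSeq-style adapters, whose read 1 and read 2 adapters share the 13-base prefix below. The function name and the example reads are made up for illustration.

```python
import io

# Assumption: common 13-base prefix of Illumina TruSeq adapters.
# Check which adapters your library prep kit actually uses.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def count_adapter_reads(fastq_lines, adapter=ADAPTER_PREFIX):
    """Return (reads_with_adapter, total_reads) for a 4-line-per-record FASTQ."""
    with_adapter = 0
    total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        total += 1
        if adapter in line.strip().upper():
            with_adapter += 1
    return with_adapter, total


# Made-up two-record example; only the first read runs into the adapter.
example = io.StringIO(
    "@read_a\nACGTACGTAGATCGGAAGAGCACAC\n+\n" + "I" * 25 + "\n"
    "@read_b\nACGTACGTACGTACGTACGTACGTA\n+\n" + "I" * 25 + "\n"
)
print(count_adapter_reads(example))  # (1, 2)
```

On a real file you would pass an open file handle instead of the `io.StringIO` example. Exact matching misses adapters that contain sequencing errors, so treat the count as a lower bound.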
I mean that when the insert size is small, the two reads of a pair overlap, i.e. they have bases in common.
I am not asking about a specific problem in this post. I am still learning the ABCs and trying to make sure I understand things correctly.
So when the overlap is long, the insert size is small, and each read pair covers less of the genome. I want to know what people consider normal for small insert sizes and long overlaps.
Thanks so much.
Thanks so much, Devon. So if I have an insert size of ~200 and 2 x 151 reads, then I am in the second case, right?
If your inserts (i.e., the fragments before ligating adapters) are ~200 then you won't have much if any adapter contamination.
I'm confused. If the insert size (not the fragment size) is 200, then am I in the second case? And if the insert size is 180, then we are in the second case?
If your insert is >= your read length then you are in the first case. If the insert is smaller than the read length then you are in the second case.
In either case, you can always just run things through an adapter trimmer, it's not going to hurt anything.
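To give a feel for what an adapter trimmer does, here is a toy sketch in Python. It is an assumption-laden illustration, not the algorithm of any particular tool: real trimmers also tolerate mismatches and use base qualities, while this one only handles exact matches. The function name `trim_adapter` and the `min_overlap` parameter are made up for this example.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove adapter sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter fits before the read ends.
    # Try the longest possible partial match first.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adapter found: return the read unchanged.
    return read


adapter = "AGATCGGAAGAGC"  # assumed TruSeq-style adapter prefix
print(trim_adapter("ACGTACGTAGATCGGAAGAGC", adapter))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC", adapter))          # ACGTACGT
print(trim_adapter("ACGTACGTACGT", adapter))           # ACGTACGTACGT
```

A read without adapter comes back unchanged, which is why running a trimmer on clean data does no harm.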