Question

contents of the fastq file

0

7.0 years ago

Dayna ▴ 50

Hello

I am very new to this and have several points of confusion in my analysis. Two related questions:

1) I have a FASTQ file whose reads are 151 bases long, and its pair also has 151-base reads. Should I assume that the 151 bases include the adapters, so that the actual read length is less than 151? As in this link: What is the difference between a Read and a Fragment in RNA-seq? Is that because the FASTQ file contains the read plus adapters?
2) The two reads of each pair overlap by about 100 bases. Based on my understanding, such a long overlap is not good. Am I correct?

Thanks

fastq RNA-Seq • 2.1k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 7.0 years ago by Dayna ▴ 50

score 1 · Accepted Answer · 2017-11-17

1

7.0 years ago

Devon Ryan 104k

You should run fastqc and see if it indicates that there's adapter contamination. If the fragment sizes were >151 bases (this is quite likely) then you'll have no adapters on the 3' end.
Whether the reads overlap or not is irrelevant for most cases. If you want to do assembly then often people like to merge overlapping reads (this helps with error correction). If you're doing differential expression then you just map without any merging because the aligners can handle that without problems. If your original fragments were 100 bases long and your reads are 151 bases, then you're going to want to remove the adapters on each end (or use local alignment and just be done with it).

ADD COMMENT • link 7.0 years ago by Devon Ryan 104k

0

Thanks so much. I am OK with your first point. I got most of the second point, but I am confused by "if your original fragments were 100 bases long". My reads overlap by 100 bases, and I don't understand what you mean.

ADD REPLY • link 7.0 years ago by Dayna ▴ 50

1

In short, do you end up with this:

==========> read1
        <========== read2

or this:

==========> read1
<========== read2

In the second case, if the original fragment of DNA loaded onto the sequencer is shorter than the read length then you get:

   ==========>### read1
###<==========    read2

where # is adapter sequence. You can have 100 base overlap in either case, but in one you have adapter contamination due to short fragment lengths and in the other you don't.
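The two cases can be put into numbers. The following is only a minimal sketch of the arithmetic for an idealised pair (no indels, no quality trimming, both reads the same length); the function name `pair_geometry` and its return keys are made up for illustration.

```python
def pair_geometry(insert_len: int, read_len: int) -> dict:
    """Overlap and adapter read-through for an idealised read pair.

    insert_len: length of the fragment before adapters were ligated
    read_len:   number of cycles sequenced for each read of the pair
    """
    if insert_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")

    # Bases of each read that come from the insert itself.
    insert_bases = min(read_len, insert_len)
    # Bases at the 3' end of each read that run into the adapter.
    adapter_bases = max(0, read_len - insert_len)
    # Insert bases covered by both read1 and read2.
    overlap = max(0, 2 * insert_bases - insert_len)

    return {
        "insert_bases_per_read": insert_bases,
        "adapter_bases_per_read": adapter_bases,
        "overlap": overlap,
    }


# Same 100-base overlap, with and without adapter read-through.
print(pair_geometry(insert_len=202, read_len=151))
print(pair_geometry(insert_len=100, read_len=151))
```

With 151-base reads, a 202-base insert and a 100-base insert both give a 100-base overlap, but only the 100-base insert leaves 51 bases of adapter at the 3' end of each read.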

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Thanks so much, Devon. So if I have an insert size of ~200 and 2 x 151 reads, then I am in the second case, right?

ADD REPLY • link 7.0 years ago by Dayna ▴ 50

0

If your inserts (i.e., the fragments before ligating adapters) are ~200 then you won't have much if any adapter contamination.

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

I'm confused. If the insert size (not the fragment size) is 200, am I in the second case? And if the insert size is 180, are we in the second case?

ADD REPLY • link 7.0 years ago by Dayna ▴ 50

0

If your insert is >= your read length then you are in the first case. If the insert is smaller than the read length then you are in the second case.

In either case, you can always just run things through an adapter trimmer, it's not going to hurt anything.
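To make concrete what an adapter trimmer does to a read, here is a deliberately naive sketch. It assumes the adapter starts with `AGATCGGAAGAGC` (the common Illumina adapter prefix; check which kit was actually used) and only handles exact matches. The helper `trim_3prime_adapter` and its `min_match` parameter are hypothetical; real trimmers such as cutadapt or Trimmomatic also tolerate mismatches and use the pairing information.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, not taken from this thread


def trim_3prime_adapter(seq, qual, adapter=ADAPTER, min_match=5):
    """Cut the read at the first exact adapter match (naive sketch).

    Also removes a partial adapter of at least `min_match` bases that
    sits at the very end of the read. Returns (sequence, qualities).
    """
    cut = seq.find(adapter)
    if cut == -1:
        # Look for a truncated adapter at the 3' end, longest first.
        longest = min(len(adapter), len(seq)) - 1
        for k in range(longest, min_match - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut == -1:
        return seq, qual  # no adapter found, read is returned unchanged
    return seq[:cut], qual[:cut]


read = "ACGTACGTAC" + ADAPTER + "TTTT"
quals = "I" * len(read)
print(trim_3prime_adapter(read, quals))
```

A read without adapter comes back unchanged, which is why running a trimmer on data with long inserts is harmless.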

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

score 1 · Accepted Answer · 2017-11-17

1

7.0 years ago

BioinfGuru ★ 2.1k

1) No, you cannot assume this. Adapters may or may not be present on each individual read. Normally you would use another program to check the quality of the FASTQ file, and its report shows whether adapter sequence is present in the reads and at which positions. I recommend a program called FastQC.
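For a rough feel of what such a check looks at, the sketch below counts the fraction of reads that contain an assumed adapter prefix. It is not how FastQC is implemented; it assumes plain four-line FASTQ records and exact matching, and the function name `adapter_fraction` and the default adapter `AGATCGGAAGAGC` are illustrative assumptions.

```python
import io
from itertools import islice


def adapter_fraction(handle, adapter="AGATCGGAAGAGC"):
    """Fraction of reads whose sequence contains `adapter` exactly.

    `handle` is any iterable of lines in four-line FASTQ layout
    (header, sequence, '+', qualities).
    """
    total = 0
    hits = 0
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            break  # end of file or truncated final record
        total += 1
        if adapter in record[1]:
            hits += 1
    return hits / total if total else 0.0


# Tiny in-memory example: one read with adapter, one without.
example = io.StringIO(
    "@read1\nACGTACGTAGATCGGAAGAGCAC\n+\nIIIIIIIIIIIIIIIIIIIIIII\n"
    "@read2\nACGTACGTACGTACGTACGTACG\n+\nIIIIIIIIIIIIIIIIIIIIIII\n"
)
print(adapter_fraction(example))
```

For a real file you would pass an open file object instead of the `io.StringIO` example, for instance one opened with `gzip.open(path, "rt")` for compressed data.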

2) I am assuming that when you say the "reads overlap" you mean the two reads of a pair. Please define A) overlapping, B) how you identified how much the reads overlap and C) why overlapping is important to you.

ADD COMMENT • link 7.0 years ago by BioinfGuru ★ 2.1k

0

I mean that when the insert size is small, the two reads of a pair overlap, that is, they share bases. I am not asking about a specific problem in this post; I am still learning the ABCs and trying to make sure I understand correctly. So when the overlap is long, the insert size is small and the pair covers less of the genome. I would like to know how people usually think about small insert sizes and long overlaps. Thanks so much.

ADD REPLY • link 7.0 years ago by Dayna ▴ 50

1

Don't worry about it. Really. Just get on with FastQC, trimming and assembly. Those steps will identify if you have any problems.

ADD REPLY • link 7.0 years ago by BioinfGuru ★ 2.1k