Question

paired-end read: background + practical question - Strainphlan pipeline

7.0 years ago

CAnna ▴ 20

Hi,

I have a few background questions about paired-end reads, and some related practical questions about handling paired-end reads with the (Strainphlan) pipeline. The data I'm using are metagenomic shotgun sequencing data from the Human Microbiome Project.

My understanding of paired-end reads is that when sequencing, we get one read starting from one end of the fragment and one starting from the other end, the purpose being to get better coverage of the sequence.

Shouldn't there be approximately the same number of reads in both files? For most samples, I see a huge difference in file size. Typically the sample.1.fastq contains 8 million reads while the sample.2.fastq has only 2 million. Why such a big difference?
For genome assembly I get that it is helpful. In my case I want to use this data to identify species and strains in those metagenomic samples. I do not have an intuition about how important it is to get the second read (are there so many errors?).

The reason for this second question is that the pipeline I am using does not handle paired-end reads. Here is what the help says: "MetaPhlAn 2 can also natively handle paired-end metagenomes (but does not use the paired-end information)"

What does "it handles paired-end reads" mean if it does not use paired-end information?
If I run my command with --input_file sample.1.fastq,sample.2.fastq, it runs into an error because of orphan reads, but if I clear out all those orphan reads, I lose a lot of data (hence my question above).
If I run the command specifying only --input_file sample.1.fastq, I'm not sure how reliable it is to use only a single read and completely ignore the second one.

Thanks, Camille

paired-end reads • 3.1k views

updated 7.0 years ago by GenoMax 148k • written 7.0 years ago by CAnna ▴ 20

score 1 · Answer 1 · 2018-01-26

Some additional comments beyond things covered by @Macspider.

Typically the sample.1.fastq contains 8 million reads while the sample.2.fastq has only 2 million. Why such a big difference?

It is possible that the read files were trimmed independently, which is not recommended. Independent trimming can break the read order sync between the files (which seems to have happened in your case). When a read is removed from one of the files (because it fails some criterion you are using), the corresponding read must also be removed from the other file to maintain read order in R1/R2.

There is a tool called repair.sh in the BBMap suite that can fix files broken in this manner by re-syncing the reads.
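To make the idea of re-syncing concrete, here is a minimal Python sketch of what such a repair step does conceptually. It is not how repair.sh is implemented; the function names (read_fastq, read_id, resync) are made up for illustration, and it assumes plain 4-line FASTQ records whose mates share a name, optionally ending in /1 and /2. It also holds all of R2 in memory, so for files with millions of reads a dedicated tool is the better choice.

```python
from typing import Dict, Iterator, List, TextIO, Tuple

# One FASTQ record: header, sequence, '+' line, qualities
Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes sequences are not line-wrapped)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def read_id(header: str) -> str:
    """Drop the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def resync(r1: TextIO, r2: TextIO) -> Tuple[List[Record], List[Record], List[Record]]:
    """Return (paired R1, paired R2, singletons); the paired lists share one order."""
    # Index every R2 record by its read name.
    mates: Dict[str, Record] = {read_id(rec[0]): rec for rec in read_fastq(r2)}
    paired1: List[Record] = []
    paired2: List[Record] = []
    singles: List[Record] = []
    for rec in read_fastq(r1):
        mate = mates.pop(read_id(rec[0]), None)
        if mate is None:
            singles.append(rec)  # R1 read whose mate was dropped from R2
        else:
            paired1.append(rec)
            paired2.append(mate)
    singles.extend(mates.values())  # R2 reads whose mate was dropped from R1
    return paired1, paired2, singles
```

The singletons are kept rather than discarded, so reads that lost their mate can still be used as unpaired input.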

For genome assembly I get that it is helpful. In my case I want to use this data to identify species and strains in those metagenomic samples. I do not have an intuition about how important it is to get the second read (are there so many errors?)

If you are purely looking for identification then using a single read should be enough.

(but does not use the paired-end information)

Since you don't have a reference to align to, the spatial information provided by paired-end reads (how far apart the two reads are with respect to a reference, which gives you the size of the fragment being sequenced) is not useful in your case.

I'm not sure how reliable it is to use only a single read and completely ignore the second one.

If you had only done single-end sequencing, would that make the single-end reads unreliable? It would not.

score 0 · Answer 2 · 2018-01-26

we get one read starting from one end of the fragment

Yes.

the purpose being to have a better coverage of the sequence.

Not only that. The purpose is also to obtain the fragment length, which constrains the mapping distance: you know that the two reads have to map at a distance on the genome that is compatible with the length of the fragment, which makes the mapping more accurate. You can also use them to link or scaffold contigs.
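As a rough illustration of that distance constraint, here is a small Python sketch. The function names and the 200-500 bp window are assumptions chosen for the example, not values from any particular aligner; positions are taken to be 0-based leftmost alignment coordinates on the same reference sequence.

```python
def implied_fragment_length(start1: int, len1: int, start2: int, len2: int) -> int:
    """Span covered by both mates, from the leftmost to the rightmost aligned base."""
    left = min(start1, start2)
    right = max(start1 + len1, start2 + len2)
    return right - left


def is_concordant(start1: int, len1: int, start2: int, len2: int,
                  min_frag: int, max_frag: int) -> bool:
    """True if the implied fragment length falls inside the expected window."""
    frag = implied_fragment_length(start1, len1, start2, len2)
    return min_frag <= frag <= max_frag


# Two 100 bp mates mapping 250 bp apart imply a 350 bp fragment.
print(implied_fragment_length(1000, 100, 1250, 100))
# The same mates mapping 50 kb apart are not compatible with a 200-500 bp library.
print(is_concordant(1000, 100, 51250, 100, min_frag=200, max_frag=500))
```

A pair that fails this kind of check is a hint that at least one of the two mates is mapped to the wrong place.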

Shouldn't there be approximately the same number of reads in both files?

There should be the EXACT same number in both files. If the reads have not been quality trimmed, they should also have the same length.
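A quick way to check this is to count the records in each file. The sketch below is only an illustration: count_reads and check_pair_counts are hypothetical helper names, and it assumes uncompressed FASTQ with exactly 4 lines per record and no trailing blank lines.

```python
from typing import Iterable, Tuple


def count_reads(lines: Iterable[str]) -> int:
    """Count FASTQ records in a stream of lines, assuming 4 lines per record."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4; file may be truncated")
    return n_lines // 4


def check_pair_counts(r1_lines: Iterable[str],
                      r2_lines: Iterable[str]) -> Tuple[int, int, bool]:
    """Return (reads in R1, reads in R2, whether the two counts match)."""
    n1 = count_reads(r1_lines)
    n2 = count_reads(r2_lines)
    return n1, n2, n1 == n2


# Example usage with the file names from the question:
# with open("sample.1.fastq") as r1, open("sample.2.fastq") as r2:
#     print(check_pair_counts(r1, r2))
```

Matching counts are necessary but not sufficient: the reads must also appear in the same order in both files, which requires comparing read names.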

I do not have an intuition about how important it is to get the second read

See my second reply above.

What does "it handles paired-end reads" mean if it does not use paired-end information?

I suppose it means that it can map multiple reads from multiple files but doesn't try to map mates close to each other, as a paired-end-aware mapper would.

If I run my command with --input_file sample.1.fastq,sample.2.fastq, it runs into an error

I suppose there is an option to specify a second input file (as for paired-end reads). The way you are running it, you're mapping read 1 and read 2 both as read 1, so there is no read 2 (hence the orphans).

I'm not sure how reliable it is to use only a single read and completely ignore the second one

Indeed, that is not the best thing to do.