Question

join non-overlapping paired-end reads

1

Entering edit mode

11.2 years ago

Daniel ★ 4.0k

I have a batch of illumina paired end reads with a shorter than anticipated reverse sequence for a ~550bp amplicon.

The joining tools which I have used before (i.e. USEARCH, ea-utils) talk about merging the overlaps and discarding other sequences but is there a way to join forwards and reverses leaving a gapped interval?

Expected output:

>seq1
ATGCAGTCGATCGATCGACTGCATGCATCGATCGAATCGATCG--------------TGCAGTCGATCGTACG
>seq2
CAGTCTAGCTACGATCGATCGACTGCATACGTACGTACG-------------------CTGCATGCATCGTAG
>seq3
CTAGACTGATCGCTAGCCTAGCTACGATCGATCGACACTGCTGTCGACTGACTGATCGATCGATCGACTGCAT

Thanks

paired-end illumina • 9.3k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 11.2 years ago by Daniel ★ 4.0k

1

Entering edit mode

You might consider checking out FLASH. It is really fast and has many options.

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks, this looks good.

ADD REPLY • link 11.2 years ago by Daniel ★ 4.0k

0

Entering edit mode

Be aware that FLASH actually has a fairly high false positive rate.. It is still pretty useful though! However, i thought FLASH also can only concat mates which actually overlap?!?!

ADD REPLY • link 11.1 years ago by Phil S. ▴ 700

1

Entering edit mode

You're correct regarding requiring overlap. Nothing could accurately work without it unless the reads have already been mapped. FYI, it doesn't concatenate sequences, but merges them. Actually concatenating them wouldn't be very useful.

ADD REPLY • link 11.1 years ago by Devon Ryan 105k

0

Entering edit mode

Is there any tool to merge (i.e concatenate R1 and R2)the aligned reads ? This is particularly helpful in population genomics if you do not want linked SNPs i.e people need one snp per loci. If Read 1 has a SNP, they do not want to consider the SNP from Read 2.

ADD REPLY • link 10.6 years ago by GouthamAtla 12k

0

Entering edit mode

It would be simpler to just use a variant caller that won't do weird things in those cases. For example, samtools mpileup would set the phred score of the lower of those base calls to 0, which could then be trivially filtered.

ADD REPLY • link 10.6 years ago by Devon Ryan 105k

0

Entering edit mode

To directly answer you question, probably though I don't know of one off-hand. In any case, that'd be simple to write.

ADD REPLY • link 10.6 years ago by Devon Ryan 105k

0

Entering edit mode

I am trying in Pysam. It is giving some wired error discussed here Mate Pairs In Pysam

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.5 years ago by GouthamAtla 12k

0

Entering edit mode

Presumably the mate really isn't in the BAM file. Try using grep with samtools to see.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Its working. For some reasons, my read names have _1 and _2 which makes them different according to parser.

ADD REPLY • link 10.5 years ago by GouthamAtla 12k

0

Entering edit mode

That's not the best parser. Mates don't need to have the same name, so someone didn't think that out properly.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Am I correct in assuming that you've already aligned the reads?

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by Devon Ryan 105k

0

Entering edit mode

At this point, no. Are you saying I should align the forwards and reverses against a reference first before attempting to join?

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by Daniel ★ 4.0k

0

Entering edit mode

If they don't have significant overlap, then yes (it'd be impossible to accurately judge the distance between them otherwise).

ADD REPLY • link 11.2 years ago by Devon Ryan 105k

0

Entering edit mode

Better yet, explain what purpose is this endeavour supposed to satisfy?

ADD REPLY • link 11.0 years ago by Adrian Pelin ★ 2.7k

0

Entering edit mode

Were you able to determine a good method for doing this? I am also interesting in merging non-overlapping reads which are from a friend's sequencing run in which they used primers targeting a 643 bp fragment which unfortunately is slightly too large to produce an overlap during a MiSeq run.

ADD REPLY • link 9.4 years ago by Longshotx ▴ 70

Ram · Answer 1 · 2014-08-24

I think the direct answer to your question is that, no, you cannot merge paired-end reads that do not overlap. However, you could attempt to merge all pairs (using tools already noted) and then leave the non-merged pairs as pairs. Then, align the merged reads and the paired-ends independently and combine the output. I am not sure how merging will impact your results relative to not merging.