I have a batch of illumina paired end reads with a shorter than anticipated reverse sequence for a ~550bp amplicon.
The joining tools which I have used before (i.e. USEARCH, ea-utils) talk about merging the overlaps and discarding other sequences but is there a way to join forwards and reverses leaving a gapped interval?
Expected output:
>seq1
ATGCAGTCGATCGATCGACTGCATGCATCGATCGAATCGATCG--------------TGCAGTCGATCGTACG
>seq2
CAGTCTAGCTACGATCGATCGACTGCATACGTACGTACG-------------------CTGCATGCATCGTAG
>seq3
CTAGACTGATCGCTAGCCTAGCTACGATCGATCGACACTGCTGTCGACTGACTGATCGATCGATCGACTGCAT
Thanks
You might consider checking out FLASH. It is really fast and has many options.
Thanks, this looks good.
Be aware that FLASH actually has a fairly high false positive rate.. It is still pretty useful though! However, i thought FLASH also can only concat mates which actually overlap?!?!
You're correct regarding requiring overlap. Nothing could accurately work without it unless the reads have already been mapped. FYI, it doesn't concatenate sequences, but merges them. Actually concatenating them wouldn't be very useful.
Is there any tool to merge (i.e concatenate R1 and R2)the aligned reads ? This is particularly helpful in population genomics if you do not want linked SNPs i.e people need one snp per loci. If Read 1 has a SNP, they do not want to consider the SNP from Read 2.
It would be simpler to just use a variant caller that won't do weird things in those cases. For example, samtools mpileup would set the phred score of the lower of those base calls to 0, which could then be trivially filtered.
To directly answer you question, probably though I don't know of one off-hand. In any case, that'd be simple to write.
I am trying in Pysam. It is giving some wired error discussed here Mate Pairs In Pysam
Presumably the mate really isn't in the BAM file. Try using grep with samtools to see.
Its working. For some reasons, my read names have
_1
and_2
which makes them different according to parser.That's not the best parser. Mates don't need to have the same name, so someone didn't think that out properly.
Am I correct in assuming that you've already aligned the reads?
At this point, no. Are you saying I should align the forwards and reverses against a reference first before attempting to join?
If they don't have significant overlap, then yes (it'd be impossible to accurately judge the distance between them otherwise).
Better yet, explain what purpose is this endeavour supposed to satisfy?
Were you able to determine a good method for doing this? I am also interesting in merging non-overlapping reads which are from a friend's sequencing run in which they used primers targeting a 643 bp fragment which unfortunately is slightly too large to produce an overlap during a MiSeq run.