2.5 years ago
sunnykevin97
Hi,
I find a lot of duplicates in my SAM file. How do I remove them? Picard and samtools were unable to remove them. Is there an AWK command that would?
I found one command, but every time I have to give it a single duplicated entry (671 here), and there are more duplicates in the SAM file. How do I automate the process?
awk 'BEGIN { i = 0; } /^@/ { if (/671/) { if (i++ < 1) { print; } } else { print } } /^[^@]/ { print }' AvA_.sam > AvA_fixed_.sam
Suggestions.
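One way the command above could be generalised, as a minimal sketch only: it assumes the duplicates are exact repeats of header lines (lines starting with `@`), which is what the original command targets. The function name `dedup_sam_header` is hypothetical. Instead of matching one hard-coded entry, it remembers every header line it has already printed and skips repeats, while passing all alignment records through unchanged.

```shell
# Hypothetical helper: keep the first copy of each header line,
# print every alignment record untouched.
dedup_sam_header() {
  awk '
    /^@/ { if (seen[$0]++) next }  # header line already printed: skip it
    { print }                      # first-time headers and all alignments
  ' "$1"
}

# Usage with the file names from the command above:
# dedup_sam_header AvA_.sam > AvA_fixed_.sam
```

Header lines that differ in any field (for example the same sequence name with a different length) are not treated as duplicates by this sketch and would still need checking by hand.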
My 2p - unless you are experienced, don't use sed or awk for manipulating vcf or sam files. There's almost certainly a more robust and less risky package that you could use.
Any suggestions? I'm unable to find any such packages that do the job. Do you have anything in mind?
Try samtools markdup; its -r option removes duplicates instead of only marking them. You don't need to remove duplicates, though. If you mark duplicates, that is enough for downstream tools. bamUtil has a dedup option. Try that too.
Bamutil works fine.
What counts as a duplicate for you? The usual way to drop reads carrying the duplicate FLAG would be
samtools view -F 1024 in.bam
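To make the `-F 1024` filter concrete: 1024 (0x400) is the bit in the SAM FLAG field that marks a read as a duplicate, and `-F` excludes any read with that bit set. The sketch below only illustrates the bit test; the helper name `is_duplicate_flag` is hypothetical and not part of samtools.

```shell
# Hypothetical helper: succeeds when the duplicate bit (1024 = 0x400)
# is set in the given SAM FLAG value.
is_duplicate_flag() {
  [ $(( $1 & 1024 )) -ne 0 ]
}

# 1123 = 1024 + 99, so the duplicate bit is set; 99 alone has it unset.
for flag in 99 1123; do
  if is_duplicate_flag "$flag"; then
    echo "$flag: marked duplicate, dropped by -F 1024"
  else
    echo "$flag: not marked, kept by -F 1024"
  fi
done
```

This only helps if something has already marked the duplicates; reads that were never marked do not have the bit set and pass through the filter.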
It might be worth looking into the BBMap package; there is a sub-program called dedupe.sh, or you can even get there by using BBDuk, I assume.
dedupe.sh only removes duplicates from fasta or fastq files. Neither program can remove duplicates from SAM files.
that's true indeed. my bad :/
(and yes, from sam to fastq, dedupe is too much work-around )
All the SAM files were generated using BWA-MEM.
After de novo assembly, I chose the assembled contigs as the reference and mapped the trimmed fastq reads to them.
I found a lot of duplicates only from the Velvet and the Abyss contigs, not from Spades.
My overall objective is to construct a meta assembly by combining all the assemblies into one. That's why I'm generating SAM and then BAM files to feed into gam-ngs, which generates the final meta assembly.
I'm generating a meta assembly because I have a fragmented genome assembly and I'd like to improve the continuity of the scaffolds.
With a fragmented assembly, annotating the genome is troublesome, and it fails.