How to remove duplicates in a SAM file using AWK?
2.5 years ago
sunnykevin97 ▴ 990

Hi,

I find a lot of duplicates in my SAM file. How do I remove them? Picard and samtools are unable to remove them, so is there an AWK command that can?

I found one command, but every time I have to give it a specific duplicate entry (671), and there are more duplicates in the SAM file. How do I automate the process?

awk 'BEGIN { i = 0; } /^@/ { if (/671/) { if (i++ < 1) { print; } } else { print } } /^[^@]/ { print }' AvA_.sam > AvA_fixed_.sam
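The command above only touches header lines (those starting with @) that match 671, so one way to automate it is to track every header entry already seen instead of hard-coding one. A minimal sketch, assuming the duplicates are repeated @SQ header lines that share the same SN: field; `dedup_sq_headers` is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: drop repeated @SQ header lines, keeping the first
# occurrence of each sequence name (the SN: field). Alignment records and
# all other header lines pass through unchanged.
dedup_sq_headers() {
    awk -F'\t' '
        /^@SQ/ {
            name = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) name = $i
            if (name != "" && seen[name]++) next   # this SN was already printed
        }
        { print }
    ' "$@"
}

# Usage (file names taken from the question):
# dedup_sq_headers AvA_.sam > AvA_fixed_.sam
```

This only cleans the header. If two different contigs really share a name, the alignments that refer to that name stay ambiguous, so renaming the contigs before mapping is probably the safer fix.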

Suggestions.

protein genome gene • 2.2k views

My 2p - unless you are experienced, don't use sed or awk for manipulating vcf or sam files. There's almost certainly a more robust and less risky package that you could use.


Any suggestions? I'm unable to find any packages that do the job. Do you have anything in mind?


samtools markdup with the -r option. You don't need to remove duplicates: if you mark them, that is enough for downstream tools. BamUtil has a dedup option. Try that too.


BamUtil works fine.


What's a duplicate for you? A normal way to remove reads that have the duplicate FLAG set would be samtools view -F 1024 in.bam
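Since the question asks for AWK, the same FLAG filter can be sketched on a plain-text SAM file. This assumes duplicates have already been marked by some tool (bit 1024 set in column 2); it does not detect duplicates itself, and `drop_flagged_duplicates` is a hypothetical helper name:

```shell
# Hypothetical helper: keep header lines and every alignment whose FLAG
# (column 2) does not have the duplicate bit (1024, 0x400) set.
drop_flagged_duplicates() {
    awk -F'\t' '
        /^@/ { print; next }                  # header lines pass through
        int($2 / 1024) % 2 == 0 { print }     # duplicate bit not set
    ' "$@"
}

# Usage (file names taken from the question):
# drop_flagged_duplicates AvA_.sam > AvA_fixed_.sam
```

The integer division stands in for a bitwise AND, which plain POSIX awk lacks; gawk users could write and($2, 1024) instead.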


It might be worth looking into the BBMap package; there is a sub-program called dedupe.sh, or you can even get there by using BBDuk, I assume.


dedupe.sh only removes duplicates from FASTA or FASTQ files. Neither program can remove duplicates from SAM files.


that's true indeed. my bad :/

(and yes, going from SAM to FASTQ just to run dedupe is too much of a workaround)


All the SAM files were generated using BWA-MEM.

After de novo assembly, I chose the assembled contigs as the reference and mapped the trimmed fastq reads to them.

I found a lot of duplicates only with the Velvet and ABySS contigs, not with SPAdes.

My overall objective is to construct a meta-assembly by combining all the assemblies into one. That's why I'm generating a SAM-->BAM file to feed into gam-ngs, which generates the final meta-assembly.

I'm generating a meta-assembly because I have a fragmented genome assembly and I'd like to improve the continuity of the scaffolds.

With a fragmented assembly, annotating the genome is so troublesome that it fails.
