Remove duplicates in a FASTA file based on sequence
23 months ago
martta95 ▴ 10

Hello,

I would like to remove duplicates from a FASTA file based on the sequence, not the header. The file is large.

For example:

>A01968:16:HJM3MDSX3:1:1101:7654:1125 1:N:0:ATCACG
GCGTCTGTAGTCCAACGGTTAGGATAATTGCCTTCC
>A01968:16:HJM3MDSX3:1:1101:31096:1141 1:N:0:ATCACG
CTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT
>A01968:16:HJM3MDSX3:1:1101:27552:1204 1:N:0:ATCACG
CTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT
>A01968:16:HJM3MDSX3:1:1101:29830:1297 1:N:0:ATCACG
CTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT
>A01968:16:HJM3MDSX3:1:1101:6017:1329 1:N:0:ATCACG
ACGGGGCATTGTAAGTGAGATCGGAAGAGCCACGTC

and I would like to obtain a file containing only:

>A01968:16:HJM3MDSX3:1:1101:7654:1125 1:N:0:ATCACG
GCGTCTGTAGTCCAACGGTTAGGATAATTGCCTTCC
>A01968:16:HJM3MDSX3:1:1101:31096:1141 1:N:0:ATCACG
CTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT
>A01968:16:HJM3MDSX3:1:1101:6017:1329 1:N:0:ATCACG
ACGGGGCATTGTAAGTGAGATCGGAAGAGCCACGTC
23 months ago
GenoMax 148k

Use clumpify.sh from the BBMap suite (see the thread "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files"). It can also remove duplicates.

It works with FASTA files too. You will need to adjust the memory allocation (the -Xmx parameter) depending on the size of your input file.

 clumpify.sh -Xmx8g in=input.fa out=deduped.fa dedupe
23 months ago
madalton ▴ 10

I like seqkit for basic fasta/q manipulation.

seqkit rmdup --by-seq -o deduped.fa your_file.fa

You can set the number of threads with -j.
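
If neither tool is at hand, the same idea (keep the first record for each distinct sequence) can be sketched in plain Python. This is only an illustration under a few assumptions, not how seqkit or clumpify work internally: the function name `dedupe_fasta` is made up for this example, the comparison is case-sensitive, and a 16-byte BLAKE2 digest is stored per sequence instead of the sequence itself to keep memory down on a large file (a digest collision is theoretically possible but extremely unlikely).

```python
import hashlib
import io


def dedupe_fasta(in_handle, out_handle):
    """Copy FASTA records, keeping only the first record for each distinct sequence.

    Hypothetical helper for illustration. Returns (kept, dropped).
    """
    seen = set()          # digests of sequences already written
    kept = dropped = 0

    def flush(header, chunks):
        nonlocal kept, dropped
        if header is None:  # nothing read yet
            return
        seq = "".join(chunks)
        # Assumption: a 16-byte digest stands in for the full sequence.
        key = hashlib.blake2b(seq.encode(), digest_size=16).digest()
        if key in seen:
            dropped += 1
        else:
            seen.add(key)
            out_handle.write(f"{header}\n{seq}\n")
            kept += 1

    header, chunks = None, []
    for line in in_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            flush(header, chunks)   # finish the previous record
            header, chunks = line, []
        elif line:
            chunks.append(line.strip())  # sequences may span several lines
    flush(header, chunks)           # finish the last record
    return kept, dropped


# Small demo using the sequences from the question (headers shortened).
demo = io.StringIO(
    ">r1\nGCGTCTGTAGTCCAACGGTTAGGATAATTGCCTTCC\n"
    ">r2\nCTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT\n"
    ">r3\nCTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT\n"
    ">r4\nCTCAGTTTTGTAGTAGGACTCCCACTCTGACATGTT\n"
    ">r5\nACGGGGCATTGTAAGTGAGATCGGAAGAGCCACGTC\n"
)
result = io.StringIO()
kept, dropped = dedupe_fasta(demo, result)
print(f"kept={kept} dropped={dropped}")  # kept=3 dropped=2
```

For a real file you would pass open file handles instead of the `StringIO` objects. Note that multi-line sequences are written back on a single line.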
