Remove/Delete unique reads from a DNA fasta file
5.9 years ago
saanasum ▴ 10

I have millions of FASTA files, each containing many DNA sequences.

I want to batch-remove the reads (DNA sequences) that are unique within each FASTA file, keeping only the reads that occur at least twice in that file.

Does anybody know a command-line solution or a program/script that does this?

Many thanks!

fasta unique reads remove DNA • 2.7k views

seqkit common is what you are looking for.


This looks like an excellent solution for finding sequences shared between two or more files (I tried it). However, I want to extract sequences that occur at least twice within ONE file containing many FASTA sequences (or, even better, remove the unique sequences from that ONE file). This then needs to be done for millions of FASTA files.


However, seqkit might be the right choice after all, using the rmdup command with the -d parameter (see seqkit rmdup).

  • Create a for loop in bash over your files
  • In this loop, call a Python script with one of your files as its argument
  • In this Python script, create a dictionary that you fill with each sequence as a key
  • For each sequence, if it is already a key in your dictionary, output the sequence (it is a duplicate)
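
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested pipeline: the function names `read_fasta` and `keep_duplicated` and the script name in the usage comment are made up for this example. It also makes two assumptions that go beyond the list: sequences are compared case-insensitively, and every copy of a repeated sequence is kept (including the first one), whereas the dictionary approach as described would print only the second and later occurrences.

```python
import sys
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_duplicated(records):
    """Return the records whose sequence occurs at least twice.

    Assumption: comparison ignores case, and all copies are kept.
    """
    records = list(records)
    counts = Counter(seq.upper() for _, seq in records)
    return [(h, s) for h, s in records if counts[s.upper()] >= 2]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python keep_duplicated.py reads.fasta > reads.dups.fasta
    with open(sys.argv[1]) as fh:
        for header, seq in keep_duplicated(read_fasta(fh)):
            print(f">{header}\n{seq}")
```

The script holds one file's records in memory at a time, which should be acceptable when each file is processed separately from the bash loop.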
5.9 years ago
saanasum ▴ 10

Thanks @finswimmer for the suggestion of using seqkit.

This is possible with seqkit rmdup. With -d, you can specify a file that receives the duplicated reads. A for loop in bash should then automate this for millions of FASTA files.


Isn't this the opposite thing to what you asked?

You asked to keep only reads that occur at least twice. This will remove all the duplicates instead... perhaps I'm missing something.

-d, --dup-seqs-file string   file to save duplicated seqs.

As I read this, the duplicated reads are removed from the output and written to the file named with -d.


Ah yep, I see. I knew it was too late in the evening...


Exactly: with -d, a new file containing only duplicate reads is generated, so the unique reads are absent from that file and I can continue working with it. As a batch in bash (the filename is quoted, and the deduplicated output on stdout is discarded because only the -d file is needed):

for i in *.fasta; do seqkit rmdup -s -i -m -d "${i}_unique-reads-removed" "$i" > /dev/null; done
5.9 years ago
GenoMax 147k

dedupe.sh from BBMap suite should also work. outd= will collect duplicated sequences.
