How to find identical sequences in a genome FASTA file (using Python or any other program)?
6 months ago
Sony ▴ 20

Hello everyone,

I have a genome FASTA file that contains 16,941 sequences. Here is an example from my "genome.fasta":

>scf7180000026027
GAATGCATACTGCATCGATA

>scf7180000026028
CATAAAACGTCTCCATCGCT

>scf7180000026029
TGCCCAAGTTGTGAAGTGTC

>scf7180000026030
TGCCCAAGTTGTGAAGTGTC

I want to find identical sequences in this genome FASTA file and return their IDs. My ultimate goal is to find and remove any identical sequences present in the file.

Thank you for any suggestions.

fasta • 368 views
6 months ago
GenoMax 147k

"My ultimate goal is to find and remove any identical sequences present in my genome FASTA file."

You can use clumpify.sh from the BBMap suite for this (see the thread "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). It also accepts FASTA-format sequences.

clumpify.sh -Xmx10g in=your_file.fa out=deduped_file.fa dedupe subs=0

subs=0 keeps only exact matches as duplicates. Increase that number to allow mismatches.

You can use addcopies to mark the headers with the number of copies found, like so:

>scf7180000026027
GAATGCATACTGCATCGATA
>scf7180000026029 copies=3
TGCCCAAGTTGTGAAGTGTC
>scf7180000026028
CATAAAACGTCTCCATCGCT
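
Since the question also asks for a Python route and for the IDs of the identical sequences, here is a minimal sketch using only the standard library. It is not how clumpify.sh works internally; it just groups records by their exact sequence string. The function names (read_fasta, group_identical, duplicate_ids, write_deduped) and the output file name are made up for illustration. It assumes the whole file fits in memory, treats upper and lower case as the same, and compares sequences only as written (reverse complements are not considered).

```python
from collections import defaultdict


def read_fasta(path):
    """Yield (id, sequence) pairs; multi-line sequences are joined, blank lines skipped."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # Keep only the first word of the header as the id.
                seq_id, chunks = (line[1:].split() or [""])[0], []
            else:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def group_identical(records):
    """Map each distinct sequence (upper-cased) to the list of ids that carry it."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        groups[seq.upper()].append(seq_id)
    return groups


def duplicate_ids(groups):
    """Return the id lists of sequences that occur more than once."""
    return [ids for ids in groups.values() if len(ids) > 1]


def write_deduped(groups, out_path):
    """Write one record per distinct sequence, keeping the first id seen."""
    with open(out_path, "w") as out:
        for seq, ids in groups.items():
            out.write(f">{ids[0]}\n{seq}\n")
```

To use it, call groups = group_identical(read_fasta("genome.fasta")), print each list returned by duplicate_ids(groups) to see which IDs share a sequence, and then call write_deduped(groups, "deduped.fasta") to write the file without duplicates. With the four example records from the question, the only duplicate group would be scf7180000026029 and scf7180000026030.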