MIRA: Assembly from fasta file with identical names
7.3 years ago
Tamandua • 0

I have a FASTA file that includes sequences with identical names. I need to assemble contigs from these sequences using MIRA. However, MIRA seems to have a problem with identical sequence names. Is there a way to solve this problem?

Are these sequences (with identical names) also identical in sequence? If so, you can use fastx_collapser from fastx-toolkit or rmdup from seqkit (with -s to compare by sequence). If not, append a unique string to the identical IDs.
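
If it is not obvious which case applies, something like the following could help decide. It is only a sketch using standard awk, sort, cut and uniq; the file names dup.fa and id_counts.txt are made up for the example. It counts how many distinct sequences share each ID.

```shell
# Toy input (hypothetical file name): three records named gene1,
# two of them carrying the same sequence (one is line-wrapped).
printf '>gene1\nATGC\nGGG\n>gene1\nATGCGGG\n>gene1\nTAGCTGT\n' > dup.fa

# 1) Join wrapped lines so each record becomes "ID<TAB>sequence".
# 2) sort -u drops records that repeat both ID and sequence.
# 3) Count what is left per ID and print "ID<TAB>number of distinct sequences".
awk '/^>/ { if (have) print id "\t" seq; id = substr($1, 2); seq = ""; have = 1; next }
     { seq = seq $0 }
     END { if (have) print id "\t" seq }' dup.fa |
  sort -u | cut -f1 | sort | uniq -c |
  awk '{ print $2 "\t" $1 }' > id_counts.txt

cat id_counts.txt
```

An ID with a count of 1 only has exact duplicates, so collapsing is enough. A count above 1 means the same name is attached to different sequences, and those records need new IDs.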

No, they are not identical. I used

blastn -db reference.fasta -query input.fasta -evalue 1e-60 -outfmt '6 qseqid qseq' \
    | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/-/, "", $2); print ">"$1,$2}' \
    > output.fastq

to get the best matching hits. Maybe there is a better way of doing this? And how can I rename identical ids?

Example FASTA with identical headers:

$ cat test.fa
>gene1
ATGCGGG
>gene1
TAGCTGT

Rename identical fasta ids:

$ seqkit rename test.fa

Output:

>gene1
ATGCGGG
>gene1_2 gene1
TAGCTGT

Download seqkit from its project page. The output can be written to a .fa or .fa.gz file with the -o option.
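
If installing seqkit is not an option, roughly the same renaming can be sketched in plain awk. This is not seqkit's implementation, just an approximation: the first occurrence of an ID is kept, later ones get _2, _3, ... appended to the ID, and the rest of the header line is left as it was. The output name renamed.fa is made up for the example.

```shell
# Same two records as test.fa above.
printf '>gene1\nATGCGGG\n>gene1\nTAGCTGT\n' > test.fa

# On header lines, take the ID (first word without ">") and count its occurrences.
# From the second occurrence on, rebuild the header as ">ID_n" plus whatever
# followed the ID on the original line.
awk '/^>/ {
       id = substr($1, 2)
       n = ++seen[id]
       if (n > 1) $0 = ">" id "_" n substr($0, length($1) + 1)
     }
     { print }' test.fa > renamed.fa

cat renamed.fa
```

One caveat: the sketch does not check whether a generated ID such as gene1_2 already exists elsewhere in the file, so it is worth confirming that the IDs are unique afterwards.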

Just rename all sequences. Searching for "rename fasta header" turns up dozens of ways; see two at this post. With awk:

awk '/^>/ { printf("%s_%s\n",$0,i++);next;} { print $0;}' teste.fas > out.fas

or with R:

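library(Biostrings)
# current Biostrings names; read.DNAStringSet/write.XStringSet are the old ones
fa = readDNAStringSet(...)
names(fa) = make.unique(names(fa))  # repeated names get .1, .2, ... appended
writeXStringSet(fa, ...)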
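
Whichever method is used, it is probably worth checking that the IDs really are unique before handing the file to the assembler. A minimal sketch, reusing the awk one-liner above on a made-up two-record teste.fas:

```shell
# Toy input: two records named gene1, renamed with the awk one-liner above.
printf '>gene1\nATGCGGG\n>gene1\nTAGCTGT\n' > teste.fas
awk '/^>/ { printf("%s_%s\n",$0,i++);next;} { print $0;}' teste.fas > out.fas

# Collect the ID of every header (first word, without ">") and list
# those that occur more than once. Empty output means all IDs are unique.
dups=$(awk '/^>/ { print substr($1, 2) }' out.fas | sort | uniq -d)
if [ -z "$dups" ]; then
    echo "all IDs unique"
else
    echo "duplicate IDs remain:"
    echo "$dups"
fi
```

The check matters mostly for headers that carry a description after the ID: the awk one-liner appends the counter to the end of the whole header line, so in that case the ID itself (the first word) would stay unchanged and the duplicates would still be there.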

Ah, thanks a lot! That's it!
