MIRA: Assembly from fasta file with identical names

7.3 years ago

Tamandua • 0

I have a FASTA file that includes sequences with identical names. I need to assemble contigs from these sequences using MIRA, but MIRA seems to have a problem with identical sequence names. Is there a way to solve this problem?

Mira Assembly fasta • 1.7k views

Are these sequences (with identical names) themselves identical? If so, you can collapse them with fastx_collapser from the FASTX-Toolkit or remove the duplicates with rmdup from SeqKit. If not, append a unique string to each repeated ID.
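Before renaming anything, it can help to see which IDs are actually repeated and how often. Below is a minimal sketch using plain awk; the function name `dup_ids` is made up for illustration, and it assumes the ID is the header text between `>` and the first whitespace.

```shell
# Hypothetical helper (not part of any toolkit): list each FASTA ID that
# occurs more than once, followed by a tab and its total count.
# Reads the files given as arguments, or standard input if none are given.
dup_ids() {
    awk '/^>/ {
             id = substr($1, 2)                    # header up to first whitespace, minus ">"
             if (++n[id] == 2) order[++k] = id     # remember an ID on its second sighting
         }
         END {
             for (j = 1; j <= k; j++) print order[j] "\t" n[order[j]]
         }' "$@"
}
```

Called as `dup_ids input.fasta`, it prints nothing when every ID is unique, so an empty result means the file needs no renaming.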

7.3 years ago by cpad0112 • 21k

No, they are not identical. I used

blastn -db reference.fasta -query input.fasta -evalue 1e-60 -outfmt '6 qseqid qseq' \
    | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/-/, "", $2); print ">"$1,$2}' \
    > output.fastq

to get the best-matching hits. Is there a better way of doing this? And how can I rename the identical IDs?

7.3 years ago by Tamandua • 0

Example FASTA file with identical headers:

$ cat test.fa
>gene1
ATGCGGG
>gene1
TAGCTGT

Rename identical fasta ids:

$ seqkit rename test.fa

Output:

>gene1
ATGCGGG
>gene1_2 gene1
TAGCTGT

Download SeqKit from here. The output can be written to either a .fa or a .fa.gz file with the -o option.

7.3 years ago by cpad0112 • 21k

Just rename all the sequences; searching for "rename fasta header" turns up dozens of ways (see two at this post). With awk, which appends a running number to every header:

awk '/^>/ { printf("%s_%s\n",$0,i++);next;} { print $0;}' teste.fas > out.fas

or with R:

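library(Biostrings)
# readDNAStringSet/writeXStringSet replace the older read.DNAStringSet/write.XStringSet
fa = readDNAStringSet(...)
# make.unique() appends .1, .2, ... to repeated names
names(fa) = make.unique(names(fa))
writeXStringSet(fa, ...)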
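
If you would rather leave the first occurrence of each name untouched and only tag the repeats (similar in spirit to make.unique), something like the following awk sketch should work. The function name `rename_dups` and the `_2`, `_3`, ... suffix scheme are my own choices for illustration. It assumes the ID ends at the first whitespace in the header, and that the file does not already contain an ID such as `gene1_2` that a generated name could collide with.

```shell
# Hypothetical helper: append _2, _3, ... to the 2nd, 3rd, ... occurrence
# of a FASTA ID. Only the ID is changed; any description after the first
# whitespace is kept as it is, and sequence lines pass through untouched.
# Reads the files given as arguments, or standard input if none are given.
rename_dups() {
    awk '/^>/ {
             id = substr($1, 2)        # ID without the leading ">"
             n = ++seen[id]            # times this ID has appeared so far
             if (n > 1) {
                 # rebuild the header: new ID plus whatever followed the old one
                 print ">" id "_" n substr($0, length($1) + 1)
                 next
             }
         }
         { print }' "$@"
}
```

Usage would be along the lines of `rename_dups input.fasta > renamed.fasta`. For the two-record test.fa shown earlier in the thread, the second header becomes `>gene1_2`.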