Question

How to check Fasta file ASCII characters and fix encoding errors?

2.7 years ago

O.rka ▴ 740

I tried building a diamond database but got this error.

Error: Error reading input stream at line 180825: Invalid character (ASCII 0) in sequence

How can I fix it? Is there a tool that checks for this and either repairs or removes the fasta record?

genomics fasta fastq • 1.8k views

updated 2.7 years ago by colindaven 7.0k • written 2.7 years ago by O.rka ▴ 740

tr -d '\0' < in.fa > clean.fa

But you should be very suspicious of the file anyway; it is probably corrupted.
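
Before repairing anything, it can help to see how widespread the damage is. The following is a minimal sketch in plain Python, assuming an uncompressed FASTA file; the function name `find_bad_lines` and the set of allowed bytes (printable ASCII plus tab, LF and CR) are illustrative choices, not part of any existing tool.

```python
# Bytes considered harmless in a FASTA file: printable ASCII plus tab, LF, CR.
# This allow-list is an assumption; tighten it if your files should be stricter.
ALLOWED_BYTES = frozenset(range(32, 127)) | {9, 10, 13}


def find_bad_lines(path, allowed=ALLOWED_BYTES):
    """Yield (line_number, sorted_bad_byte_values) for each line that
    contains at least one byte outside the allowed set."""
    # Read in binary mode so NUL bytes and other junk are seen as-is.
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            bad = sorted(set(line) - allowed)
            if bad:
                yield line_number, bad


# Example use (in.fa is a placeholder file name):
# for line_number, bad in find_bad_lines("in.fa"):
#     print(line_number, bad)
```

A single affected line suggests a local glitch that is safe to strip; many affected lines, or bad bytes other than 0, suggest the file should be downloaded or generated again.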

2.7 years ago by Pierre Lindenbaum 164k

score 0 · Answer 1 · 2022-03-19

2.7 years ago

shenwei356 8.7k

Plain text:

sed -i 's/\x00//g' xxx.fasta

Gzipped:

gzip -cd xxx.fasta.gz | sed 's/\x00//g' | gzip -c > clean.fasta.gz
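
If sed is not available, or you want a count of what was removed, the same cleanup can be sketched in Python with only the standard library. The function name `strip_nul_bytes` and the rule of picking gzip by the `.gz` suffix are assumptions made for this example.

```python
import gzip


def strip_nul_bytes(src, dst, chunk_size=1 << 20):
    """Copy src to dst with all NUL (0x00) bytes removed.

    Files whose names end in .gz are read or written through gzip.
    Returns the number of NUL bytes that were dropped.
    """
    open_in = gzip.open if str(src).endswith(".gz") else open
    open_out = gzip.open if str(dst).endswith(".gz") else open
    removed = 0
    with open_in(src, "rb") as fin, open_out(dst, "wb") as fout:
        while True:
            # Stream in chunks so large files do not have to fit in memory.
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            removed += chunk.count(b"\x00")
            fout.write(chunk.replace(b"\x00", b""))
    return removed


# Example use (file names are placeholders):
# print(strip_nul_bytes("xxx.fasta.gz", "clean.fasta.gz"))
```

Writing to a new file rather than editing in place keeps the original around in case the corruption turns out to be more than stray NUL bytes.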

2.7 years ago by shenwei356 8.7k

score 0 · Answer 2 · 2022-03-19

I read the file and write it back out with Biopython to remove all the trash. Something like:

less biopy_reformat_fastq_remove_short.py

## Colin, Feb. 2018
## Remove short length sequences reported in the fastq (intended for Pacbio downsampling)


from Bio import SeqIO
import sys

good_seqs=[]
c=0

if len(sys.argv) <= 2:
        # Stop here; without both file names there is nothing to read or write
        sys.exit("Enter input and output files \neg. python biopy_reformat_fastq_remove_short.py input.fq output.fq")

# Change the minimum read length
minimum_length = 3000

for record in SeqIO.parse(sys.argv[1], "fastq"):
        if len(record.seq) >= minimum_length:
                # Keep the read, with its sequence converted to upper case
                good_seqs.append(record.upper())
        else:
                c = c + 1

print("Found short sequences: " + str(c))

with open(sys.argv[2], "w") as output:
        SeqIO.write(good_seqs, output, "fastq")
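
The script above filters FASTQ by length. For the FASTA case in the question, where the goal is to remove whole records that contain an invalid character, here is a rough sketch that uses only the standard library rather than Biopython. The names `read_fasta` and `drop_bad_records` and the allowed alphabet (ASCII letters plus `*` and `-`) are assumptions for illustration; only sequence lines are checked, not headers.

```python
# Characters accepted in sequence lines: ASCII letters plus '*' and '-'.
# This alphabet is an assumption; adjust it to your data.
VALID_SEQ_BYTES = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz*-"
)


def read_fasta(handle):
    """Yield (header, sequence_lines) tuples from a binary FASTA handle."""
    header, seq_lines = None, []
    for line in handle:
        line = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line, []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def drop_bad_records(src, dst, valid=VALID_SEQ_BYTES):
    """Write records with clean sequences to dst; return (kept, dropped)."""
    kept = dropped = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for header, seq_lines in read_fasta(fin):
            if all(set(line) <= valid for line in seq_lines):
                fout.write(header + b"\n")
                for line in seq_lines:
                    fout.write(line + b"\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped


# Example use (file names are placeholders):
# print(drop_bad_records("in.fa", "clean.fa"))
```

Dropping the record is the more cautious choice than deleting the bad byte, since a sequence with a byte removed may no longer match the original.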