How to check Fasta file ASCII characters and fix encoding errors?
2.7 years ago
O.rka ▴ 740

I tried building a DIAMOND database but got this error:

Error: Error reading input stream at line 180825: Invalid character (ASCII 0) in sequence

How can I fix it? Is there a tool that checks for this and either repairs or removes the affected FASTA record?

genomics fasta fastq • 1.8k views
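tr -d '\000' < in.fa > out.fa

This deletes the NUL bytes rather than replacing them with spaces, which would still be invalid inside a sequence. In any case, you should be very suspicious of the file: it is probably corrupted.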
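
To judge how far the corruption goes before repairing anything, it can help to list every line that contains a NUL byte. The following is a minimal sketch, not an established tool; the function name `find_nul_lines` is made up for illustration. It reads the file in binary mode so the NUL bytes are seen exactly as stored.

```python
def find_nul_lines(path):
    """Return a list of (line_number, nul_count) for lines containing NUL bytes.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    hits = []
    # Binary mode: no decoding, so NUL bytes are read exactly as stored.
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            count = line.count(b"\x00")
            if count:
                hits.append((line_number, count))
    return hits
```

Calling `find_nul_lines("in.fa")` and printing the result shows whether the problem is a single stray byte or a long damaged region. Many hits clustered together would suggest a truncated or partially overwritten file, which is better re-downloaded than patched.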

2.7 years ago

Plain text (GNU sed, where \d0 is the decimal escape for the NUL byte):

sed -i 's/[\d0]//g' xxx.fasta

Gzipped:

gzip -cd xxx.fasta.gz | sed 's/[\d0]//g' | gzip -c > clean.fasta.gz
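
If GNU sed is not available, the same clean-up can be sketched in Python using only the standard library. This is an illustrative equivalent of the commands above, not a drop-in replacement. The names `strip_nul` and `_open` are hypothetical, and the sketch assumes that gzipped files can be recognised by a `.gz` suffix.

```python
import gzip


def _open(path, mode):
    # Assumption: gzipped files are identified by their ".gz" suffix.
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def strip_nul(in_path, out_path, chunk_size=1 << 20):
    """Copy in_path to out_path without NUL bytes; return how many were removed."""
    removed = 0
    with _open(in_path, "rb") as src, _open(out_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)  # read in chunks to keep memory use flat
            if not chunk:
                break
            removed += chunk.count(b"\x00")
            dst.write(chunk.replace(b"\x00", b""))
    return removed
```

For example, `strip_nul("xxx.fasta.gz", "clean.fasta.gz")` writes a cleaned copy and returns the number of bytes dropped. Unlike `sed -i`, it never modifies the input file.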
2.7 years ago

I read the file and write it back out with Biopython to remove all the junk. Something like:

less biopy_reformat_fastq_remove_short.py

## Colin, Feb. 2018
## Remove short sequences from a FASTQ file (intended for PacBio downsampling)

import sys

from Bio import SeqIO

# Change the minimum read length here
minimum_length = 3000

# Stop early if the input and output paths are missing
if len(sys.argv) <= 2:
    sys.exit("Enter input and output files \ne.g. python biopy_reformat_fastq_remove_short.py input.fq output.fq")

good_seqs = []
c = 0

for record in SeqIO.parse(sys.argv[1], "fastq"):
    if len(record.seq) >= minimum_length:
        # Keep the read, with its sequence converted to upper case
        good_seqs.append(record.upper())
    else:
        c += 1

print("Found short sequences: " + str(c))

with open(sys.argv[2], "w") as output:
    SeqIO.write(good_seqs, output, "fastq")
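
The question also asks about removing the affected record rather than repairing it. A minimal sketch of that idea for FASTA input, using only the standard library, might look like the following. The function name `drop_records_with_nul` is hypothetical, and the sketch assumes every record starts with a line beginning with `>`.

```python
def drop_records_with_nul(in_path, out_path):
    """Copy a FASTA file, skipping every record that contains a NUL byte.

    Returns the header lines of the dropped records (NUL bytes stripped).
    """
    dropped = []

    def flush(record_lines, out):
        # Write the record only if none of its lines contains a NUL byte.
        if not record_lines:
            return
        if any(b"\x00" in line for line in record_lines):
            header = record_lines[0].replace(b"\x00", b"")
            dropped.append(header.decode("ascii", "replace").strip())
        else:
            out.writelines(record_lines)

    with open(in_path, "rb") as src, open(out_path, "wb") as out:
        record = []
        for line in src:
            if line.startswith(b">"):  # a new header closes the previous record
                flush(record, out)
                record = []
            record.append(line)
        flush(record, out)  # do not forget the last record
    return dropped
```

Calling `drop_records_with_nul("input.fa", "output.fa")` returns the headers that were removed, so they can be checked against the source data. One limitation: if a NUL byte sits directly before a `>`, that header is not recognised, and its record is merged into the previous one and dropped along with it.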