Question

Gubbins does not work

4.0 years ago

fuguchan • 0

Hello Everybody,

I'm trying to run Gubbins on the output of a mapping pipeline. I have 5 samples. For each sample, I mapped the reads to a reference FASTA file of S. pneumoniae Taiwan19F-14 and called SNPs following the GATK Best Practices. Using vcftools (vcf-consensus), I generated the consensus sequences used as Gubbins input. In addition, I identified large deletions using Pilon and masked those regions using bedtools. Finally, I obtained 5 sequences, each with the same length of 2,112,148 bp. I confirmed the lengths with a command from an older thread:

(awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fasta)

However, when I started the Gubbins analysis with

run_gubbins.py input.fasta

I got the following error message:

        Checking input files...
Error with the input FASTA file: One of the sequences contains odd characters, only ACGTNacgtn- are permitted
Each sequence must have a name and some genomic data
There input alignment file does not exist or has an invalid format

I confirmed that the sequences contain no characters other than A, T, C, G and -.

How can I solve the problem?

Gubbins mapping Bacteria • 1.1k views

updated 3.9 years ago by Ram 44k • written 4.0 years ago by fuguchan • 0

Following https://askubuntu.com/questions/593383/how-to-count-occurrences-of-each-character you could simply count the occurrences of every character (excluding the header lines) and see what comes up other than ACGTNacgtn-. For chr1 of the mm10 mouse genome, at about 195 million bp, this takes about 20 seconds on my machine.

awk '$1 !~ /^>/ {for (i=1;i<=NF;i++) a[$i]++} END{for (c in a) print c,a[c]}' FS="" input.fasta
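A variation on the same idea, in case the offending character is invisible when printed to a terminal (a carriage return or a tab, for example): the sketch below removes the permitted characters first and reports whatever is left as hex byte values with counts. The function name unexpected_chars is made up for illustration, and this is only one possible way to do it, assuming a POSIX shell with grep, tr, od, sort, uniq and awk available. Whether an invisible character is actually the cause here is not known.

```shell
# Hypothetical helper (name chosen for illustration only).
# Prints "<hex byte> <count>" for every byte in the sequence lines
# that is not one of ACGTNacgtn- ; prints nothing if all bytes are permitted.
unexpected_chars() {
    grep -v '^>' "$1" \
        | tr -d 'ACGTNacgtn\n-' \
        | od -An -v -tx1 \
        | tr -s ' ' '\n' \
        | grep -v '^$' \
        | sort | uniq -c \
        | awk '{print $2, $1}'
    # grep -v '^>'   : drop the FASTA header lines
    # tr -d ...      : delete permitted characters and the line feeds
    # od -An -v -tx1 : dump the remaining bytes as two-digit hex values
    # tr / grep      : put one hex value per line, drop empty lines
    # sort | uniq -c : count each distinct byte; awk reorders the columns
}

# Usage: unexpected_chars input.fasta
```

A line such as "0d 2" in the output would mean two carriage-return bytes (hex 0d), which usually points to Windows-style line endings. Empty output means every sequence byte is in the permitted set.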

4.0 years ago by ATpoint 85k