Hello Everybody,
I'm trying to run Gubbins on the output of a mapping pipeline. I have 5 samples. For each sample, I mapped the reads to a reference FASTA file of S. pneumoniae Taiwan19F-14 and called SNPs following the GATK best practices. Using vcftools (vcf-consensus), we generated consensus sequences to use as Gubbins input. In addition, I identified large deletions using Pilon and masked those regions using bedtools. In the end, I obtained 5 sequences of the same length, 2,112,148 bp. I confirmed this with a command from an older topic:
(awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fasta)
However, when I started the Gubbins analysis with
run_gubbins.py input.fasta
I got the following error message:
Checking input files...
Error with the input FASTA file: One of the sequences contains odd characters, only ACGTNacgtn- are permitted
Each sequence must have a name and some genomic data
There input alignment file does not exist or has an invalid format
I checked that the sequences contain no characters other than A, T, C, G and -.
How can I solve the problem?
Following https://askubuntu.com/questions/593383/how-to-count-occurrences-of-each-character you could simply count the occurrences of every character (excluding the header lines) and see what else, apart from
ACGTNacgtn-
comes up. For chr1 of the mm10 mouse genome, with about 195 million bp, this takes about 20 seconds on my machine.
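If you prefer a script over a shell one-liner, here is a minimal Python sketch of the same idea. It is not taken from the linked answer, and the function names (count_fasta_characters, unexpected_characters) are made up for illustration. It assumes the allowed set is exactly the one listed in the error message, and it reads the file without newline translation so that invisible characters such as carriage returns, spaces or tabs are counted too instead of being silently dropped.

```python
import sys
from collections import Counter

# Assumption: the allowed characters are exactly those named in the error message.
ALLOWED = set("ACGTNacgtn-")


def count_fasta_characters(lines):
    """Count every character in the sequence lines, skipping '>' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue
        # Strip only the trailing '\n' so that a stray '\r', space or tab is still counted.
        counts.update(line.rstrip("\n"))
    return counts


def unexpected_characters(counts, allowed=ALLOWED):
    """Return only the characters (and their counts) that are not in the allowed set."""
    return {ch: n for ch, n in counts.items() if ch not in allowed}


if __name__ == "__main__" and len(sys.argv) > 1:
    # newline="" keeps '\r' in the data instead of translating '\r\n' to '\n'.
    with open(sys.argv[1], newline="") as handle:
        counts = count_fasta_characters(handle)
    for ch, n in sorted(unexpected_characters(counts).items()):
        # repr() makes invisible characters such as '\r' or '\t' visible.
        print(repr(ch), n)
```

Saved under a name of your choice (say count_chars.py, hypothetical) and run as "python count_chars.py input.fasta", it prints nothing if every sequence character is in the allowed set; otherwise it prints each offending character with its count.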