Gubbins does not work
4.0 years ago
fuguchan • 0

Hello Everybody,

I'm trying to run Gubbins on the output of a mapping pipeline. I have 5 samples. For each sample, I mapped the reads to a reference FASTA file of S. pneumoniae Taiwan19F-14 and called SNPs following the GATK best practices. Using vcftools (vcf-consensus), I generated the consensus sequences to use as Gubbins input. In addition, I identified large deletions using Pilon and masked those regions using bedtools. In the end, I obtained 5 sequences, all with the same length of 2,112,148 bp. I confirmed the lengths with this command from an older thread:

(awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fasta)

However, when I started the Gubbins analysis with

run_gubbins.py input.fasta

I got the following error message:

Checking input files...
Error with the input FASTA file: One of the sequences contains odd characters, only ACGTNacgtn- are permitted
Each sequence must have a name and some genomic data
There input alignment file does not exist or has an invalid format

I confirmed that the sequences contain no characters other than A, T, C, G and -.

How can I solve the problem?

Gubbins mapping Bacteria • 1.1k views

Following https://askubuntu.com/questions/593383/how-to-count-occurrences-of-each-character you could simply count the occurrences of every character (excluding the header lines) and see what comes up besides ACGTNacgtn-. For chr1 of the mm10 mouse genome, with about 195 million bp, this takes about 20 seconds on my machine.

awk '$1 !~ /^>/ {for (i=1;i<=NF;i++) a[$i]++} END{for (c in a) print c,a[c]}' FS="" input.fasta
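One thing the count above can hide is a non-printing character, for example a carriage return from Windows-style line endings, which shows up as an apparently empty entry. A minimal sketch that prints only the characters outside the permitted set in a visible form is below. The function name disallowed_chars and the demo file are made up for illustration, and the permitted set is taken from the error message, not from the Gubbins source.

```shell
# Hypothetical helper: show every byte in the sequence lines that is not in
# the set named in the Gubbins error message (ACGTNacgtn-).
disallowed_chars() {
    # Drop header lines, delete permitted characters and newlines,
    # then let od render whatever is left (e.g. a carriage return as \r).
    grep -v '^>' "$1" | LC_ALL=C tr -d 'ACGTNacgtn\n-' | od -An -c
}

# Demo on a made-up two-sequence file: the first sequence line ends with a
# carriage return, the second contains an IUPAC ambiguity code (R).
demo=$(mktemp)
printf '>s1\nACGT-N\r\n>s2\nACGTRN\n' > "$demo"
disallowed_chars "$demo"
```

If the output is empty, every sequence character is in the permitted set and the problem is likely elsewhere (for example in the header lines). If a carriage return does turn out to be the culprit, something like tr -d '\r' < input.fasta > cleaned.fasta should remove it.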