Can I Convert Fasta Lowercase Bases To 'N'?
3
5
Entering edit mode
13.2 years ago
User 8087 ▴ 30

I have mutliple genome consensus sequences, extracted from a BWA BAM file using the following commandline:

samtools mpileup -E -q30 -Q30 -uf REF.fa isolate1.merged.bam | bcftools view -cg - |\
 vcfutils.pl vcf2fq -d3 > isolate1.fq (which I convert to FASTA using a python script).
(Samtools-0.1.18 used)

Whether or not I invoke the -d option in vcfutils.pl (default is read-depth 3), sites with a read depth of 1 or 2 are still called, just lowercase instead! I'd rather the base at these sites was not called ie 'N'.

This means that for each bacterial genome, I have approximately 48000 sites where a base has been called with a low read depth that I'd consider to be low quality. This includes approximately 650 variant sites that could be false positive SNPs (if they are sequencing errors covered by 1 read only!) and therefore potentially skew our phylogenetic analysis.

Is it possible to convert these lowercase letters to 'N' ie an uncalled base? I can only find tools that convert lowercase to uppercase and viceversa, I can't find any way to convert these low quality letters to 'N'? Of course my FASTA file is a mixture of uppercase bases and lowercase!

Failing that, perhaps does anyone know of a way I could call the consensus sequence from either the BAM file or the VCF file I have (for all sites), where it does filter out for minimum read depth?

Any help much appreciated!

Best Wishes,

Laura

samtools mpileup conversion read • 7.6k views
ADD COMMENT
13
Entering edit mode
12.8 years ago

With sed, you can do it like that:

sed -e '/^>/! s/[[:lower:]]/N/g' in.fasta > out.fasta

/^>/! selects lines not starting with >, [[:lower:]] selects any lowercase letter, with g sed replaces globally (all occurrences).

ADD COMMENT
8
Entering edit mode
13.2 years ago
Scopak ▴ 80

Try a Perl one-liner to convert lowercase to N:

perl -e 'while(<>) { if ($_ =~ /^>.*/) { print $_; } else { $_ =~ tr/acgt/N/; print $_;}}' < in.fasta  >out.fasta

Edit: Changed to tr///.

Regards,

-scott

ADD COMMENT
1
Entering edit mode

tr/// is better than s/// here

ADD REPLY
0
Entering edit mode

Masha, I agree, tr/// is faster for simple replacements than s///.

use tr/acgt/N/ instead of s/[acgt]/N/ in the code

ADD REPLY
3
Entering edit mode
13.2 years ago

With Biopieces www.biopieces.org) you do:

read_fasta -i in.fna | transliterate_seq -s 'atcg' -r 'N' | write_fasta -o out.fna -x

ADD COMMENT

Login before adding your answer.

Traffic: 3613 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6