Question

Can I Convert Fasta Lowercase Bases To 'N'?

5

Entering edit mode

13.8 years ago

User 8087 ▴ 30

I have mutliple genome consensus sequences, extracted from a BWA BAM file using the following commandline:

samtools mpileup -E -q30 -Q30 -uf REF.fa isolate1.merged.bam | bcftools view -cg - |\
 vcfutils.pl vcf2fq -d3 > isolate1.fq (which I convert to FASTA using a python script).
(Samtools-0.1.18 used)

Whether or not I invoke the -d option in vcfutils.pl (default is read-depth 3), sites with a read depth of 1 or 2 are still called, just lowercase instead! I'd rather the base at these sites was not called ie 'N'.

This means that for each bacterial genome, I have approximately 48000 sites where a base has been called with a low read depth that I'd consider to be low quality. This includes approximately 650 variant sites that could be false positive SNPs (if they are sequencing errors covered by 1 read only!) and therefore potentially skew our phylogenetic analysis.

Is it possible to convert these lowercase letters to 'N' ie an uncalled base? I can only find tools that convert lowercase to uppercase and viceversa, I can't find any way to convert these low quality letters to 'N'? Of course my FASTA file is a mixture of uppercase bases and lowercase!

Failing that, perhaps does anyone know of a way I could call the consensus sequence from either the BAM file or the VCF file I have (for all sites), where it does filter out for minimum read depth?

Any help much appreciated!

Best Wishes,

Laura

samtools mpileup conversion read • 8.1k views

ADD COMMENT • link updated 13.4 years ago by Frédéric Mahé ★ 3.3k • written 13.8 years ago by User 8087 ▴ 30

score 13 · Answer 1 · 2012-04-06

13

Entering edit mode

13.4 years ago

Frédéric Mahé ★ 3.3k

With sed, you can do it like that:

sed -e '/^>/! s/[[:lower:]]/N/g' in.fasta > out.fasta

/^>/! selects lines not starting with >, [[:lower:]] selects any lowercase letter, with g sed replaces globally (all occurrences).

ADD COMMENT • link 13.4 years ago by Frédéric Mahé ★ 3.3k

score 8 · Answer 2 · 2011-10-31

8

Entering edit mode

13.8 years ago

Scopak ▴ 80

Try a Perl one-liner to convert lowercase to N:

perl -e 'while(<>) { if ($_ =~ /^>.*/) { print $_; } else { $_ =~ tr/acgt/N/; print $_;}}' < in.fasta  >out.fasta

Edit: Changed to tr///.

Regards,

-scott

ADD COMMENT • link 13.8 years ago by Scopak ▴ 80

1

Entering edit mode

tr/// is better than s/// here

ADD REPLY • link 13.8 years ago by Martin A Hansen 3.0k

0

Entering edit mode

Masha, I agree, tr/// is faster for simple replacements than s///.

use tr/acgt/N/ instead of s/[acgt]/N/ in the code

ADD REPLY • link 13.8 years ago by Scopak ▴ 80

score 3 · Answer 3 · 2011-10-31

3

Entering edit mode

13.8 years ago

Martin A Hansen 3.0k

With Biopieces www.biopieces.org) you do:

read_fasta -i in.fna | transliterate_seq -s 'atcg' -r 'N' | write_fasta -o out.fna -x

ADD COMMENT • link 13.8 years ago by Martin A Hansen 3.0k