Question

Non-standard nucleotides in my fasta converted from a bam

0

9.9 years ago

biogirl ▴ 210

Hi,

I have converted my sorted, aligned BAM file (paired-end short reads aligned to a reference) to a consensus fasta file using the samtools mpileup, vcfutils and seqtk pipeline, but I have some non-standard nucleotides (i.e. not A, C, G or T) in my fasta file. I assume that these are artifacts of the BAM file. For instance, I have normal nucleotide sequence, and then I'll have a 'W' or a 'K'. Do these follow the IUB/IUPAC nomenclature, or are they something different?

Thanks

dna fasta next-gen-sequencing bam • 2.8k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by biogirl ▴ 210

1

I actually wonder if those were inserted downstream of the BAM file, likely by vcfutils or seqtk. In order to have IUPAC codes in a BAM file, either (1) the original read had to contain that code or (2) the aligner had to produce it. Both of those seem unlikely.

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Very good point, thank you. I'll investigate the conversion process.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by biogirl ▴ 210

0

It seems it happens in the conversion of the BAM to the fastq. I guess this could be due to SNPs/ploidy: at those sites it can't decide what the consensus nucleotide is.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by biogirl ▴ 210

2

It looks like vcfutils is adding the IUPAC codes then: Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfutils.Pl
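
For reference, 'W' and 'K' are standard IUPAC ambiguity codes (W = A or T, K = G or T), which is what you would expect at a heterozygous site. A minimal Python sketch for decoding them is below; the `decode` helper is a name made up for this illustration and is not part of samtools, vcfutils or seqtk.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def decode(base):
    """Return the possible bases for one consensus character.

    Hypothetical helper for illustration: unambiguous bases (A, C, G, T)
    and any unrecognised character are returned unchanged, in upper case.
    """
    b = base.upper()
    return IUPAC.get(b, b)

if __name__ == "__main__":
    for code in "WK":
        print(code, "->", "/".join(decode(code)))
```

So a 'W' in the consensus would mean the calls at that position supported both A and T.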

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by Devon Ryan 104k

0

That is incredibly handy, thanks for posting that link!

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by biogirl ▴ 210

0

No problem, you eliminated the other likely possibility!

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by Devon Ryan 104k

Ram · Answer 1 · 2015-01-09

1

9.9 years ago

Matt Shirley 10k

You mention that you are making a consensus FASTA file. In addition to Devon's comment, there is another source of IUPAC ambiguous base codes: your reference FASTA file. I'll wager a bet that you're using hg38:

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by Matt Shirley 10k

0

Note that based on the comment linked below, the UCSC version of hg38 replaces all ambiguous bases with 'N'.

GRCh38 BAM with hg38 VCFs

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by Matt Shirley 10k

0

Hi, I have checked my reference already, and there aren't any ambiguous bases in there (it's not hg38 either!). I'm currently checking the pipeline to see where it spits out these bases.
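
In case it helps anyone doing the same check, here is a rough Python sketch of how one might scan a FASTA file (the reference, or the output of each pipeline step) for anything other than A, C, G or T. The function name `count_ambiguous` and the file name in the usage comment are assumptions for illustration only; note that 'N' is counted as well.

```python
from collections import Counter

def count_ambiguous(fasta_lines):
    """Count non-ACGT characters per FASTA record.

    `fasta_lines` is any iterable of lines (e.g. an open file handle).
    Returns a dict mapping record name -> Counter of non-ACGT characters.
    """
    counts = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = Counter()
        elif name is not None:
            counts[name].update(c for c in line.upper() if c not in "ACGT")
    return counts

# Usage sketch (hypothetical file name):
# with open("consensus.fa") as fh:
#     for record, codes in count_ambiguous(fh).items():
#         print(record, dict(codes))
```

Running this on the reference and on the final consensus should show at which step the ambiguity codes first appear.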

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.9 years ago by biogirl ▴ 210