Non-standard nucleotides in my fasta converted from a bam
9.9 years ago
biogirl ▴ 210

Hi,

I have converted my sorted, aligned BAM file (paired-end short reads aligned to a reference) to a consensus fasta file using the samtools mpileup, vcfutils and seqtk pipeline, but I have some non-standard nucleotides (i.e. not A, C, G, T) in my fasta file. I assume that these are artifacts of the BAM file. For instance, I have normal nucleotide sequence, and then I'll have a 'W' or a 'K'. Are these following the IUB/IUPAC nomenclature, or are they something different?

Thanks

dna fasta next-gen-sequencing bam • 2.8k views

I actually wonder if those were inserted downstream of the BAM file, likely by vcfutils or seqtk. In order to have IUPAC codes in a BAM file, either (1) the original read had to contain that code or (2) the aligner had to produce it. Both of those seem unlikely.


Very good point, thank you. I'll investigate the conversion process.


It seems to happen when the BAM is converted to the fastq. I guess this could be due to SNPs/ploidy: at those sites the tool can't decide what the consensus nucleotide is.


It looks like vcfutils is adding the IUPAC codes then: Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfutils.Pl
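To answer the original question directly: 'W' and 'K' are standard IUPAC ambiguity codes (W = A or T, K = G or T), which fits a consensus caller writing one letter for a heterozygous site. Here is a minimal sketch of the lookup in Python. The table is the standard IUPAC nucleotide code; the helper name `expand_iupac` is made up for illustration and is not part of samtools, vcfutils or seqtk.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_iupac(code):
    """Return the set of bases an IUPAC code stands for (hypothetical helper)."""
    return set(IUPAC[code.upper()])

# The two codes mentioned in the question:
print(sorted(expand_iupac("W")))
print(sorted(expand_iupac("K")))
```

A two-base code at a position in the consensus suggests the reads there support both alleles, so it is worth looking at the pileup at that site before treating it as an error.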


That is incredibly handy, thanks for posting that link!


No problem, you eliminated the other likely possibility!

9.9 years ago

You mention that you are making a consensus FASTA file. In addition to Devon's comment, there is another source of IUPAC ambiguous base codes: your reference FASTA file. I'll wager a bet that you're using hg38:


Note that based on the comment linked below, the UCSC version of hg38 replaces all ambiguous bases with 'N'.

GRCh38 BAM with hg38 VCFs


Hi, I have checked my reference already, and there aren't any ambiguous bases in there (it's not hg38 either!). I'm currently checking the pipeline to see where it spits out these bases.
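For anyone who wants to run the same check on a reference or a consensus file, here is a rough sketch of one way to do it in Python: tally every character in the sequence lines that is not A, C, G or T. The function name `count_non_acgt` and the tiny in-memory example are made up for illustration; for a real file you would pass an open file handle instead of the `StringIO` object.

```python
from collections import Counter
import io

def count_non_acgt(fasta_lines):
    """Count characters other than A/C/G/T in FASTA sequence lines.

    Header lines (starting with '>') and blank lines are skipped.
    Lowercase (soft-masked) bases are upper-cased before counting.
    """
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        for base in line.upper():
            if base not in "ACGT":
                counts[base] += 1
    return counts

# Made-up example record; for a real file use: open("consensus.fa")
example = io.StringIO(">chr_demo\nACGTWACG\nNNKACGTa\n")
print(count_non_acgt(example))
```

Running it on both the reference and the consensus shows whether the ambiguity codes were already in the input or only appear after the conversion step.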
