Finding Variants in BAM file reads with mpileup
1
1
Entering edit mode
10.2 years ago
dnlsmy ▴ 10

I'm new to the Bioinformatics field and am still learning how to use all the tools and such, but I've been stumped on something that I feel should be very easy for about a week.

Basically, I want to compare the reads in a BAM file to a FASTA reference sequence and get the changes. I can easily view them with samtools tview but I need to have it in an Excel file for manipulation/concatenation. I can't seem to figure out why mpileup won't give me a straight list of what reads are mutated, unless that's the wrong tool for the job. I'm pretty good at programming, so if the solution is something I need to code for myself I could do it but since the formats are different than what I'm used to, it's been a major headache.

Thanks

What I'm dealing with

mpileup samtools SNP variants • 5.9k views
ADD COMMENT
1
Entering edit mode

Why do you want individual reads? While you can use the C or python API to get those easily enough, it's very likely that this isn't actually what you want to do (even if you think so). Why don't you tell us the end goal that you're after.

BTW, you should almost NEVER use Excel in bioinformatics. That's usually a sure sign that you're doing something wrong.

ADD REPLY
0
Entering edit mode

Thanks, I guess I gravitated towards Excel because at the end I would really like easily readable columns for my non-coding mentor.

I want at the end a table like this:

POSITION (ie X:XXXXXX) CONTEXT(BasePair before&after target) CHANGE(X->Y) READS AFFECTED(# Total) PERCENTAGE (% of total)
ADD REPLY
1
Entering edit mode

Right, so basically you want a VCF file. See Pierre's answer.

ADD REPLY
1
Entering edit mode

what you want is a VCF file. number of reads affected/non-affected will be provided in the DP4 field. The biologists in our lab manage their VCF files with http://www.knime.org .

ADD REPLY
5
Entering edit mode
10.2 years ago

you need to use bcftools to generate a tabular VCF file:

PS: no, you don't want to have it in an Excel file

PS2: no, you really don't want to have it in an Excel file

PS3: Every time you want to have it in an Excel file, God kills a kitten

PS4:

ADD COMMENT
0
Entering edit mode

The PPPS tweet is going to give me nightmares.

ADD REPLY
0
Entering edit mode

Haha, thanks for the laugh! My mentor wants it in Excel preferably so he can make charts, even though he knows it's going to be massive most likely. At the end, I need a table with these columns:

POSITION (ie X:XXXXXX)  CONTEXT(BasePair before&after target)  CHANGE(X->Y)  READS AFFECTED(# Total)  PERCENTAGE (% of total)

Position is being spit out via mpileup, but for some reason I can't seem to find change being displayed, or the # of reads affected. Any help would be welcome, I feel like I'm probably using the tool incorrectly.

ADD REPLY
1
Entering edit mode

see my references about bcftools. See my comment above about knime.org

ADD REPLY
0
Entering edit mode

Sorry to keep bothering you, but I'm going to need some more help with this. If I follow the new workflow, I get a very weird VCF, one where it lists all chromosome positions with correct Reference genome but an <x> for the ALT. If I run it with a couple of the options changed (following the BCFtools Github)

samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz > calls.vcf.gz

Which gives me a more correct looking file, with Reference and ALT specified. Unfortunately, I still don't quite understand how to get the Reads Affected and the Total Reads. DP is supposed to be total reads affected I believe but it doesn't line up from what's visible from

samtools tview

command, as it seems only 12 BP are changed when DP reads 16+ for a chromosome position value. Here's what I'm seeing

I know you can also query the VCF file, if you could tell me the exact tags to search for to get the

POSITION (ie X:XXXXXX)  CONTEXT(BasePair before&after target)  CHANGE(X->Y)  READS AFFECTED(# Total)  PERCENTAGE (% of total)

format I need I would be grateful. I can already get Position and change, but Reads Affected and Total Reads checked are having some issues since I don't know what tag they're under.

ADD REPLY
1
Entering edit mode

You essentially want the DP4 value, since this is the number of reads supported the reference and alt allele after filtering for quality. DP4 values are, in order, (1) reference allele forward, (2) reference allele reverse, (3) alt forward, (4) alt reverse. So the three examples you showed have, in order, 7, 7, and 28 reads supporting the reference allele and 15, 5, 22 supporting the alternate allele. Understandably then, they sample was called heterozygous at each of those positions.

ADD REPLY
0
Entering edit mode

Thanks!! I got bcftools query working and now that I know what the DP4 values mean, I have enough info for my table. Thanks for the clarification!

ADD REPLY

Login before adding your answer.

Traffic: 2633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6