Python - Samtools mpileup find snps with allele frequency
0
0
Entering edit mode
2.4 years ago
pablosolar.r ▴ 20

Hi all,

I am starting to use samtools mpileup (I am new) and I need to find snps with a certain allele frequency.

For this I have a fasta and a bam of a specific region and I have done the following in python, with the command I have to use:

def mpileup(bam, gz, chrom, cord1, cord2):
    """
    This function runs samtools mpileup for a certain file
    """
    command = f'samtools mpileup -A -B -d 10000 --fasta-ref {gz} -r chr{chrom}:{cord1}-{cord2} -q 0 -Q 20 {bam}'

    with Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True) as mpileup_results:
        for line in mpileup_results.stdout:
            print(line.decode('utf-8').strip())

This pipe produces the according mpileup output:

...
chr21   30585946        C       11      ,,AA.,.,.^:.^:. SV^]G``R`^a
chr21   30585947        A       11      ,,...,G,...     __aaT\a]aab
chr21   30585948        C       12      a,...,.,...^:.  babb\YaQ`aWa
chr21   30585949        A       12      ,,...,.,....    ^^aaY^a[aaaa
chr21   30585950        C       12      ,,...,.,....    S^ba_^aJ_^aa
chr21   30585951        A       12      ,,...,.,....    XPaa[[b]a[bV
chr21   30585952        C       12      ,,...,.,....    Y_aa_SbKbaa`
chr21   30585953        A       12      cc...,.,....    [S[_V`a_`ZaZ
...

But from here I'm stuck and I don't quite know how to proceed to find the SNPs and their allele frequency as this is the first time I'm working with this.

Could you help or give me any clue for this? Thank you very much!

snp mpileup python samtools • 1.9k views
ADD COMMENT
0
Entering edit mode

Don't do this by hand. Use a variant caller. samtools mpileup is itself is afaik deprecated or at least superseded. These days people use bcftools for all that, returning a VCF file with all these information you are asking about, see https://samtools.github.io/bcftools/howtos/variant-calling.html

If you are anyway using a python-ish way of wrapping this analysis you may consider learning snakemake, a workflow manager based on python with similar syntax which in the longterm will proof useful.

ADD REPLY
0
Entering edit mode

Hello again,

Thank you! So i've done the following:

samtools mpileup -A -B -g -d 10000 --fasta-ref myfa.fa.gz -r chr21:9416181-48115040 -q 0 -Q 20 mybam.bam | bcftools call -m

And now I am getting the following:

...
chr21   47745213        .       A       .       54.5868 .       DP=1;MQ0F=0;AN=2;DP4=0,1,0,0;MQ=25      GT      0/0
chr21   47745214        .       G       .       79.5871 .       DP=2;MQSB=1;MQ0F=0;AN=2;DP4=1,1,0,0;MQ=25       GT      0/0
chr21   47745215        .       G       .       79.5871 .       DP=2;MQSB=1;MQ0F=0;AN=2;DP4=1,1,0,0;MQ=25       GT      0/0
chr21   47745216        .       C       .       79.5871 .       DP=2;MQSB=1;MQ0F=0;AN=2;DP4=1,1,0,0;MQ=25       GT      0/0
chr21   47745217        .       C       .       79.5871 .       DP=2;MQSB=1;MQ0F=0;AN=2;DP4=1,1,0,0;MQ=25       GT      0/0
chr21   47745218        .       G       .       79.5871 .       DP=2;MQSB=1;MQ0F=0;AN=2;DP4=1,1,0,0;MQ=25       GT      0/0
chr21   47745219        .       G       .       54.5868 .       DP=1;MQ0F=0;AN=2;DP4=1,0,0,0;MQ=25      GT      0/0
chr21   47745220        .       G       .       54.5868 .       DP=1;MQ0F=0;AN=2;DP4=1,0,0,0;MQ=25      GT      0/0
chr21   47745221        .       G       .       54.5868 .       DP=1;MQ0F=0;AN=2;DP4=1,0,0,0;MQ=25      GT      0/0
chr21   47745222        .       A       .       54.5868 .       DP=1;MQ0F=0;AN=2;DP4=1,0,0,0;MQ=25      GT      0/0
...

But I don't see the AF parameter in INFO column. Am I missing something??? How could I calculate it??

On the other hand, each row corresponds to a SNP?

Thank you very much!!

ADD REPLY
0
Entering edit mode

It's a long time I did variant calling, others may comment please, but maybe something like https://samtools.github.io/bcftools/howtos/plugin.fill-tags.html

ADD REPLY
0
Entering edit mode

Hello ATPoint again!

First of all, thank you very much for your information and quick replies!

Well, I've deep into the url you shared to me and now I got correctly de AF thourgh this command:

samtools mpileup -A -B -g -d 10000 --fasta-ref myfa.fa.gz -r chr21:9416181-48115040 -q 0 -Q 20 mybam.bam | bcftools call -m | bcftools +fill-tags -- -t AF

Getting this output (some rows only)

...
chr21   9825836 .   A   .   104.587 .   DP=4;MQ0F=0;AN=2;DP4=4,0,0,0;MQ=25  GT  0/0
chr21   9825837 .   C   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825838 .   T   C   5.62349 .   DP=3;VDB=0.02;SGB=-0.453602;RPB=1;MQB=1;BQB=1;MQ0F=0;AC=1;AN=2;DP4=1,0,2,0;MQ=25;AF=0.5 GT:PL   0/1:37,0,16
chr21   9825839 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825840 .   C   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825841 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825842 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825843 .   C   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825844 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825845 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825846 .   C   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825847 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825848 .   G   .   104.587 .   DP=4;MQ0F=0;AN=2;DP4=4,0,0,0;MQ=25  GT  0/0
chr21   9825849 .   T   G   16.3384 .   DP=5;VDB=0.0532881;SGB=-0.511536;RPB=0;MQB=1;BQB=0;MQ0F=0;AC=1;AN=2;DP4=2,0,3,0;MQ=25;AF=0.5    GT:PL   0/1:49,0,31
chr21   9825850 .   G   .   115.588 .   DP=5;MQ0F=0;AN=2;DP4=5,0,0,0;MQ=25  GT  0/0
chr21   9825851 .   G   .   115.588 .   DP=5;MQ0F=0;AN=2;DP4=5,0,0,0;MQ=25  GT  0/0
chr21   9825852 .   T   .   23.7614 .   DP=5;SGB=-0.379885;RPB=1;MQB=1;BQB=1;MQ0F=0;AN=2;DP4=4,0,1,0;MQ=25  GT  0/0
chr21   9825853 .   G   .   115.588 .   DP=5;MQ0F=0;AN=2;DP4=5,0,0,0;MQ=25  GT  0/0
chr21   9825854 .   G   .   115.588 .   DP=5;MQ0F=0;AN=2;DP4=5,0,0,0;MQ=25  GT  0/0
chr21   9825855 .   G   .   104.587 .   DP=4;MQ0F=0;AN=2;DP4=4,0,0,0;MQ=25  GT  0/0
chr21   9825856 .   G   .   104.587 .   DP=4;MQ0F=0;AN=2;DP4=4,0,0,0;MQ=25  GT  0/0
chr21   9825857 .   G   .   104.587 .   DP=4;MQ0F=0;AN=2;DP4=4,0,0,0;MQ=25  GT  0/0
chr21   9825858 .   G   .   74.587  .   DP=2;MQ0F=0;AN=2;DP4=2,0,0,0;MQ=25  GT  0/0
chr21   9825859 .   A   .   13.6275 .   DP=2;SGB=-0.379885;RPB=1;MQB=1;BQB=1;MQ0F=0;AN=2;DP4=1,0,1,0;MQ=25  GT  0/0
chr21   9825860 .   G   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825861 .   C   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
chr21   9825862 .   C   .   91.5872 .   DP=3;MQ0F=0;AN=2;DP4=3,0,0,0;MQ=25  GT  0/0
...

My doubts now are the following:

  • Does each row correspond to a SNP or would you have to do some pre-processing of each row to get them? If so, what should be done?

  • There are quite a few rows that, as you can see, do not calculate the AF. Can they be discarded without fear? Is it valid to keep only those that have a value in AF?

It's a bit frustrating when you have no previous experience :)

Thank you very much to all!

ADD REPLY
0
Entering edit mode

Hello, If you perform calling on a single sample, you may want to filter on AC rather than AF. As a matter of fact, samtools does not compute AF if no alternate allele has been found. And just to have it clear, I don't really get the point in filtering on allelic frequency across 1 single sample. If you want to compute your allelic frequency across a population, you should use a bcftools merge prior to the fill-tags plugin in order to compute your allelic frequency across all your samples.

Or do you want to use an external allelic frequency (in Gnomad for example)?

ADD REPLY
0
Entering edit mode

Hello raphael!

The reality is that I have no idea, so I ask for help to see if you can make me understand a little.

The thing is that I have a fasta file for the chr21 and a bam file of the sequencing of that region and what I have to do is, through mpileup, to obtain those snps with an allele frequency between two values... I have managed to run mpileup and then get that vcf as I said, but the reality is that the theory I don't know and that's why I'm asking.

Thank you very much!

ADD REPLY
0
Entering edit mode

Question is: how do you define allele frequency? Is it allele frequency among your samples or among general population?

ADD REPLY
0
Entering edit mode

I don't know... I just received the fasta and the bam without any other information... beyond the duty to find snps with an allelic frequency range...

ADD REPLY
0
Entering edit mode

They probably want to know whether a variation is homo- or heterozygous and for this it is the AF per sample that matters.

ADD REPLY
0
Entering edit mode

Well, without this information it certainly is difficult to help you out... But if you just want to look sample per sample, Allele Count makes more sense than Allele Frequency in my opinion since you are in fact just trying to determine if genotypes are heterozygous or homozygous as ATpoint says

ADD REPLY
0
Entering edit mode

Thank you guys very much.

Although I don't quite understand all this, I have applied a filter to the resulting VCF so that it keeps those snps (rows) in which 'AF' appears in the INFO column and, after that, a total of 27 end up with AF values only 1 or 0.5. I don't know if it makes sense, but well....

ADD REPLY

Login before adding your answer.

Traffic: 1758 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6