Entering edit mode
2.4 years ago
pablosolar.r
▴
20
Hi all,
I am starting to use samtools mpileup (I am new) and I need to find snps with a certain allele frequency.
For this I have a fasta and a bam of a specific region and I have done the following in python, with the command I have to use:
def mpileup(bam, gz, chrom, cord1, cord2):
"""
This function runs samtools mpileup for a certain file
"""
command = f'samtools mpileup -A -B -d 10000 --fasta-ref {gz} -r chr{chrom}:{cord1}-{cord2} -q 0 -Q 20 {bam}'
with Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True) as mpileup_results:
for line in mpileup_results.stdout:
print(line.decode('utf-8').strip())
This pipe produces the according mpileup output:
...
chr21 30585946 C 11 ,,AA.,.,.^:.^:. SV^]G``R`^a
chr21 30585947 A 11 ,,...,G,... __aaT\a]aab
chr21 30585948 C 12 a,...,.,...^:. babb\YaQ`aWa
chr21 30585949 A 12 ,,...,.,.... ^^aaY^a[aaaa
chr21 30585950 C 12 ,,...,.,.... S^ba_^aJ_^aa
chr21 30585951 A 12 ,,...,.,.... XPaa[[b]a[bV
chr21 30585952 C 12 ,,...,.,.... Y_aa_SbKbaa`
chr21 30585953 A 12 cc...,.,.... [S[_V`a_`ZaZ
...
But from here I'm stuck and I don't quite know how to proceed to find the SNPs and their allele frequency as this is the first time I'm working with this.
Could you help or give me any clue for this? Thank you very much!
Don't do this by hand. Use a variant caller.
samtools mpileup
is itself is afaik deprecated or at least superseded. These days people usebcftools
for all that, returning a VCF file with all these information you are asking about, see https://samtools.github.io/bcftools/howtos/variant-calling.htmlIf you are anyway using a python-ish way of wrapping this analysis you may consider learning snakemake, a workflow manager based on python with similar syntax which in the longterm will proof useful.
Hello again,
Thank you! So i've done the following:
And now I am getting the following:
But I don't see the AF parameter in INFO column. Am I missing something??? How could I calculate it??
On the other hand, each row corresponds to a SNP?
Thank you very much!!
It's a long time I did variant calling, others may comment please, but maybe something like https://samtools.github.io/bcftools/howtos/plugin.fill-tags.html
Hello ATPoint again!
First of all, thank you very much for your information and quick replies!
Well, I've deep into the url you shared to me and now I got correctly de AF thourgh this command:
Getting this output (some rows only)
My doubts now are the following:
Does each row correspond to a SNP or would you have to do some pre-processing of each row to get them? If so, what should be done?
There are quite a few rows that, as you can see, do not calculate the AF. Can they be discarded without fear? Is it valid to keep only those that have a value in AF?
It's a bit frustrating when you have no previous experience :)
Thank you very much to all!
Hello, If you perform calling on a single sample, you may want to filter on AC rather than AF. As a matter of fact, samtools does not compute AF if no alternate allele has been found. And just to have it clear, I don't really get the point in filtering on allelic frequency across 1 single sample. If you want to compute your allelic frequency across a population, you should use a
bcftools merge
prior to thefill-tags
plugin in order to compute your allelic frequency across all your samples.Or do you want to use an external allelic frequency (in Gnomad for example)?
Hello raphael!
The reality is that I have no idea, so I ask for help to see if you can make me understand a little.
The thing is that I have a fasta file for the chr21 and a bam file of the sequencing of that region and what I have to do is, through mpileup, to obtain those snps with an allele frequency between two values... I have managed to run mpileup and then get that vcf as I said, but the reality is that the theory I don't know and that's why I'm asking.
Thank you very much!
Question is: how do you define allele frequency? Is it allele frequency among your samples or among general population?
I don't know... I just received the fasta and the bam without any other information... beyond the duty to find snps with an allelic frequency range...
They probably want to know whether a variation is homo- or heterozygous and for this it is the AF per sample that matters.
Well, without this information it certainly is difficult to help you out... But if you just want to look sample per sample, Allele Count makes more sense than Allele Frequency in my opinion since you are in fact just trying to determine if genotypes are heterozygous or homozygous as ATpoint says
Thank you guys very much.
Although I don't quite understand all this, I have applied a filter to the resulting VCF so that it keeps those snps (rows) in which 'AF' appears in the INFO column and, after that, a total of 27 end up with AF values only 1 or 0.5. I don't know if it makes sense, but well....