Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File
9
9
Entering edit mode
10.7 years ago
komal.rathi ★ 4.1k

Hi everyone,

I have a number of Human Heart RNASeq samples. I have generated a mpileup file on each of them using a bed file containing a list of positions for which I want the mpileup to be made. I have included the -B option to turn off the BAQ computation in mpileup.

Single sample mpileup:

samtools mpileup -uf hg19.fa -B -l snps_chr17.bed sample1.bam > sample1.mpileup

Multiple sample mpileup (all samples):

samtools mpileup -uf hg19.fa -B -l snps_chr17.bed -b bam_files_list.bam > all_samples.mpileup

Now, I want to output the frequency of nucleotides A, T, G and C at each of those positions. There is a perl script pileup2baseindel which generates exactly the output that I want, only that it is incorrect (inconsistent with what is seen in IGV). This program takes a single/multi-sample mpileup file and produces the following output. But like I said, the output is incorrect.

I have searched a lot but can't seem to find any program that does the same. Does anyone know any tool/script that would give me such an output with/without taking the mpileup file as an input. Suggestions on alternatives/options are welcomed.

[UPDATE] There was a bug in the perl script and I have updated the code, it works fine now. I have contacted the author and as soon as the author responds, I will send it to him so that he can update it. I would still like to get suggestions and try out other programs.

[UPDATE] Thanks everyone who helped me with this problem. Each of your suggestions worked but I accepted Chris Miller's answer recommending the use of bam-readcount because of its easy of use and the elaborate output that it generates. It directly takes, as input, a bam file as well as a list of positions in bed format for which you want the nucleotide frequency. So you don't need to create an intermediate mpileup file - as in pileup2baseindel or run the program against positions that you are not interested in - as in pysamstats.

Thanks!

mpileup • 33k views
ADD COMMENT
6
Entering edit mode
10.7 years ago

Another option is just to run bam-readcount, giving it the bams and specified positions. It's more straightforward than creating an mpileup and parsing that ugly string.

ADD COMMENT
0
Entering edit mode

Thanks! I'll give it a try and let you know!

ADD REPLY
0
Entering edit mode

I'm using Bam-readcount, and I notice that the Depth calculated for each position does not correspond at all to the coverage that I observe on IGV. It is much lower. Could someone tell me where does the error? Thank you

ADD REPLY
0
Entering edit mode

Hi Saad.chadi, did you resolve the problem? I'm trying to run Bam-readcount, but i have the same problem. If you have any solution, please let me know. Thank in you in advance for your time. Regards

ADD REPLY
0
Entering edit mode

I was able to run bam-readcount, but in addition to counts of A, C, G, T, N, got lines like these with additional count at the end. What do these fields mean? Thanks!

chr6    1135939 T   1311    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:536:59.97:37.46:59.97:268:268:0.44:0.01:50.05:268:0.51:115.43:0.34    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:775:60.00:37.71:60.00:387:388:0.43:0.00:1.53:387:0.56:106.57:0.38 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +A:2:60.00:0.00:60.00:1:1:0.36:0.01:0.00:1:0.79:72.00:0.49
chr2    239563579   G   1528    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:685:60.00:37.66:60.00:303:382:0.47:0.02:75.21:303:0.75:104.37:0.52    G:841:60.00:37.77:60.00:386:455:0.46:0.00:1.38:386:0.75:100.58:0.51 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  -G:2:60.00:0.00:60.00:1:1:0.38:0.04:62.00:1:0.79:80.00:0.49
chr12   6945914 C   2365    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:1135:60.00:37.51:60.00:733:402:0.51:0.00:1.75:733:0.66:111.12:0.58    G:1228:60.00:37.55:60.00:786:442:0.56:0.01:38.74:786:0.65:107.84:0.57   T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  -C:2:60.00:0.00:60.00:1:1:0.41:0.02:0.00:1:0.17:54.00:0.48
ADD REPLY
0
Entering edit mode

They are insertions and deletions, IIRC. Easy enough to pull up your reads in IGV and verify that.

ADD REPLY
4
Entering edit mode
10.6 years ago

I wrote a tool to do exactly this; the function is called pileup2acgt, is written in python2.7 (compatible with the much faster pypy-2*). If you want to give it a try it works with no dependencies a part python2.7 core libraries.

The tool is part of Sequenza, a program to detect copy number and variants from tumor DNA-seq. The script, sequenza-utils.py, is downloadable from here, or from CRAN, at the sequenza page and following the instruction on the manual.

The general usage is available on an (already outdated) wiki.

I usually generate the pileup on the fly with samtools, this also allows to select a specific region of the BAM/SAM using samtools view (here with a slightly unnecessary samtools pipe querying TP53 site):

samtools view -f 2 -u my.bam chr17:7570000–7590000 | samtools rmdup - - | samools mpileup -f hg19.fa -q 20 -  |  sequenza-utils.py pileup2acgt - | gzip > result.txt.gz

a quick usage tip, for your purpose:

cat s.mpileup
1    95369073    G    254    C$.$.$c$,$,$CCccccCCC.CCC..,cccccCc,CCC,c,..,c.c.cCCc.$,,c.Cc,Cc,CCc,ccC.C,c,c,,ccCCcCCC,,cc,CCc,c.Ccc,,CC,C,c,,CCCCc.c,,.c,,,cC.Cc,cc,cc.c,.C,ccc,C,C,,cCCCcc.,,,c,cc,cccc,cccCcc,,cc,cc,,,cCcccccc.ccc,ccccc.,c,cc,,,cCccc.ccc,c,c,,c.ccCc,,,c,ccc,c,cccc,,,cc,,c,c    BA?GHGB-*,**8:9<><=CBG88888>8H=8@H8HD<H8A8B8<;8#HH8B?8H;8A:=8H88:>9H8F9HH88?8999?HH98H?=<H;F<::GH:=H<H8HH:?<:8B8HHB8HCH8>H88H8>G:8F8H@8H888H9H8=H8;:888EHHG:H:8F9:88H:88888HG:8H88HHH98<:;8;8H88<H;;8;8HH8F9:DHE88889B>68H<E<HG;E:88:HFH8E?>8H;H><9;FHH:>?@:9- 

python2.7 sequenza-utils.py pileup2acgt s.mpileup
chr    n_base    ref_base    read.depth    A    C    G    T    strand
1    95369073    G    254    0    152    95    0    0:47:22:0

It also check the quality of the bases, for example, if you want to lower the default threshold of quality (20):

python sequenza/exec/sequenza-utils.py pileup2acgt -q 10 s.mpileup
chr    n_base    ref_base    read.depth    A    C    G    T    strand
1    95369073    G    254    0    155    95    0    0:48:22:0

or anyway

python sequenza/exec/sequenza-utils.py pileup2acgt -h

will describe the available optional arguments.

It is reasonably fast as it is, it is also implemented with the multiprocessing library, so if it's needed it could use more processes to speed up a little bit. Used with pypy is faster than everything else that aims for a similar output, as far as I know.

the strand field, counts the reads on the forward strand for the specific base (A:C:G:T, respectively.)

The read depth is left unchanged from the mpileup, while the sum of ACGT is the number of reads above the quality filter (so it's <=read.depth).

It also handle lines with indels (although it passes on them without store them anywhere), it accepts .gz files or input from STDIN (using "-" as file name).

I actually use it to roughty call variants, and it's consistent with IGV.

Hope you find it useful.

ADD COMMENT
0
Entering edit mode

Hi favero!

I calculated each base as your methods as follwing:

samtools view -Duf my.bam | samtools rmdup - - | samools mpileup -f genome.fa -q 20 -  |  sequenza-utils.py pileup2acgt - | gzip > result.txt.gz

But I got some odd results and couldn't figure out why?

n_base    ref_base    read.depth    A    C    G    T    strand
16768     A           257           257  0    0    0    1:0:0:0
16769     C           253           0    252  0    0    0:1:0:0
16770     G           231           0    1    230  0    0:0:0:0
16771     G           219           0    0    219  0    0:0:0:0
16772     G           210           0    0    209  0    0:0:0:0
16773     A           179           76   0    0    0    0:0:0:0
16774     A           97            50   0    0    0    0:0:0:0
16775     C           57            0    38   0    0    0:0:0:0

as the results showed, the sum of the ACGT count on each base is not equal to the read.depth and the strand is 0:0:0:0. e.g. pos 16775, the sum of ACGT is 38, is less than the read.depth=57 and the strand is also 0:0:0:0

can you figure out why? thanks

ADD REPLY
0
Entering edit mode

Hi Lilian, this is awkward, I've just seen your message, almost one year after.

The depth and the sum of the bases differs because the depth is the read depth measured by samtools mpileup, while the bases recorded by the python script are only those passing the default quality score (which is adjustable using the -q argument).

My apologies for the late reply.

ADD REPLY
0
Entering edit mode

Hi Favero, I'm having the same problem than Lilian. I modified the -q score, but it doesn't work. What do you think about it? Thank you in advance for your time. Regards

ADD REPLY
0
Entering edit mode

Hi Lilian, I'm having the same problem. I modified the -q score, but it doesn't work. How do you resolve this? Thank you in advance for your help. Regards

ADD REPLY
0
Entering edit mode

You can check with IGV if in the specific positions are present indels or other things that can justify the mismatch from depth and nucleotide counts.

ADD REPLY
2
Entering edit mode
10.7 years ago

Hi- I used pysamstats briefly and it seems pretty handy.

With this command:

pysamstats \
    --fasta ref.fa \
    --type variation \
    aln.bam

You get a table with the following columns (self explanatory, _pp stands for properly paired):

chrom 
pos
ref
reads_all
reads_pp
matches
matches_pp
mismatches
mismatches_pp
deletions 
deletions_pp
insertions
insertions_pp
A
A_pp
C
C_pp
T
T_pp
G
G_pp
N
N_pp
ADD COMMENT
0
Entering edit mode

That looks promising! I'll give it a try and let you know. Thanks!

ADD REPLY
0
Entering edit mode

dariober: I used pysamstats and works very well. But even when using just a single chromosome using --chromosome, --start and --end options, it takes a long time. So I subset my bam file using a bed file which contains my positions of interest:

samtools view -hb sample.bam -L snp_pos.bed > sample_subset.bam

I have no clue why, but pysamstats is giving me an error:

File "/usr/local/bin/pysamstats", line 123, in <module>
window_offset=options.window_offset)
File "pysamstats.pyx", line 842, in pysamstats.write_variation (pysamstats.c:15103)
write_stats(stat_variation, fieldnames, *args, **kwargs)
File "pysamstats.pyx", line 3329, in pysamstats.write_stats (pysamstats.c:44847)
for rec in recs:
File "pysamstats.pyx", line 174, in stat_pileup_withref_default (pysamstats.c:6983)
it = samfile.pileup(reference=chrom, start=start, end=end, truncate=truncate)
File "csamtools.pyx", line 1254, in pysam.csamtools.Samfile.pileup (pysam/csamtools.c:14071)
ValueError: no index available for pileup

When I use the whole bam file/whole bam file with --chromosome, --start and --end it works fine. But when I subset my bam file and use it, it shows me an error. Have you ever dealt with such an error?

ADD REPLY
2
Entering edit mode

As I said, I used pysamstats only once or twice, so I'm just guessing... Looking at the last line of the error ValueError: no index available for pileup I suspect sample_subset.bam should be indexed with samtools index sample_subset.bam.

ADD REPLY
2
Entering edit mode
10.7 years ago


import sys,re,fileinput

Argument = []
Argument = sys.argv[1:] 

if (len(Argument)) < 3:
        print "Usage: Input_pileup_file Column_with_information_about_read_bases(usually_column_9_in_pileup) Output_file" 
        sys.exit()

File_Pileup = Argument[0]
index = int(Argument[1])-1
output = open(str(Argument[2]),"w")


nucleotides = ["A","T","C","G","a","t","c","g"]
complement = {'a':'T','g':'C','t':'A','c':'G'}

output.write("Chromosome\tCoordinate\tReference_base")

Frequency_bases = {"A":0,"T":0,"C":0,"G":0}

for base in sorted(Frequency_bases.keys()):
        output.write("\t"+str(base))

output.write("\n")

for line in fileinput.input([File_Pileup]):
        rowlist = []
        rowlist = (line.rstrip("\n")).split('\t')
        rowlist[2] = rowlist[2].upper()

        Frequency = {"A":0,"T":0,"C":0,"G":0}

        if line.startswith("#"):
                continue
        else:
                output.write(str(rowlist[0])+"\t"+str(rowlist[1])+"\t"+str(rowlist[2]))
                for i in rowlist[index]:
                        if i == ".":
                                Frequency[rowlist[2]] = Frequency[rowlist[2]] + 1
                                continue
                        if i == ",":
                                Frequency[rowlist[2]] = Frequency[rowlist[2]] + 1
                                continue

                        if i in nucleotides:
                                if i.isupper():
                                        Frequency[i] = Frequency[i]+1

                                else:
                                        Frequency[complement[i]] = Frequency[complement[i]] + 1

        for base in sorted(Frequency.keys()):
                output.write("\t"+str(Frequency[base]))

        output.write("\n")

output.close()

ADD COMMENT
0
Entering edit mode

The A,T,C,G counts are based on forward strand of the reference genome. In other words, if a read that was aligned on the reverse strand has base "T" for a position, then the frequency of "A" will be incremented as the reverse compliment of that read will render "A" to the forward strand.

ADD REPLY
0
Entering edit mode

I think this script will not handle correctly insertions and deletions. E.g. in mpileup line

seq3 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<<

CACC will be counted as bases not as deletion.

Also, the character following ^ in the pileup should be skipped as it reports the mapping quality (which could be A, C, G, T).

ADD REPLY
0
Entering edit mode

Thanks! Looks good! I will give this a try too! Did you just write this code?

ADD REPLY
1
Entering edit mode

Yup, just wrote it. Please read my other comment to make sure that the code is behaving the way you want.

ADD REPLY
0
Entering edit mode

Yes I read it, although its not exactly what I want I can edit it in case I use it. I am currently trying to figure out pysamstats as suggested by dariober and bam-readcount as suggested by Chris Miller. Both the tools take bam file directly without creating an mpileup file.

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode
9.7 years ago

You can use pileup and this function in R (with Rsamtools):

pileupFreq <- function(pileupres) {
    nucleotides <- levels(pileupres$nucleotide)
    res <- split(pileupres, pileupres$seqnames)
    res <- lapply(res, function (x) {split(x, x$pos)})
    res <- lapply(res, function (positionsplit) {
        nuctab <- lapply(positionsplit, function(each) {
                        chr = as.character(unique(each$seqnames))
                        pos = as.character(unique(each$pos))
                        tablecounts <- sapply(nucleotides, function (n) {sum(each$count[each$nucleotide == n])})
                        c(chr,pos, tablecounts)
                    })
        nuctab <- data.frame(do.call("rbind", nuctab),stringsAsFactors=F)
        rownames(nuctab) <- NULL
        nuctab
    })
    res <- data.frame(do.call("rbind", res),stringsAsFactors=F)
    rownames(res) <- NULL
    colnames(res) <- c("seqnames","start",levels(pileupres$nucleotide))
    res[3:ncol(res)] <- apply(res[3:ncol(res)], 2, as.numeric)
    res
}

Example:

library(VariantAnnotation)
library(Rsamtools)
vcf <- readVcf("vcfExample_chr17.vcf", "hg19")
bf <- BamFile("Pol2ChIP_chr17.bam")
res <- pileup(bf, scanBamParam=ScanBamParam(which=rowData(vcf))
pileupFreq(res)

Also, read more about it here: https://seqqc.wordpress.com/2015/03/10/calculate-nucelotide-frequency-with-rsamtools-pileup/

ADD COMMENT
0
Entering edit mode

inesdesantiago Can you get the pileup string, reference base, insertions and deletions at a particular position using Rsamtools?

ADD REPLY
0
Entering edit mode
10.7 years ago

If the output is inconsistent with what is seen in IGV, be sure that you're taking into account any mpileup params that might cause it to exclude reads (-Q is a likely candidate, which excludes bases with mapping qualities less than 13 by default)

ADD COMMENT
0
Entering edit mode

I forgot to mention that I used -B to turn off the BAQ computation when using samtools mpileup. I have edited my code to reflect that.

ADD REPLY
0
Entering edit mode

-Q is still set to 13 by default. Are there a bunch of low-quality bases that show up in IGV, but wouldn't be considered by samtools?

ADD REPLY
0
Entering edit mode

I haven't seen any low quality bases showing up in IGV when I was checking it. Although the problem is not only in the counts, it is also in the nucleotide itself. I went through the output of the perl script thorougly, there were positions that were consistent with IGV but there were also positions where sometimes the counts did not agree and sometimes even the nucleotide was different. For e.g. if IGV showed 5 Gs and 2 Ts, my output would show 5 As and 2 Cs and it wasn't even a pattern, it was just random.

I also tried it again by using the -B (turn off BAQ computation) and -Q 0 (skip bases with baseQ/BAQ smaller than 0) but nothing changed. So it has something to do with the perl script.

ADD REPLY
0
Entering edit mode
10.7 years ago

I am not sure if this is the case with you but the default settings for mpileup is to filter reads with bitwise flag 0X704. So for pileup generation the following reads will not considered at all from the bam files:

  1. 0x0400 (aka 1024 or "d") duplicate
  2. 0x0200 (aka 512 or "f") failed QC
  3. 0x0100 (aka 256 or "s") non primary alignment
  4. 0x0004 (aka 4 or "u") unmapped

May be IGV doesnt filter out some of these flags. As a result, you may be seeing extra alignments in IGV but not in your count file.

ADD COMMENT
0
Entering edit mode

Actually there was an error in the perl script. I can't reach the author but I have updated the code and it works fine. But I am still looking for alternatives because there are very few tools out there for this purpose.

ADD REPLY
0
Entering edit mode
10.0 years ago

Could you sent me the pileup2baseindel script that you changed? or tell me how can I find the bug?

ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Hello Komal, could you send the pileup2baseindel script that you changed again? the link posted here doesn't work. appreciate your help!

ADD REPLY
0
Entering edit mode
22 months ago
Jan Röslein ▴ 10

Here is a multihreaded option for you!

#!/usr/bin/env python3

import argparse
import csv
import multiprocessing
from collections import Counter
from typing import List, Tuple

def count_bases(positions: List[Tuple[str, int, str, str]]) -> List[Tuple[str, int, int, int, int, int]]:
    base_counts = []
    for seq_name, pos, ref_base, reads in positions:
        counts = Counter(reads.replace(",", ref_base).replace(".", ref_base))
        base_counts.append((seq_name, pos, counts['A'], counts['C'], counts['G'], counts['T']))
    return base_counts


def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def main(args):
    with open(args.input_file, 'r') as f:
        input_data = f.readlines()
    base_counts = []
    positions = []
    for line in input_data:
        elements = line.strip().split('\t')
        positions.append((elements[0], int(elements[1]), elements[2], elements[4]))
    with multiprocessing.Pool(processes=args.threads) as pool:
        base_counts = list(pool.imap(count_bases, chunks(positions, 1000)))
    base_counts = [item for sublist in base_counts for item in sublist]
    with open(args.output_file, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['Sequence name', 'Position', 'A', 'C', 'G', 'T'])
        writer.writerows(base_counts)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Count bases in mpileup file')
    parser.add_argument('input_file', help='Input mpileup file')
    parser.add_argument('output_file', help='Output csv file')
    parser.add_argument('-t', '--threads', default=multiprocessing.cpu_count(), type=int, help='Number of threads to use')
    args = parser.parse_args()
    main(args)
ADD COMMENT

Login before adding your answer.

Traffic: 2172 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6