Nucleotide per-position from bam file and filter on quality
2
0
Entering edit mode
8.4 years ago
VHahaut ★ 1.2k

Hello!

I would like to extract from a bam file (or a series of bam) the number of ATCG at each position while filtering for the one that have a low phred score.

I found several approaching solutions (among them pysamstats) but not one that would do the job.

Can someone have a program/script that would solve this issue?

Thanks in advance!

SNP • 3.2k views
ADD COMMENT
2
Entering edit mode
8.4 years ago

I wrote a SAM2tsv: https://github.com/lindenb/jvarkit/wiki/SAM2Tsv, filter on quality, group by contig/pos

ADD COMMENT
0
Entering edit mode

A all-in-one package would have been better but yours did a perfectly good job. It'll just require some digging in the 16,443,247 lines that it produced^^

Thanks!

ADD REPLY
0
Entering edit mode

If someone is interested by the next step:

GOAL: For each position of a chromosome, find the ATCGN content while filtering for the low phred score:

# Python:
import time

# Phred values (Illumina)
qual=list("""!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI""")
score=range(0,41,1)
phred=dict(zip(qual, score))

# Threshold phred score once converted
threshold=20

# Chr or sequence length
length=1000000

# Create a list, long as length chr, containing ATCGN information for each position
# Take as input the sam2Tsv file from https://github.com/lindenb/jvarkit/wiki/SAM2Tsv

chr=[dict(zip(('A','T','C','G','N'), (0,0,0,0,0))) for x in range(1,length)]
input='/where/is/the/inputfile.tsv'
output='/where/to/output/xxxxxxx.txt'

# Report per position the the amount of ATCGN
start_time = time.time()

with open(input) as f:
        lines=f.readlines()
        for line in lines:
            if line.split("\t")[6].isdigit():
                if phred[line.split("\t")[5]] > threshold:
                    # If phred score is > than threshold, increase count for this base at this position
                    chr[int(line.split("\t")[6])][line.split("\t")[4]]=chr[int(line.split("\t")[6])][line.split("\t")[4]] + 1

print("--- %s seconds ---" % (time.time() - start_time))

with open(output, "w") as output:
    output.write('#ChrPos'+'\t'+'A'+'\t'+'T'+'\t'+'C'+'\t'+'G'+'\t'+'N'+'\n')
    for i in range(1,len(chr)):
        output.write(str(i)+'\t'+str(chr[i].values()).strip('[]').replace(', ','\t')+'\n')
ADD REPLY
0
Entering edit mode
4.3 years ago
khalilard • 0

Hi,

I know this post is old but I was looking for a solution to the same problem (preferably as a one-line bash code) and after a lot of head-scratching and googling, I came up with the following. I'm leaving it here for whomever wants it (or for my future self when I inevitably forget what I did):

samtools mpileup bamfile.bam -f reference.fa -r regions -q 20 -Q 20 | awk 'BEGIN{RS="\n";OFS="\t";ORS="\t"}{split("", count); a=split($5,chars,""); print $1,$2,$3,$4; for (u=1; u<=a; u++) count[chars[u]]++i; for (u in count) {printf "%s   ",u; printf "%s   ", count[u]}; print "\n"}'

The -q filters for base phred-scores and -Q filters for mapping quality if needed. The output table is not very elegant, but it will print the composition of each base pair position and their number according to the mpileup formatting. An example:

9   133730382   T   3365    ]  2    ^  2    ,  1815    G  2    .  1548    $  1
9   133730383   T   3363    ]  3    ^  3    ,  1813    .  1550    $  2

Two positions from an example BAM file (chromosome 9, positions 133730382 and 133730383) have a T with a depth of 3365 and 3363 respectively and have 1815 reads with the T in reverse and 1548 in forward for the first for the first position. Hope someone finds this helpful.

ADD COMMENT

Login before adding your answer.

Traffic: 1540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6