Mean read length from bam file
3
0
Entering edit mode
23 months ago
shinyjj ▴ 50

Hi biostars, I have a question about calculating the median read length from a bam file.

samtools view GTEX-1192X-0011-R10a-SM-DO941.bam | awk '{print length($10)}' | head -1000 | sort -u

Instead of the above command line, is it possible to get a median read length from a bam file?

bam samtools • 2.5k views
ADD COMMENT
1
Entering edit mode
23 months ago
samtools view -F 2304 in.bam | awk '{print length($10);}' | datamash median 1
ADD COMMENT
0
Entering edit mode

Awesome. Can you explain what is 2304 and datamash is?

ADD REPLY
1
Entering edit mode

-F 2304 excludes unmapped reads, and datamash is a command line program that makes it easier to perform actions (like column medians) from data in tabular format.

ADD REPLY
0
Entering edit mode

Great. Is it possible to calculate median length of total reads so that I can take account for unmapped + mapped?

ADD REPLY
1
Entering edit mode

Just remove that argument and run the same command.

ADD REPLY
0
Entering edit mode

-F 2304 excludes unmapped reads

No. It excludes supplementary and secondary reads.

ADD REPLY
0
Entering edit mode

This is what I get for being lazy and thinking I remember what it means instead of looking it up. Thanks for the correction.

ADD REPLY
0
Entering edit mode

Can someone explain the difference between unmapped reads and supplementary reads?

By looking at this discussion, it looks the below command gives the median read lenght of total reads. Can you confirm this, please?

samtools view in.bam | awk '{print length($10);}' | datamash median 1
ADD REPLY
1
Entering edit mode

The referenced command will output ALL reads in the BAM file, because there is no selection being applied. Supplementary reads are described here, and you might be interested in getting familiar with BAM flags by reading the specification or playing with a decoding utility.

ADD REPLY
1
Entering edit mode
23 months ago
seidel 11k

If you have littler installed, you could try that:

samtools view  mutant_ip.bam | head -1000 | awk '{print length($10)}' | r -de 'print(summary(X[,1]))'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     76      76      76      76      76      76

(you would use the median() function in place of summary())

ADD COMMENT
0
Entering edit mode

Thanks a lot. Is this mapped or unmapped? Is it possible to calculate median length of total reads so that I can take account for unmapped + mapped?

ADD REPLY
0
Entering edit mode
19 months ago
obk • 0

I second the datamash solution.

You can avoid using the magic number $10 with samcut (link):

$ conda install -c bioconda samcut

$ samtools view mutant_ip.bam | samcut seq | awk '{print length}' | datamash median 1

(disclaimer: i wrote samcut)

ADD COMMENT

Login before adding your answer.

Traffic: 1997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6