I'm trying to find the average fragment size for reads in my .bam file and want to also exclude read pairs with a fragment size of greater than 1kb (Note: not read length, but fragment size for the pair.) I put together a command line script but the output is all wrong; this is what i tried:
enter code here
$ samtools view -h SRR1611183.bam chr22 | awk 'length($10) > 1000 || $1 ~ /^@/' | samtools view -h - > bar2.bam
This basically gives me that same output as the idxstat command:
samtools idxstats SRR1611183.bam chr22
These are good for finding the sequence length but I'm seeking the fragment size.
Finally, I found a command line script that looked like a winner, but the output is all zeros?
samtools view -h SRR1611183.bam chr22 | awk -F'\t' '{if((/^@/) || (length($10)>1000)){print $0}}' | samtools stats | grep '^SN' | cut -f 2-
And here is the output (something is off):
raw total sequences: 0
filtered sequences: 0
sequences: 0
is sorted: 1
1st fragments: 0
last fragments: 0
reads mapped: 0
reads mapped and paired: 0 # paired-end technology bit set + both mates mapped
reads unmapped: 0
reads properly paired: 0 # proper-pair bit set
reads paired: 0 # paired-end technology bit set
reads duplicated: 0 # PCR or optical duplicate bit set
reads MQ0: 0 # mapped and MQ=0
reads QC failed: 0
non-primary alignments: 0
total length: 0 # ignores clipping
bases mapped: 0 # ignores clipping
bases mapped (cigar): 0 # more accurate
bases trimmed: 0
bases duplicated: 0
mismatches: 0 # from NM fields
error rate: 0.000000e+00 # mismatches / bases mapped (cigar)
average length: 0
maximum length: 30
average quality: 0.0
insert size average: 0.0
insert size standard deviation: 0.0
inward oriented pairs: 0
outward oriented pairs: 0
pairs with other orientation: 0
pairs on different chromosomes: 0
Anyone see where I'm off?
If you do a
length
on field$10
, you're measuring the read length, which will never be longer than 1000 unless you're working with long read technologies and therefore will never show up in your output. If you want fragment size, you need to use field$9
or calculate it yourself from the read position information, but that means you'd need to process the two lines of each pair together somehow (e.g. by sorting by read name and then comparing names when calculating the fragment size).