Hi all,
I would like to know how to extract paired reads from a bam file with the "wrong" insert size. I am extracting the reads with the "right" size with this command:
samtools view G_AGTCAA.bam | gawk ' and($2,0x0002) && and($2,0x0040)' | awk '{print $1,"\t", $2, "\t", $4,"\t" $8,"\t" $9}' |sed 's/-//g' > G_insert_size.txt
This command gives me the reads with the "correct" insert size
HWIST863:138:D0WT7ACXX:5:1101:1110:56612 15090 15326 320
HWIST863:138:D0WT7ACXX:5:1101:1168:81336 5294 5061 334
HWIST863:138:D0WT7ACXX:5:1101:1169:91247 15799 15625 275
HWIST863:138:D0WT7ACXX:5:1101:1207:97805 8616 8704 189
But what I would like to get are the reads which have the FLAGS 97 or 145, like these two:
HWI-ST863:138:D0WT7ACXX:5:2105:15878:99962 145 MT1 4599 37 101M = 16312 11612 CAATCTCACTTCTATGAAATAAAACTCCAGCAATACTAACTATAATCTCACTGATATTACTATCCCTAGGAGGCCTTCCACCACTAACAGGATTCTTACCA @CDCCACAEEAFEFFFEDHHHHGHIHJIGFGCA@/HCIHJJJIIJJJIHF?FDGD?HG?IHFCJJJJJJJJJIJJIHEGAGIFGHHGJHHHHHFFFDDCC@ XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101
HWI-ST863:138:D0WT7ACXX:5:2105:15878:99962 97 MT1 16312 0 101M = 4599 -11612 TAATAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTATCCCATAAACACAAAGGTTTGGTCCTGGCCTTATAATTAATTAGAGGTAAAATTAC CCCFFFFFHHHHHIJJJJJJJJJJJJJIJJJJJJJJJJJJJJJFHIHIJIJJIJJIHIJJJIFGIIIGGIIGIHHEHHGECEDEEFEDECEDCAC>AD@C@ XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101 XA:Z:MT1,+13,101M,0;
this pair has an insert size of 11612.
I would like to know what hex strings I need to use to find this reads?
Thanks
Assa
I thought proper pair means the the insert size is within the "correct" size. But I find it very confusing to choose the right parameters for the correct FLAG. But i'll look into it. thanks