I am trying to extract all the values for a specific quality filter, MQRankSum. Someone has shared a sed script showing how they did it.
Here is one row of my .txt file; the values are all located in column 8:
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL 0/1:246,99:345:99:2909,0
I am trying to extract only the values of MQRankSum. The sed script provided online was:
cut -f 8 | \
sed 's/^.*;MQRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > MQRankSum.txt
When I used that sed command, I extracted most of the MQRankSum values, but I was also left with rows of text that had no MQRankSum entry:
0.000
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915
-0.254
0.377
1.943
AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=13.87;SOR=2.303
-1.926
-14.951
-4.042
-7.347
-9.536
-3.781
0.637
I tried to debug the sed script but I am having trouble. I want to graph the MQRankSum values but cannot with the additional text values. What is missing from the sed script that will allow only numbers to pass through to the final .txt file?
with sed:
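One possible sed form, as a sketch rather than a definitive fix. By default sed prints every line, whether or not the substitution matched, which is why rows without MQRankSum pass through unchanged. The `-n` option suppresses that automatic printing, and the `p` flag on `s///` prints only lines where the substitution succeeded. The pattern here captures everything up to the next semicolon instead of spelling out the number format. The function name `mqranksum_sed` and the two sample rows (column 8 only) are illustrative.

```shell
# Sample INFO fields (column 8 only); the second row has no MQRankSum.
info='AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915'

# Hypothetical helper: reads INFO fields on stdin, prints MQRankSum values.
mqranksum_sed() {
  # -n turns off automatic printing; the trailing p prints a line only
  # when the substitution actually matched.
  sed -n 's/^.*;MQRankSum=\([^;]*\).*$/\1/p'
}

printf '%s\n' "$info" | mqranksum_sed
```

On the real file this would sit after `cut -f 8`, exactly where the original sed command was.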
with awk:
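A minimal awk sketch, assuming the value runs up to the next semicolon. It uses the standard `match()` function, which sets `RSTART` and `RLENGTH`, so lines without MQRankSum produce no output at all. The function name `mqranksum_awk` and the sample rows are illustrative.

```shell
# Sample INFO fields (column 8 only); the second row has no MQRankSum.
info='AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915'

# Hypothetical helper: reads lines on stdin, prints MQRankSum values.
mqranksum_awk() {
  # "MQRankSum=" is 10 characters long, so the value starts at RSTART + 10
  # and is RLENGTH - 10 characters long.
  awk 'match($0, /MQRankSum=[^;]*/) { print substr($0, RSTART + 10, RLENGTH - 10) }'
}

printf '%s\n' "$info" | mqranksum_awk
```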
with cut:
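A sketch of one cut-based pipeline, not necessarily the form originally posted. cut alone selects fields by position, and the position of MQRankSum changes when other annotations are missing, so this version first puts each `key=value` pair on its own line with `tr`, keeps the MQRankSum lines with `grep`, and uses `cut` only to take the part after the `=`. The function name `mqranksum_cut` and the sample rows are illustrative.

```shell
# Sample INFO fields (column 8 only); the second row has no MQRankSum.
info='AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915'

# Hypothetical helper: reads INFO fields on stdin, prints MQRankSum values.
mqranksum_cut() {
  tr ';' '\n' |            # one key=value pair per line
    grep '^MQRankSum=' |   # keep only the MQRankSum pairs
    cut -d= -f2            # print the value after the "="
}

printf '%s\n' "$info" | mqranksum_cut
```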
with grep (MQRankSum is always followed by QD):
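A sketch of a grep-based pipeline that relies on the stated assumption that MQRankSum is immediately followed by QD. `grep -o` prints only the matched part of each line, so rows without MQRankSum contribute nothing, and two `cut` calls then strip the surrounding text. The function name `mqranksum_grep` and the sample rows are illustrative.

```shell
# Sample INFO fields (column 8 only); the second row has no MQRankSum.
info='AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915'

# Hypothetical helper: reads lines on stdin, prints MQRankSum values.
mqranksum_grep() {
  grep -o 'MQRankSum=[^;]*;QD=' |  # e.g. "MQRankSum=-7.801;QD="
    cut -d';' -f1 |                # e.g. "MQRankSum=-7.801"
    cut -d= -f2                    # e.g. "-7.801"
}

printf '%s\n' "$info" | mqranksum_grep
```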
I tried to extract values for ReadPosRankSum using the same sed command:
Here are a few lines of the .txt file showing that it extracted some but not all of the values, even though ReadPosRankSum is present:
You're overcomplicating this. Your values are all delimited by semicolons, so you should split on them.
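A minimal sketch of that idea in awk, assuming the input is the INFO column already extracted with `cut -f 8`. It splits each line on semicolons, then compares the text before the `=` against the requested key, so the same command works for MQRankSum and ReadPosRankSum and makes no assumption about the number format. The function name `get_info_value` and the sample rows are illustrative.

```shell
# Sample INFO fields (column 8 only); the second row has no MQRankSum.
info='AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915'

# Hypothetical helper: get_info_value KEY
# Reads INFO fields on stdin and prints the value of KEY wherever it occurs.
get_info_value() {
  awk -F';' -v key="$1" '{
    for (i = 1; i <= NF; i++) {
      n = index($i, "=")                    # position of the first "="
      if (n > 0 && substr($i, 1, n - 1) == key)
        print substr($i, n + 1)             # everything after the "="
    }
  }'
}

printf '%s\n' "$info" | get_info_value MQRankSum
printf '%s\n' "$info" | get_info_value ReadPosRankSum
```

Because the key is compared exactly, asking for `MQ` does not also pick up `MQRankSum`.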