3.7 years ago
K
▴
10
I would like to rearrange a BLAST -outfmt 7 output file.
Column 1 contains accession ID
Column 2 – subject start
Column 3 – subject end
Column 4 – difference between columns 2 and 3 (column 2 minus column 3)
Input file
ptg000001l 1714 4715 -3001
ptg000001l 3669 1932 1737
ptg000001l 4514 3725 789
ptg000001l 4839 5622 -783
ptg000001l 4840 5785 -945
ptg000001l 4840 5894 -1054
ptg000001l 4841 5751 -910
ptg000001l 4841 5785 -944
ptg000001l 4842 5542 -700
ptg000001l 4842 5784 -942
ptg000001l 4843 5409 -566
ptg000001l 4843 5659 -816
ptg000001l 4843 5665 -822
ptg000001l 4843 5776 -933
ptg000001l 4843 5784 -941
ptg000001l 4843 5894 -1051
ptg000001l 4843 6023 -1180
ptg000001l 4843 6333 -1490
I would like to collect only those accessions which have the same number ($2) and whose $3 is of larger length, to assemble the coordinates in a continuous manner.
output file
ptg000001l 1714 4715 -3001
ptg000001l 4839 5622 -783
ptg000001l 4843 6333 -1490
Thank you Luke
Could you please explain
collect only those accessions which have the same number ($2) and whose $3 is of larger length, to assemble the coordinates in a continuous manner.
? I was not able to understand the requirements. Sorry for that.
Column 2 ($2) has 4841 4841 4842 4842 4843 4843
and column 3 has 5751 5785 5542 5784 5409 6333,
so, for example, I would like to keep only the column 2 value whose column 3 is of larger length, i.e. 4843 6333.
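To make the rule above concrete, here is a minimal Python sketch of one literal reading of it: group rows by accession and column 2, and keep the row with the largest column 3 in each group. The function name `keep_longest_per_start` and the in-memory `rows` list are hypothetical; a real script would parse them from the BLAST table. Whether this is the intended logic is an assumption.

```python
def keep_longest_per_start(rows):
    """For each (accession, start) pair, keep the row with the largest end.

    `rows` is an iterable of (accession, start, end) tuples.
    This is one literal reading of the rule, not a confirmed requirement.
    """
    best = {}
    for acc, start, end in rows:
        key = (acc, start)
        # Keep the first row seen for a key, then replace it only by a larger end.
        if key not in best or end > best[key][2]:
            best[key] = (acc, start, end)
    # Dicts preserve insertion order (Python 3.7+), so groups come out
    # in the order their start value first appeared.
    return list(best.values())


# The six rows quoted in the comment above.
rows = [
    ("ptg000001l", 4841, 5751),
    ("ptg000001l", 4841, 5785),
    ("ptg000001l", 4842, 5542),
    ("ptg000001l", 4842, 5784),
    ("ptg000001l", 4843, 5409),
    ("ptg000001l", 4843, 6333),
]

for acc, start, end in keep_longest_per_start(rows):
    # Column 4 is recomputed as column 2 minus column 3.
    print(acc, start, end, start - end)
```

Read this way, the rule keeps one row per distinct column 2 value, so each of 4841, 4842 and 4843 contributes a row.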
Sorry, I still didn't get the logic. 5751 (column 3) is a bigger number than 4841 (column 2), and 5785 (column 3) is the bigger number for 4841 (based on column 2 grouping). Based on the description above, your output is supposed to include 4841 and 5785, and 4842 and 5784, in addition to 4843 and 6333. In the original post, the record
ptg000001l 4839 5622 -783
satisfies your requirement above, but it's not in the expected output. Based on the logic described above, output from OP data should be :not
Unless I didn't get the logic right.
Basically, if the coordinates overlap, then take the start from $2 and the end from $3. I would like to assemble the coordinates from $2 and $3 and report those which are missing.
output
ptg000001l 1714 4715 -3001
ptg000001l 4839 -- -783
ptg000001l -- 6333 -1490
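If the goal is to assemble overlapping hits into continuous ranges, that is essentially an interval merge. Below is a minimal Python sketch under a few assumptions that are not confirmed in the thread: rows where $2 is larger than $3 are treated as reverse-orientation hits and normalised so that start <= end, and "missing" is read as the uncovered stretch between two merged ranges. The names `merge_intervals` and `find_gaps` are hypothetical.

```python
def merge_intervals(rows):
    """Merge overlapping (accession, start, end) hits into continuous ranges."""
    # Assumption: start > end only marks orientation, so swap to start <= end.
    norm = sorted((acc, min(s, e), max(s, e)) for acc, s, e in rows)
    merged = []
    for acc, start, end in norm:
        last = merged[-1] if merged else None
        if last and last[0] == acc and start <= last[2]:
            # Overlaps the previous range: extend its end if needed.
            last[2] = max(last[2], end)
        else:
            merged.append([acc, start, end])
    return [tuple(m) for m in merged]


def find_gaps(merged):
    """Return the uncovered stretches between consecutive merged ranges."""
    gaps = []
    for (acc1, _, end1), (acc2, start2, _) in zip(merged, merged[1:]):
        if acc1 == acc2 and start2 > end1 + 1:
            gaps.append((acc1, end1 + 1, start2 - 1))
    return gaps


# A subset of the rows from the input file in the question.
hits = [
    ("ptg000001l", 1714, 4715),
    ("ptg000001l", 3669, 1932),
    ("ptg000001l", 4514, 3725),
    ("ptg000001l", 4839, 5622),
    ("ptg000001l", 4840, 5894),
    ("ptg000001l", 4843, 6333),
]

merged = merge_intervals(hits)
for acc, start, end in merged:
    print(acc, start, end, start - end)

for acc, start, end in find_gaps(merged):
    print("gap:", acc, start, end)
```

On these rows the merge gives 1714 to 4715 and 4839 to 6333, which lines up with the start of 4839 and the end of 6333 in the output above. Sorting first means each row only needs to be compared with the most recent merged range.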