I have two tab-delimited files, File1 contains the gene coordinates and gene ids whereas File2 contains genome coordinates obtained via the output of ASTALAVISTA.
Structure of File1:
1 3631 5899 + AT1G01010
1 5928 8737 - AT1G01020
1 11649 13714 - AT1G01030
1 23146 31227 + AT1G01040
Structure of File2:
1 4276 4605 +
1 4486 5095 +
1 4276 5095 +
1 23138 24451 +
1 24451 24655 +
I have extracted the first four columns of file1 using the command (cut -f-4 File1 > File1_new) so that the structure of both files looks same:
1 3631 5899 +
1 5928 8737 -
1 11649 13714 -
1 23146 31227 +
Now I am using bedtools intersect to find File2 coordinates that are present inside the coordinates of File1 with gene ids from File1 using commands:
grep -w -f <(bedtools intersect -a File1_new -b File2 -wa | cut -f1,2,3,4) File1 > file_1.intersect
output:
1 3631 5899 + AT1G01010
1 5928 8737 - AT1G01020
1 11649 13714 - AT1G01030
1 23146 31227 + AT1G01040
grep -w -f <(bedtools intersect -a File1_new -b File2) File2 > file_2.intersect
output:
1 4276 4605 +
1 4486 5095 +
1 4276 5095 +
1 24451 24655 +
As a result, I am getting the coordinates (as you can see the coordinate ( 1 23138 24451 +) was not present within the range of File1 so it got skipped) but missing the gene ids, is it possible to get the gene ids also so that output looks like:
1 4276 4605 + AT1G01010
1 4486 5095 + AT1G01010
1 4276 5095 + AT1G01010
1 24451 24655 + AT1G01040
Secondly is it also possible to get the gene ids and distance of the closest gene for coordinates ( 1 23138 24451 +) that got skipped to get output like:
1 4276 4605 + AT1G01010 within
1 4486 5095 + AT1G01010 within
1 4276 5095 + AT1G01010 within
1 23138 24451 + AT1G01040 -8bp
1 24451 24655 + AT1G01040 within
Any help will be highly appreciated.
Getting the genes within those ranges would be reasonably easy.
Getting the nearest gene and outputting, say, -8bp away, would be tricker but not impossible. When you say it's x bp away though, how would you measure this? From the start of the gene? From the nearest bp within the CDS?
What if your coordinates are within a gene but don't fully cover it? What if the coordinates half cover a gene?
Many thanks for your response, I have checked the reason for ids that don't match. I have estimated the AS events from assemblies generated by stringtie. The coordinates in stringtie vary from TAIR10 coordinates, so I need to extract the gene coordinates from stringtie assembly and then have to assign gene ids based on those coordinates.
Then please answer my questions...