Hello, I have the human and mouse GENCODE annotations and the genomes that go with them. I would like to create sequence and annotation files for individual genes.
For instance, if I am interested in RSAD2, I can easily find all of its annotations in the GTF file. What I want is to first extract the entire gene sequence, in the sense (forward) orientation, into a separate FASTA file (using the gene feature from GENCODE). Then I want to extract all RSAD2 annotations (UTR, exon, CDS, etc.) from the original GENCODE file, but with the coordinates modified so that they match my newly extracted FASTA file.
Is there any tool that does this? I am trying to write a bash script, but it has been a pain so far.
Adrian
Very elegant, thanks! It seems your script may neglect strand information when it comes to the annotation. Since getfasta returns the sense sequence for a gene on the minus strand, the annotation coordinates likely need to be transformed as well. In your script, regardless of strand, the coordinates are simply shifted relative to the start of the gene feature (column 4). I was thinking that for a minus-strand gene the offset should be column 5 instead (the last gene coordinate, since that is actually where the feature starts). In that case, is it correct to subtract all other coordinates from that offset, i.e. offset - $4 and offset - $5? Thanks again for the help!

P.S. To speed up grep, is it faster to first use awk to keep only lines with "gene" in the third column and then grep for the gene name? I guess if this needs to be fast, a separate file containing just the gene features could be made.
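To make the strand question concrete, here is a minimal sketch of the coordinate arithmetic in Python. It assumes GTF coordinates are 1-based and inclusive, and that the minus-strand gene sequence was extracted as its reverse complement (for example with the strand option of getfasta). The function name to_gene_coords is hypothetical, not part of any tool. Under those assumptions, subtracting from column 5 is the right idea, but the start and end swap and each needs a +1.

```python
def to_gene_coords(gene_start, gene_end, strand, feat_start, feat_end):
    """Map a feature's genomic GTF coordinates (1-based, inclusive) onto
    a gene-local sequence that starts at position 1.

    Assumes the gene sequence was extracted in the sense orientation,
    i.e. reverse-complemented when the gene is on the minus strand.
    """
    if not (gene_start <= feat_start <= feat_end <= gene_end):
        raise ValueError("feature lies outside the gene")
    if strand == "+":
        # Plain shift: the gene start (column 4) becomes position 1.
        return feat_start - gene_start + 1, feat_end - gene_start + 1
    if strand == "-":
        # Flip around the gene end (column 5): the feature's genomic end
        # becomes its local start, and its genomic start its local end.
        return gene_end - feat_end + 1, gene_end - feat_start + 1
    raise ValueError("strand must be '+' or '-'")


# Made-up gene spanning 1000-2000 (length 1001), for illustration only.
print(to_gene_coords(1000, 2000, "+", 1000, 1100))  # (1, 101)
print(to_gene_coords(1000, 2000, "-", 1900, 2000))  # (1, 101)
print(to_gene_coords(1000, 2000, "-", 1000, 1010))  # (991, 1001)
```

After this transformation every feature of a minus-strand gene sits on the forward strand of the extracted sequence, so the strand column of the rewritten annotation would presumably be set to "+" as well.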