Hi. I have a folder full of .fa files, and a .gff. The gff file contains information about which loci look like they code for RNA sequences. The .fa contain the DNA sequences for a set of human chromosomes. I want to get all the sequences which code for RNA, as defined by the gff file, out of the DNA in the fasta files. I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA for example), this tells me which are more important and how I should decided between RNAs for overlapping reads in the gff.
I have been trying to code my own little program in F# that will read these files and give me each RNA read defined in the gff, and its corresponding DNA. However I am a bit confused about how it works. Do the start and end of each feature in the gff file define a character in the corresponding .fa file? Are they 1 or 0 indexed? Does it matter what strand they are ('+' or '-') for my purposes?
Ultimately my goal is to get a bunch of RNAs with their corresponding types (miRNA, lincRNA, snRNA... etc) to do some computations on.
My question is this: what is the easiest way to get it out of the data I have?
The data I am using is freely available here: http://wanglab.pcbi.upenn.edu/coral/ under the heading "Annotation packages" if anyone is interested or needs specifics.
Thank you!
Thank you! But I don't think this will break ties between overlapping reads? Also I would prefer to stick F#. I'm not looking for a pre-baked solution, I want to know the structure of fa and gff files so I can code it myself.
hmmm - your Q said nothing about "reads".... anyway, good luck with the f# ... you might want to reword you q to indicate this is required.
"I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA for example), this tells me which are more important and how I should decided between RNAs for overlapping reads in the gff"