Question is a few years old, but still relevant:
Intergenic is the easiest, as it is simply the complement of all features in a GTF/GFF file with the rest of the genome.
Note: This definition of intergenic is based purely on the GFF file entries. Of course there are promoters and other regulatory regions in the genome, with e.g. promoters right upstream of the first exon, but this is not annotated in a typical GFF so here, intergenic is simply the complement of everything in the GFF. If one wants to include promoters, one could define a certain window upstream of the gene start coordinate.
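If you want to go the promoter route, here is a minimal sketch of how such windows could be derived with awk. It assumes the GFF has "gene" entries in column 3 and the strand in column 7; the function name promoter_windows and the default window of 1000 bp are my own arbitrary choices, not anything standard.

```shell
# Hypothetical helper: print promoter windows as BED (0-based, half-open).
# Assumes "gene" features in column 3 and the strand in column 7 of the GFF.
# $1 = GFF file, $2 = window size in bp (defaults to 1000, an arbitrary choice)
promoter_windows() {
    awk -v w="${2:-1000}" 'BEGIN { FS = OFS = "\t" }
        $1 !~ /^#/ && $3 == "gene" {
            if ($7 == "-") {            # minus strand: window lies after the gene end
                start = $5
                end   = $5 + w
            } else {                    # plus strand: window lies before the gene start
                start = $4 - 1 - w
                end   = $4 - 1
            }
            if (start < 0) start = 0    # clip at the chromosome start
            if (end > start) print $1, start, end
        }' "$1"
}
```

Something like promoter_windows in_sorted.gff 2000 > promoter.bed would then give intervals that can be treated as annotated features. Note that the sketch does not clip windows that run past the chromosome end.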
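This requires the GFF/GTF and a file with the chromosome sizes. You can get the latter from the chromSizes files for your genome, e.g. from UCSC. It contains two tab-separated columns: the first is the chromosome name, the second the number of basepairs on that chromosome. bedtools complement takes this two-column file directly via -g, but it must list the chromosomes in the same order as the sorted GFF, so sort it by name (sort -k1,1) if needed. If you also want the chromosome sizes as BED intervals, make a BED file out of it: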
awk 'BEGIN {OFS="\t"} {print $1, 0, $2}' chromSizes.txt | sort -k1,1 -k2,2n > chromSizes.bed
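If no chromSizes file is available for your genome, the sizes can also be computed from the genome FASTA itself. This is only a sketch: the function name fasta_chrom_sizes is made up, and it assumes the sequence name is the header text up to the first whitespace.

```shell
# Hypothetical helper: print "name<TAB>length" for every sequence in a FASTA file.
# The sequence name is taken as the header text up to the first whitespace.
fasta_chrom_sizes() {
    awk 'BEGIN { OFS = "\t" }
        /^>/ {
            if (name != "") print name, len   # emit the previous sequence
            name = substr($1, 2)              # drop the leading ">"
            len  = 0
            next
        }
        { len += length($0) }                 # sequence lines add to the running length
        END { if (name != "") print name, len }' "$1"
}
```

For example, fasta_chrom_sizes genome.fa | sort -k1,1 > chromSizes.txt (genome.fa being a placeholder name) produces a two-column file sorted by chromosome name.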
Sort the GFF file:
cat in.gff | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' > in_sorted.gff
Get the intergenic regions:
bedtools complement -i in_sorted.gff -g chromSizes.txt > intergenic_sorted.bed
Next, intron is the complement of intergenic and exonic regions. Note: This definition of intron is (IMHO) sufficient, because the other features of a GFF (CDS, UTR, Start/Stop Codon) are all sub-intervals of "exon". Hence, everything that is not intergenic or exon must be intron.
First, extract the exonic coordinates in BED format (BED starts are 0-based, so the GFF start is shifted by one; comment lines are skipped):
awk 'BEGIN {OFS="\t"} $1 !~ /^#/ && $3 == "exon" {print $1, $4-1, $5}' in_sorted.gff > exon_sorted.bed
Now use BEDtools complement to get the introns:
bedtools complement -i <(cat exon_sorted.bed intergenic_sorted.bed | sort -k1,1 -k2,2n) -g chromSizes.txt > intron_sorted.bed
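As a sanity check, the merged exonic bases plus the intronic and intergenic bases should add up to the genome size, as long as every chromosome in chromSizes.txt shows up in the outputs. A rough awk sketch of that check follows; covered_bases and check_partition are hypothetical helper names of mine, and overlapping exons are merged before counting so that they are not counted twice.

```shell
# Hypothetical helper: count the bases covered by a BED file, merging overlaps first.
covered_bases() {
    sort -k1,1 -k2,2n "$1" | awk 'BEGIN { FS = "\t" }
        $1 != chr || $2 + 0 > end {      # new chromosome or a gap: close the open block
            total += end - start
            chr = $1; start = $2 + 0; end = $3 + 0
            next
        }
        $3 + 0 > end { end = $3 + 0 }    # overlapping or adjacent: extend the open block
        END { total += end - start; printf "%.0f\n", total }'
}

# Hypothetical helper: compare the genome size with the summed parts.
# $1 = chromosome sizes file, remaining arguments = BED files that should partition the genome
check_partition() {
    sizes=$1; shift
    genome=$(awk '{ s += $2 } END { printf "%.0f\n", s }' "$sizes")
    parts=0
    for bed in "$@"; do
        parts=$(( parts + $(covered_bases "$bed") ))
    done
    echo "genome=$genome parts=$parts"
    [ "$genome" -eq "$parts" ]           # exit status 0 only if the totals agree
}
```

Usage would be check_partition chromSizes.txt exon_sorted.bed intron_sorted.bed intergenic_sorted.bed; a mismatch usually points to inconsistent chromosome names or sort order between the files.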
This of course can be customized based on the information in your GFF. Again, as mentioned above, if you want intergenic to be more precise, you could add custom features such as promoters, enhancers or other regulatory elements to the GFF, and then again take the complement of this file with the rest of the genome.
Edit 08/18: just learned that there is a nice functionality in R to get introns. Check this post.
Hello biolab!
Questions similar to yours can already be found at:
We have closed your question to allow us to keep similar content in the same thread.
If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.
Cheers!
Questions like biolab's get asked again and again in many different versions. I think my answer below was specific to his exact question (or at least one interpretation of it), which is not posed in your link, and gave him specific advice on how to proceed. I certainly would not have answered as I did to that other question, which was in some ways broader.
The question I linked to answers a very similar (albeit more general) question. I'm typically in favor of pointing users to these so that useful answers don't get spread across different threads.
However, I think I misread the question when I closed it: I thought it said intronic and exonic sequences, rather than intergenic sequences. I agree that inferring the intergenic sequence adds a dimension to this question that warrants a separate thread, so I'll reopen.
Hi Daniel, thank you very much for the link provided.