Question

Get intronic and intergenic sequences based on gff file

3

Entering edit mode

10.2 years ago

biolab ★ 1.4k

Dear all

I have a gff3 file with header shown below. The 3rd column indicates feature (exon, intron, UTR),the 4th and 5th columns indicate start and stop position, and the last column indicates detailed feature (eg which exon). I aim to extract intergenic and intronic sequences based on this file.

Note many genes overlap.How to achieve my goal?

Chr1 MSU_osa1r7 gene 2903 10817 . + . ID=LOC_Os01g01010;Name=LOC_Os01g01010;Note=TBC%20domain%20containing%20protein%2C%20expressed
Chr1 MSU_osa1r7 mRNA 2903 10817 . + . ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Parent=LOC_Os01g01010
Chr1 MSU_osa1r7 exon 2903 3268 . + . ID=LOC_Os01g01010.1:exon_1;Parent=LOC_Os01g01010.1
... ...

GFF • 25k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by biolab ★ 1.4k

0

Entering edit mode

Hello biolab!

Questions similar to yours can already be found at:

What is GFF3 and how to extract genomic sequences from it?

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

ADD REPLY • link 10.2 years ago by Daniel Standage 4.1k

0

Entering edit mode

Questions like biolab's get asked again and again in many different versions. I think my answer below was specific to his exact questions (or at least one interpretation of it) which is not posed in your link.... and gave him specific advice on how to proceed. I certainly would not have answered as I did to that other question which was in some ways broader.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

The question I linked to answers a very similar (albeit more general) question. I'm typically in favor of pointing users to these so that useful answers don't get spread across different threads.

However, I think I misread the question when I closed it: I thought it said intronic and exonic sequences, rather than intergenic sequences. I agree that inferring the intergenic sequence adds a dimension to this question that warrants a separate thread, so I'll reopen.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Daniel Standage 4.1k

0

Entering edit mode

Hi Daniel, thank you very much for the link provided.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by biolab ★ 1.4k

GenoMax · Answer 1 · 2018-05-14

28

Entering edit mode

6.6 years ago

ATpoint 85k

Question is a few years old, but still relevant:

Intergenic is the easiest, as it is simply the complement of all features in a GTF/GFF file with the rest of the genome.

Note: This definition of intergenic is based purely on the GFF file entries. Of course there are promoters and other regulatory regions in the genome, with e.g. promoters right upstream of the first exon, but this is not annotated in a typical GFF so here, intergenic is simply the complement of everything in the GFF. If one wants to include promoters, one could define a certain window upstream of the gene start coordinate.

Requires the GFF/GTF and a BED file with the chromosome sizes. The latter, you can get based on the chromSizes files for your genome, e.g. from UCSC. It contains two columns, the first is the chromosome name, the second the number of basepairs on that chr. Make a BED file out of it:

awk 'OFS="\t" {print $1, "0", $2}' chromSizes.txt | sort -k1,1 -k2,2n > chromSizes.bed

Sort the GFF file:

cat in.gff | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' > in_sorted.gff

Get intergenic regions

bedtools complement -i in_sorted.gff -g chromSizes.txt > intergenic_sorted.bed

Next, intron is the complement of intergenic and exonic regions. Note: This definition of intron is (IMHO) sufficient, because the other features of a GFF (CDS, UTR, Start/Stop Codon) are all sub-intervals of "exon". Hence, everything that is not intergenic or exon must be intron. First, extract exonic coordinates in BED format:

awk 'OFS="\t", $1 ~ /^#/ {print $0;next} {if ($3 == "exon") print $1, $4-1, $5}' in_sorted.gff > exon_sorted.bed

Now use BEDtools complement to get the introns:

bedtools complement -i <(cat exon_sorted.bed intergenic_sorted.bed | sort -k1,1 -k2,2n) -g chromSizes.txt > intron_sorted.bed

This of course can be customized based on the information in your GFF. Again, as mentioned above, if you want intergenic to be more precise, you could add custom features such as promoters, enhancers or other regulatory elements to the GFF, and then again take the complement of this file with the rest of the genome.

Edit 08/18: just learned that there is a nice functionality in R to get introns. Check this post.

ADD COMMENT • link 3.8 years ago by ATpoint 85k

1

Entering edit mode

For the intergenic regions GFF file, I had to remove the middle '0' when creating chromSizes.bed so it's a genome file. Other than that, works like a charm.

ADD REPLY • link 4.3 years ago by elcortegano ▴ 200

0

Entering edit mode

Hi Can you help explain what code you used to remove the middle '0' - did you leave that entire column empty or remove the column entirely, or a maybe replaced the '0' with 1?

With the current instructions as is, I get this error at the intergenic bedtools step:

"chr1 has length equal to 0"

ADD REPLY • link 3.8 years ago by NBR • 0

0

Entering edit mode

Apologies, elcortegano is perfectly right and I did not saw their comment before. The genome file that bedtools expects is in fact a two-column file, the first one the chromosome and the second one the length. The chromSizes.bed from the above code is in BED format, so starts at zero (which is the start of every chromosome because BED is a zero-based format) and then to the end coordinate, which is the chromosome length. One should use chromSizes.txt not chromSizes.bed. I edited the above answer. Alternatively you could use -g <(cut -f1,3 chromSizes.bed)-

ADD REPLY • link 3.8 years ago by ATpoint 85k

0

Entering edit mode

Thanks so much!

And for the very last line of code:

bedtools complement -i <(cat exon_sorted.bed intergenic_sorted.bed | sort -k1,1 -k2,2n) -g chromSizes.bed > intron_sorted.bed

Do I similarly use ChromSizes.txt? At the moment i'm getting this error:

Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"

Even with the .txt file. I guess this problem is something else, but I can go through this whole process but get stuck at this line. edit:I think it's because the new input "exon_sorted.bed intergenic_sorted.bed is now a mixture of 3 field input (intergenic) and longer inputs(exon_sorted)

ADD REPLY • link updated 3.8 years ago by GenoMax 147k • written 3.8 years ago by NBR • 0

0

Entering edit mode

Found the same problem regarding the 'tokenizing data line" error. It looks like exon_sorted.bed was not tab-delimited but space delimited for me. Hopefully this is general and not specific to me! Changing this with tr as below allowed the line to fully execute.

awk 'OFS="\t", $1 ~ /^#/ {print $0;next} {if ($3 == "exon") print $1, $4-1, $5}' in_sorted.gff | tr [:blank:] \\t > exon_sorted.bed

ADD REPLY • link 3.8 years ago by tomdavy42 • 0

Ram · Answer 2 · 2014-09-10

Well, your problem is underspecified, since you don't say how you want to handle overlapping genes or transcripts.

However, in your toolkit you might want to have bedtools.

For example, if by "intergenic" you want all intervals that are not spanned by any gene, then you will want to use pull out just the gene entries from your gff3 (and then use bedtools complement which "returns all intervals in a genome that are not covered by at least one interval in the input BED/GFF/VCF file." (http://bedtools.readthedocs.org/en/latest/content/tools/complement.html)

You can then get the fasta for the resulting gff using bedtools getfasta which" extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file" (http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html)

Similarly, if you require that intronic regions must be introns in each and every transcript which spans it, you can extract your 'gene' and 'intron' features into separate gene.gff and intron.gff files, respectively, then use bedtools merge on each to create genic.gff and intronic.gff, and then use bedtools subtract to subtract the genic regions from the intronic regions.

If your definition of intronic and intergenic are otherwise, you'll probably find a way to compute them using bedtools.