Question

How to extract promoter sequences from a rice transcriptome.gtf file?

3.1 years ago

isha.lily20 ▴ 10

Hello researchers,

I am stuck in my project and need an effective solution.

How do I extract promoter sequences from a rice transcriptome.gtf file?
How do I extract 2 kb promoter sequences from a rice transcriptome.gtf file?
How do I extract the 2 kb downstream sequences from a rice transcriptome.gtf file?

Thank you

gtf • 2.5k views

updated 3 months ago by Ram 44k • written 3.1 years ago by isha.lily20 ▴ 10

Do you have the chromosome/scaffold lengths for rice? Also, please post the lines for which you need upstream and downstream elements. You will need the length of each chromosome/scaffold, the genome sequence, and bedtools. Use the flank and getfasta functions from bedtools.
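
The flank/getfasta route suggested here could look roughly like the sketch below. It rests on several assumptions that are not from the thread: the helper names gtf_to_bed6 and upstream_fasta and the intermediate file names are made up for illustration, samtools faidx is used to obtain the chromosome lengths, and the GTF is assumed to carry "transcript" records in column 3.

```shell
# Convert GTF "transcript" records (1-based, inclusive) to BED6 (0-based, half-open).
# Reads GTF on stdin, writes BED6 on stdout; the name column is the transcript_id.
gtf_to_bed6() {
  awk 'BEGIN { FS = OFS = "\t" }
       $0 !~ /^#/ && $3 == "transcript" {
         name = "."
         if (match($9, /transcript_id "[^"]+"/))
           name = substr($9, RSTART + 15, RLENGTH - 16)
         print $1, $4 - 1, $5, name, ".", $7
       }'
}

# Hypothetical wrapper: upstream_fasta <gtf> <genome.fa> [length] [output prefix]
upstream_fasta() {
  gtf=$1
  fasta=$2
  len=${3:-2000}
  prefix=${4:-upstream}
  samtools faidx "$fasta"                      # writes "$fasta.fai"
  cut -f1,2 "$fasta.fai" > "$prefix.sizes"     # chromosome <tab> length
  gtf_to_bed6 < "$gtf" > "$prefix.transcripts.bed"
  # With -s, -l is interpreted relative to the strand, so it means "upstream".
  bedtools flank -i "$prefix.transcripts.bed" -g "$prefix.sizes" \
    -l "$len" -r 0 -s > "$prefix.flank.bed"
  # -s reverse-complements minus-strand intervals; -name uses the BED name column.
  bedtools getfasta -fi "$fasta" -bed "$prefix.flank.bed" -s -name -fo "$prefix.fa"
}
```

Under these assumptions, something like `upstream_fasta transcriptome.gtf genome.fa 2000 promoters` would write promoters.fa. Swapping to `-l 0 -r "$len"` in the flank call would give the downstream side instead.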

3.1 years ago by cpad0112 21k

Basically, a promoter is the region upstream of the TSS, and the TSS is the annotated start of each transcript. Hence, get the start coordinate of each transcript (on the - strand this is the "end" coordinate), and then take 500 bp upstream, which is a common default for approximating a promoter. Then use the tools mentioned above to get the FASTA sequences.
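
As a minimal sketch of that strand logic, the awk snippet below turns GTF transcript records into a BED file of upstream windows. The function name promoter_bed is hypothetical, and the sketch assumes "transcript" records in column 3 and a transcript_id attribute. It clamps windows at the chromosome start but does not know the chromosome lengths, so windows near a chromosome end on the - strand would still need clipping.

```shell
# usage: promoter_bed [upstream_bp] < transcriptome.gtf > promoters.bed
# Writes BED6 (0-based, half-open) windows directly upstream of each TSS.
promoter_bed() {
  awk -v up="${1:-500}" 'BEGIN { FS = OFS = "\t" }
    $0 !~ /^#/ && $3 == "transcript" {
      if ($7 == "+") {
        # TSS is the GTF start; the window ends just before it.
        start = $4 - 1 - up; end = $4 - 1
      } else if ($7 == "-") {
        # TSS is the GTF end; upstream lies at higher coordinates.
        start = $5; end = $5 + up
      } else next
      if (start < 0) start = 0            # clamp at the chromosome start
      name = "."
      if (match($9, /transcript_id "[^"]+"/))
        name = substr($9, RSTART + 15, RLENGTH - 16)
      if (end > start) print $1, start, end, name, ".", $7
    }'
}
```

For example, `promoter_bed 2000 < transcriptome.gtf > promoters_2kb.bed` would give 2 kb windows; the resulting BED could then be passed to bedtools getfasta with -s so that - strand sequences are reverse-complemented.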

3.0 years ago by ATpoint 85k

score 1 · Answer 1 · 2021-11-30

Hi,

You may try the command-line interface (gtftk) of the Python GTF toolkit (pygtftk). Although it may be slower, it offers additional arguments to transfer transcript information into the 4th column.

gtftk get_example | gtftk select_by_key -k feature -v transcript | gtftk get_5p_3p_coords -n gene_id,transcript_id -m promoter -s '|'

Best

Disclosure: I'm the pygtftk developer.

score 1 · Answer 2 · 2021-11-30

You might find useful information here: https://github.com/NBISweden/AGAT/issues/89 and in the post "Extracting genomic feature sequences from GTF/GFF files with AGAT".

To get the 2 kb upstream region from the TSS with AGAT:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -t transcript --eo --up "2000"

To get the 2 kb downstream region with AGAT:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -t transcript --eo --down "2000"

*Replace transcript with mRNA, depending on how the feature is named in the 3rd column of your file.
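
To check which name the 3rd column uses before choosing the -t value, a quick tally of feature types can help. This is a small sketch with a hypothetical helper name (feature_types); it only counts tab-separated records with at least nine columns and skips comment lines.

```shell
# usage: feature_types input.gff
# Prints each feature type found in column 3 with its count, sorted by name.
feature_types() {
  awk -F'\t' '$0 !~ /^#/ && NF >= 9 { n[$3]++ }
              END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

If the output lists mRNA but no transcript, pass -t mRNA to the AGAT commands above.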