Hi, I downloaded the GTF file from the GENCODE website and now I want to create a GFF file containing all 5'UTRs, which will then be used in HTSeq.
I have the following problems writing my code on the command line:
- How do I obtain the 5'UTRs of each transcript? How do I deal with the + and - strands? I know that the 5'UTRs are those at the 5' end of the transcript.
- In several cases there are more than 2 UTRs per transcript. What should I do with them?
This is my first draft of the command for the final GFF file; so far it keeps all UTRs:
awk 'BEGIN{OFS="\t"} $3=="UTR"{print $1,$2,$3,$4,$5,".",$7,$10,$12}' Geneannotation_all.gtf | sed 's/";//g; s/"//g' > Geneannotation_all.UTR.gff
From the htseq-count FAQ
To get the UTRs, I would use grep.
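A rough sketch of how the strand question could be handled, since a filter on `UTR` alone returns 5' and 3' UTRs together. It assumes the GTF is tab-separated, labels every UTR row simply as `UTR` (as in the command in the question), and has `CDS` rows and a `transcript_id` attribute for each coding transcript. On the + strand a 5' UTR ends before the first CDS base; on the - strand it starts after the last CDS base. The function name `five_prime_utrs` is made up for illustration.

```shell
# Hypothetical helper: print only the 5' UTR rows of a GTF whose UTRs are all labelled "UTR".
# Assumption: tab-separated GTF with CDS rows and a transcript_id attribute in column 9.
five_prime_utrs() {
  awk -F'\t' '
    # Pull the transcript_id value out of the attribute column.
    function tid(attr) {
      if (match(attr, /transcript_id "[^"]+"/))
        return substr(attr, RSTART + 15, RLENGTH - 16)
      return ""
    }
    /^#/ { next }
    # Pass 1: record the CDS span (lowest start, highest end) of every transcript.
    NR == FNR {
      if ($3 == "CDS") {
        t = tid($9)
        if (!(t in cds_start) || $4 + 0 < cds_start[t]) cds_start[t] = $4 + 0
        if (!(t in cds_end)   || $5 + 0 > cds_end[t])   cds_end[t]   = $5 + 0
      }
      next
    }
    # Pass 2: keep UTRs that lie upstream of the CDS in transcript orientation.
    $3 == "UTR" {
      t = tid($9)
      if (!(t in cds_start)) next           # no CDS seen, cannot classify
      if ($7 == "+" && $5 + 0 < cds_start[t]) print
      if ($7 == "-" && $4 + 0 > cds_end[t])   print
    }
  ' "$1" "$1"
}
```

It reads the file twice, so it would be called with a file name rather than a pipe, for example `five_prime_utrs Geneannotation_all.gtf > Geneannotation_all.5UTR.gtf` (the output name is arbitrary). Non-coding transcripts have no CDS and are skipped.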
That probably depends on your biological research question. You could consider merging them.
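One possible reading of "merging" is to collapse overlapping UTR intervals that belong to the same gene or transcript into single blocks. A minimal sketch of that, assuming a tab-separated, BED-style input with the columns chrom, start, end, name and strand; the function name `merge_by_name` and the column layout are illustrative, not something htseq or bedtools prescribes.

```shell
# Hypothetical helper: merge overlapping or book-ended intervals that share a name.
# Input columns (tab-separated): chrom, start, end, name, strand.
merge_by_name() {
  sort -t "$(printf '\t')" -k4,4 -k1,1 -k2,2n "$1" |
  awk -F'\t' 'BEGIN { OFS = "\t" }
    # Close the current block when the name or chromosome changes, or a gap appears.
    NR > 1 && ($4 != name || $1 != chrom || $2 + 0 > end) {
      print chrom, start, end, name, strand
      start = $2 + 0; end = $3 + 0
    }
    NR == 1 { start = $2 + 0; end = $3 + 0 }
    # Remember the current row and extend the block if this interval reaches further.
    { chrom = $1; name = $4; strand = $5; if ($3 + 0 > end) end = $3 + 0 }
    END { if (NR > 0) print chrom, start, end, name, strand }'
}
```

Whether merging is appropriate at all still depends on the question being asked; UTR segments that sit on separate exons stay separate here because there is a gap between them.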
Is it possible to download a GTF file containing information about the 5' UTRs per transcript? How can I add the information? (I can only find information on the exons, cdsStart and cdsEnd.)
Sorry, add which information to what?
Sorry. I want to download a GTF file and a BED file containing the location of the 5'UTR of each transcript, so that I can use them in htseq and bedtools for further analysis. I already have an alignment.
GTF/GFF files (which can be converted to bed) are available from Ensembl and contain UTR information. You could filter the file using grep.
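As a sketch of the filter-and-convert step, assuming an Ensembl-style GTF in which 5' UTR rows carry the feature type `five_prime_utr` in column 3 and a `transcript_id` attribute in column 9. The function name `gtf_utr5_to_bed` is made up for illustration.

```shell
# Hypothetical helper: turn the 5' UTR rows of an Ensembl-style GTF into BED6.
# Assumption: column 3 is "five_prime_utr" for those rows.
gtf_utr5_to_bed() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    $3 == "five_prime_utr" {
      name = "."
      # Use the transcript_id as the BED name when it is present.
      if (match($9, /transcript_id "[^"]+"/))
        name = substr($9, RSTART + 15, RLENGTH - 16)
      # GTF is 1-based inclusive; BED is 0-based half-open, so only the start shifts.
      print $1, $4 - 1, $5, name, 0, $7
    }' "$1"
}
```

The filtered GTF rows themselves (without the conversion) could go to htseq, and the BED output to bedtools.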