I have downloaded the human genome and gtf files from Gencode. Based on these two files I want to generate a fasta file that has cDNA sequences including 5' and 3' UTRs for protein-coding genes only. What is the simplest and fastest way to do this?
Would this command work?
gffread -w transcripts.fa -g genome.fa transcripts.gtf
Thank you for your comments. In the end, I tried the approach below, which might be helpful for others as well. I would also be happy to get your feedback in case I made any mistakes.
I downloaded both genome and gtf files from Gencode.
Preprocessed the gtf file and converted it to a bed format in the structure below (keeping only protein-coding transcripts):
chr start end transcript_name type strand.
Used bedtools to extract the sequences in fasta format from the genome file: bedtools getfasta -fi genome.fa -bed gencode_protein_coding.bed -name > hsa_protein_coding_transcripts.fa
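For anyone repeating the preprocessing step, here is a minimal Python sketch of the GTF-to-BED conversion, under a few assumptions that are mine and not part of the steps above. It assumes the Gencode attribute keys transcript_type and transcript_id, keeps exon features rather than one interval per transcript (a whole-transcript interval would also pull in introns), and uses the transcript ID as the name column. The function name gtf_exons_to_bed is hypothetical. GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is shifted by one.

```python
import re

def gtf_exons_to_bed(gtf_lines, transcript_type="protein_coding"):
    """Yield BED lines (chr, start, end, transcript_id, type, strand)
    for exon features of transcripts with the requested transcript_type.

    Assumes Gencode-style attributes, e.g. transcript_type "protein_coding".
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # keep exon rows only (UTRs are part of exons)
        chrom, start, end, strand, attrs = (
            fields[0], fields[3], fields[4], fields[6], fields[8]
        )
        ttype = re.search(r'transcript_type "([^"]+)"', attrs)
        tid = re.search(r'transcript_id "([^"]+)"', attrs)
        if not ttype or not tid or ttype.group(1) != transcript_type:
            continue
        # GTF is 1-based inclusive; BED is 0-based half-open.
        bed_start = str(int(start) - 1)
        yield "\t".join([chrom, bed_start, end, tid.group(1), "exon", strand])
```

The resulting lines can be written to a BED file and passed to bedtools getfasta. Two things are worth checking: without the -s option, minus-strand features are not reverse-complemented, and exon-level sequences still have to be concatenated per transcript in transcript order to get one cDNA record each. gffread -w does the exon splicing directly, but it does not filter for protein-coding transcripts on its own.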
cDNA is the "complementary" DNA to the mRNA transcript. mRNA transcripts include UTRs, so the cDNA sequence should too.