Question

How to get human cDNA sequences together with UTR regions?

2.2 years ago

Apex92 ▴ 320

Dear all,

I have downloaded the human genome and GTF files from GENCODE. From these two files, I want to generate a FASTA file containing cDNA sequences, including the 5' and 3' UTRs, for protein-coding genes only. What is the simplest and fastest way to do this?

Would this command work? gffread -w transcripts.fa -g genome.fa transcripts.gtf

Thank you.

transcripts cDNA genome rna-seq • 1.6k views

ADD COMMENT • link updated 2.2 years ago by Istvan Albert 102k • written 2.2 years ago by Apex92 ▴ 320

cDNA is the "complementary" DNA to the mRNA transcript. mRNA transcripts include UTRs, so the cDNA sequence should too.

ADD REPLY • link 2.2 years ago by i.sudbery 21k

score 0 · Answer 1 · 2023-03-16

2.2 years ago

Istvan Albert 102k

The command you specify concatenates the exon sequences of each transcript, so as long as the exon records cover the UTRs, the output will include them.

To keep only protein-coding sequences, you might need to preprocess the GTF file so that it retains only the records tagged with gene_type "protein_coding".

In general, though, I would recommend downloading the cDNA files from the same source and filtering those with some other method.
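For the GTF preprocessing route, here is a minimal sketch of such a filter. It assumes GENCODE-style attribute names and filters at the transcript level (transcript_type) rather than the gene level. The function name `keep_protein_coding` and the toy records are hypothetical, for illustration only.

```python
import re

# Assumption: GENCODE-style attribute name. Ensembl GTFs use
# transcript_biotype instead, so adjust the pattern for those files.
TYPE_RE = re.compile(r'transcript_type "([^"]+)"')


def keep_protein_coding(lines):
    """Yield header lines and records whose transcript_type is protein_coding."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = TYPE_RE.search(fields[8])
        if match and match.group(1) == "protein_coding":
            yield line


# Toy records (hypothetical): two transcripts of the same protein-coding gene.
demo = [
    "##description: toy example\n",
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; '
    'gene_type "protein_coding"; transcript_type "protein_coding";\n',
    'chr1\tHAVANA\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; '
    'gene_type "protein_coding"; transcript_type "retained_intron";\n',
]
kept = list(keep_protein_coding(demo))
print("".join(kept), end="")
```

Gene-level records carry no transcript_type attribute, so this sketch drops them. That should not matter for tools that assemble transcripts from exon records, but it is worth checking against the tool you use.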

ADD COMMENT • link 2.2 years ago by Istvan Albert 102k

This probably wants to be transcript_type (transcript_biotype in Ensembl GTFs), not gene_type (gene_biotype), as it's possible to have non-coding transcripts of coding genes.

ADD REPLY • link 2.2 years ago by i.sudbery 21k

score 0 · Answer 2 · 2023-03-16

2.2 years ago

Apex92 ▴ 320

Thank you for your comments. In the end, I tried the approach below, which might be helpful for others as well. I would also be happy to get your feedback in case I made any mistakes.

1. I downloaded both the genome and GTF files from GENCODE.
2. I preprocessed the GTF file, keeping only protein-coding transcripts, and converted it to BED format with these columns: chr start end transcript_name type strand.
3. I used bedtools to extract the sequences from the genome file in FASTA format:
bedtools getfasta -fi genome.fa -bed gencode_protein_coding.bed -name > hsa_protein_coding_transcripts.fa

ADD COMMENT • link 2.2 years ago by Apex92 ▴ 320

As far as I know, bedtools getfasta can only concatenate exons if the input is in the 12-column BED format with block information.

If all you had was the 6-column BED you describe, how could it identify the exons that form a transcript?

I believe the method you describe will generate the unspliced transcript.
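To make the difference concrete, here is a minimal sketch contrasting the spliced and unspliced sequence of one transcript. It assumes a toy in-memory genome and 0-based, half-open BED coordinates. The function names and the example data are hypothetical.

```python
# Translation table for complementing DNA (upper and lower case, plus N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(genome, chrom, exons, strand):
    """Concatenate exon sequences in genomic order, then orient by strand.

    exons: list of (start, end) pairs in 0-based, half-open coordinates.
    """
    seq = "".join(genome[chrom][start:end] for start, end in sorted(exons))
    return reverse_complement(seq) if strand == "-" else seq


def unspliced_sequence(genome, chrom, exons, strand):
    """Return the full span from the first exon start to the last exon end."""
    start = min(s for s, _ in exons)
    end = max(e for _, e in exons)
    seq = genome[chrom][start:end]
    return reverse_complement(seq) if strand == "-" else seq


# Toy data (hypothetical): two exons separated by a four-base intron.
genome = {"chr1": "ACGTTTTTGGCA"}
exons = [(0, 4), (8, 12)]

spliced = spliced_sequence(genome, "chr1", exons, "+")
unspliced = unspliced_sequence(genome, "chr1", exons, "+")
print(spliced)    # exons only, intron removed
print(unspliced)  # exons plus the intron between them
```

With bedtools itself, the spliced result would need BED12 input together with the -split option of getfasta (and -s to respect strand). A 6-column BED with one line per transcript gives the equivalent of `unspliced_sequence` above.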

ADD REPLY • link 2.2 years ago by Istvan Albert 102k