How to get human cDNA sequences together with UTR regions?
21 months ago
Apex92 ▴ 320

Dear all,

I have downloaded the human genome and gtf files from Gencode. Based on these two files I want to generate a fasta file that has cDNA sequences including 5' and 3' UTRs for protein-coding genes only. What is the simplest and fastest way to do this?

Would this script work? gffread -w transcripts.fa -g genome.fa transcripts.gtf

Thank you.

transcripts cDNA genome rna-seq • 1.3k views

cDNA is the "complementary" DNA to the mRNA transcript. mRNA transcripts include UTRs, so the cDNA sequence should too.

21 months ago

The command you specify will concatenate the exon sequences for each transcript, so as long as the exons contain the UTRs, you will get those too.

To keep only exons of protein-coding genes, you might need to preprocess the GTF file and keep only the records that carry the gene_type "protein_coding" tag.

In general, though, I would recommend downloading the cDNA files from the same source and filtering those with some other method.


This probably wants to be transcript_biotype, not gene_biotype, as it's possible to have non-coding transcripts of coding genes. (In GENCODE GTF files the corresponding attributes are named transcript_type and gene_type.)
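A minimal sketch of the transcript-level filtering discussed here, assuming a GENCODE-style GTF in which each transcript and exon row carries a quoted transcript_type attribute. The function names and the toy records are hypothetical, made up for illustration only. For an Ensembl-style GTF, key="transcript_biotype" could be passed instead.

```python
import re

# Matches quoted GTF attributes such as: transcript_type "protein_coding";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_attributes(field):
    """Return the quoted attributes of GTF column 9 as a dict (first value wins)."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, value)
    return attrs

def filter_protein_coding(lines, key="transcript_type"):
    """Yield header lines plus rows whose transcript is protein-coding.

    Rows without the attribute (for example gene-level rows) are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip malformed rows
        if parse_attributes(cols[8]).get(key) == "protein_coding":
            yield line

# Toy input (hypothetical records): a coding gene with one coding and one
# non-coding transcript.
gtf = [
    '##description: toy example\n',
    'chr1\tHAVANA\texon\t11\t20\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; '
    'gene_type "protein_coding"; transcript_type "protein_coding";\n',
    'chr1\tHAVANA\texon\t11\t20\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; '
    'gene_type "protein_coding"; transcript_type "retained_intron";\n',
]
kept = list(filter_protein_coding(gtf))
```

Here T2 is dropped even though its gene is protein-coding, which is the case a gene-level filter would miss.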

21 months ago
Apex92 ▴ 320

Thank you for your comments. In the end, I tried the approach below, which might be helpful for others as well. I would also be happy to get your feedback in case I made any mistakes.

  1. I downloaded both genome and gtf files from Gencode.

  2. Preprocessed the gtf file and converted it to a bed format in the structure below (keeping only protein-coding transcripts): chr start end transcript_name type strand.

  3. Used bedtools to extract the sequences in fasta format from the genome file:
    bedtools getfasta -fi genome.fa -bed gencode_protein_coding.bed -name > hsa_protein_coding_transcripts.fa
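For the conversion in step 2, a rough sketch of turning one GTF row into the six-column layout described above (chr, start, end, transcript_name, type, strand). The function name and the toy row are hypothetical. The one detail worth getting right is the coordinate shift: GTF is 1-based and inclusive, while BED is 0-based and half-open, so only the start is decremented.

```python
import re

def gtf_row_to_bed6(line):
    """Convert one GTF feature row to: chr, start, end, transcript_name, type, strand."""
    cols = line.rstrip("\n").split("\t")
    chrom, feature, strand = cols[0], cols[2], cols[6]
    start, end = int(cols[3]), int(cols[4])
    match = re.search(r'transcript_name "([^"]*)"', cols[8])
    name = match.group(1) if match else "."
    # GTF start is 1-based inclusive; BED start is 0-based, so subtract 1.
    # The end coordinate is unchanged because BED ends are exclusive.
    return "\t".join([chrom, str(start - 1), str(end), name, feature, strand])

# Hypothetical transcript row for illustration.
row = ('chr1\tHAVANA\ttranscript\t11\t20\t.\t-\t.\t'
       'gene_id "G1"; transcript_id "T1"; transcript_name "GENE1-201";\n')
bed_line = gtf_row_to_bed6(row)
```

Note that bedtools getfasta reports the forward-strand sequence unless -s is given, so minus-strand records like this one would need that option to come out reverse-complemented.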


As far as I know, bedtools getfasta can only concatenate exons if the input is in the 12-column BED format with block information.

If all you had was the 6-column BED as you describe it, how could it identify the exons that form a transcript?

I believe the method you describe will generate the unspliced transcript, introns included.
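To illustrate the difference, here is a toy sketch (not how bedtools or gffread are implemented) that contrasts extracting the whole transcript span with concatenating the exons. The genome string, the coordinates, and the function names are all made up for the example. Intervals are 0-based and half-open, as in BED.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def unspliced(genome, chrom, exons, strand):
    """Sequence of the full transcript span, introns included."""
    start = min(s for s, _ in exons)
    end = max(e for _, e in exons)
    seq = genome[chrom][start:end]
    return revcomp(seq) if strand == "-" else seq

def spliced(genome, chrom, exons, strand):
    """Exon sequences joined in genomic order, introns removed."""
    seq = "".join(genome[chrom][s:e] for s, e in sorted(exons))
    return revcomp(seq) if strand == "-" else seq

# Toy genome with a two-exon transcript; positions 3..8 are the intron.
genome = {"chr1": "ATGCCCGGGTTTAAGTGA"}
exons = [(0, 3), (9, 12)]

with_introns = unspliced(genome, "chr1", exons, "+")   # "ATGCCCGGGTTT"
exons_only = spliced(genome, "chr1", exons, "+")       # "ATGTTT"
exons_only_minus = spliced(genome, "chr1", exons, "-") # "AAACAT"
```

A single BED6 record per transcript only carries the span, so it can only give the first result. Getting the second requires per-exon information, for example BED12 blocks together with the -split option of bedtools getfasta, or the exon rows of the GTF as used by gffread -w.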
