Extracting transcript sequences with gene name (gffread)
2
2
Entering edit mode
6.3 years ago
cgias ▴ 20

Hello,

I've extracted the transcripts using gffread from Cufflinks package using

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

It gave me the sequences with the transcript_id but I'd like to have it with the gene_name. Is there any way to extract it straight with gene_name using gffread?

If not, how can I change transcript_id to gene_name using the gtf file? I know it's possible using the shell but I'm not sure how to tell it to search for the right gene and then change its identification.

The extracted sequences look like this

>rna30121
GGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCACCATGCGCGGGTTCAATTCCCGTCAT
TCGCC
>gene20202 CDS=1-1062
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACCAGCA
CTGAAAACCGTCTTTACATTGGATGGTTTGGTGTTTTGATGATCCCTACCTTATTGACCGCAACTTCTGT
ATTTATTATCGCATTCATTGCTGCTCCTCCAGTAGATATTGATGGTATTCGTGAACCTGTTTCTGGATCT
C
>gene20203
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTATCCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>rna30122
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTAACCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>gene20204 CDS=1-1521
ATGGAGGAATTTCAAGTATATTTAGAACTAGATAGATTTCGGCAACACGACTTCCTATACCCACTTATTT
TTCGGGAGTATATTTATGCACTTGCTCATGATCATAGTTTAAATATAAATAATAGATCCGGTTTGTTGGA
A

And the gtf

NC_010323.1 RefSeq  exon    63  137 .   -   .   transcript_id "rna30121"; gene_id "gene20201"; gene_name "trnH-GUG";
NC_010323.1 RefSeq  CDS 592 1653    .   -   0   transcript_id "gene20202"; gene_id "gene20202"; gene_name "psbA";
NC_010323.1 RefSeq  exon    1924    1959    .   -   .   transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    4534    4569    .   -   .   transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    1924    1958    .   -   .   transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  exon    4533    4569    .   -   .   transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq  CDS 2266    3786    .   -   0   transcript_id "gene20204"; gene_id "gene20204"; gene_name "matK";

Thanks in advance

RNA-Seq • 11k views
ADD COMMENT
1
Entering edit mode

A simple hack is to try following on gtf (take a backup of existing gtf) and use it to annotate. Issue is that a gene can have multiple transcripts and one cannot simply replace a transcript with gene name as this would result in duplicate headers. Instead append gene name to transcript ID.

sed 's/"; gene_id "/_/g' transcripts.gtf> transcripts_new.gtf

Output fasta will have transcriptID_geneID as headers. If OP gtf has same transcript ID and gene ID then, try following on gtf:

sed 's/"\w\+[0-9]\+"; gene_id\s//g' transcripts.gtf> transcripts_new.gtf
ADD REPLY
2
Entering edit mode
6.3 years ago

Hello cgias,

I haven't seen an option to do it in one run. What you can do is take the fasta and replace the transcript_id by the gene_name. This can be done with awk like this:

$ awk -v FS="\t" 'NR==FNR {match($9, /transcript_id "([^"]+)"/ , t); match($9, /gene_name "([^"]+)"/, g); transcript[t[1]]=g[1]; next} {match($0, />([^ ]+)/, t); gsub(t[1], transcript[t[1]]); print}' transcripts.gtf input.fasta > output.fasta

What awk is doing here is to first go through the gtf file and creates a list where the transcript_id is the index and the gene_name the value. Next it go through the fasta file extract the transcript_id and replaces it with the corresponding value (=gene_name) from our list we' created before.

fin swimmer

ADD COMMENT
2
Entering edit mode
6.3 years ago

Try that

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf -F

where

-F  full GFF attribute preservation (all attributes are shown)
ADD COMMENT

Login before adding your answer.

Traffic: 1838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6