Hello,
I've extracted the transcript sequences using gffread from the Cufflinks package with:
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
This gave me the sequences labeled with the transcript_id, but I'd like them labeled with the gene_name. Is there any way to extract them directly with gene_name using gffread?
If not, how can I change transcript_id to gene_name using the gtf file? I know it's possible using the shell but I'm not sure how to tell it to search for the right gene and then change its identification.
The extracted sequences look like this:
>rna30121
GGCGGATGTAGCCAAGTGGATCAAGGCAGTGGATTGTGAATCCACCATGCGCGGGTTCAATTCCCGTCAT
TCGCC
>gene20202 CDS=1-1062
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACCAGCA
CTGAAAACCGTCTTTACATTGGATGGTTTGGTGTTTTGATGATCCCTACCTTATTGACCGCAACTTCTGT
ATTTATTATCGCATTCATTGCTGCTCCTCCAGTAGATATTGATGGTATTCGTGAACCTGTTTCTGGATCT
C
>gene20203
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTATCCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>rna30122
GGGTTGCTAACTCAATGGTAGAGTACTCGGCTTTTAACCGACTAGTTCCGGGTTCGAGTCCCGGGCAACC
CA
>gene20204 CDS=1-1521
ATGGAGGAATTTCAAGTATATTTAGAACTAGATAGATTTCGGCAACACGACTTCCTATACCCACTTATTT
TTCGGGAGTATATTTATGCACTTGCTCATGATCATAGTTTAAATATAAATAATAGATCCGGTTTGTTGGA
A
And the GTF:
NC_010323.1 RefSeq exon 63 137 . - . transcript_id "rna30121"; gene_id "gene20201"; gene_name "trnH-GUG";
NC_010323.1 RefSeq CDS 592 1653 . - 0 transcript_id "gene20202"; gene_id "gene20202"; gene_name "psbA";
NC_010323.1 RefSeq exon 1924 1959 . - . transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq exon 4534 4569 . - . transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq exon 1924 1958 . - . transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq exon 4533 4569 . - . transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq CDS 2266 3786 . - 0 transcript_id "gene20204"; gene_id "gene20204"; gene_name "matK";
Thanks in advance
A simple hack is to edit the GTF (back up the existing GTF first) and use the edited file as the annotation for the extraction. The issue is that a gene can have multiple transcripts, so one cannot simply replace the transcript ID with the gene name, as this would result in duplicate headers. Instead, append the gene name to the transcript ID.
The output FASTA will then have transcriptID_geneID as headers. If the OP's GTF has the same transcript ID and gene ID, then try the following on the GTF:
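As a minimal sketch of such an edit (one possible approach, not necessarily the exact command intended above): the awk script below appends the value of a chosen attribute to every transcript_id. It assumes a tab-separated GTF whose ninth column holds the attributes in the quoted form shown in the question. The file names demo.gtf and demo.renamed.gtf and the awk variable attr are made up for illustration.

```shell
#!/bin/sh
set -eu

# Small demo GTF built from three lines of the question (tab-separated).
{
  printf 'NC_010323.1\tRefSeq\texon\t63\t137\t.\t-\t.\ttranscript_id "rna30121"; gene_id "gene20201"; gene_name "trnH-GUG";\n'
  printf 'NC_010323.1\tRefSeq\tCDS\t592\t1653\t.\t-\t0\ttranscript_id "gene20202"; gene_id "gene20202"; gene_name "psbA";\n'
  printf 'NC_010323.1\tRefSeq\texon\t1924\t1958\t.\t-\t.\ttranscript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";\n'
} > demo.gtf

# Keep a backup of the untouched annotation.
cp demo.gtf demo.gtf.bak

# attr is the attribute whose value gets appended (gene_name or gene_id).
awk -v attr="gene_name" 'BEGIN { FS = OFS = "\t" }
/^#/ { print; next }                       # pass comment lines through
{
    val = ""
    # Find  attr "value"  in the attribute column and cut out the value.
    if (match($9, attr " \"[^\"]*\""))
        val = substr($9, RSTART + length(attr) + 2, RLENGTH - length(attr) - 3)
    # Insert _value just before the closing quote of transcript_id.
    if (val != "" && match($9, /transcript_id "[^"]*/))
        $9 = substr($9, 1, RSTART + RLENGTH - 1) "_" val substr($9, RSTART + RLENGTH)
    print
}' demo.gtf > demo.renamed.gtf

# Then rerun the extraction from the question on the edited file, e.g.:
# gffread -w transcripts.fa -g /path/to/genome.fa demo.renamed.gtf
```

Setting attr="gene_id" gives transcriptID_geneID instead. Because every exon or CDS line of a transcript gets the same suffix, multi-exon transcripts stay grouped, and two transcripts of one gene (rna30122 and gene20203 for trnK-UUU) still end up with distinct IDs. With the edited GTF, the gffread headers should come out as, for example, >rna30121_trnH-GUG. Gene names containing spaces would need extra handling, since FASTA IDs stop at the first space.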