6.7 years ago
Venados
Hello all!
I'd like to get one FASTA entry per known transcript in the GENCODE v27 GTF (Ensembl 90).
Is there a simple way to do this, or a repository where I can find such a FASTA file?
Or do I need to write a script to extract and concatenate all the exon sequences for every transcript?
Thanks a lot in advance!
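For reference, the "extract and concatenate all the exon sequences" approach could look roughly like the sketch below. It is only a minimal illustration, not a tested pipeline: the function names are made up for this example, the genome is assumed to be already loaded as a dict mapping chromosome name to sequence string, and the GTF is assumed to have standard tab-separated columns with a transcript_id attribute on each exon line.

```python
import re
from collections import defaultdict

# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcripts_from_gtf(gtf_lines, genome):
    """Build spliced transcript sequences from GTF exon records.

    gtf_lines: iterable of GTF lines (e.g. an open file handle).
    genome:    dict of chromosome name -> sequence string (assumed input).
    Returns a dict of transcript_id -> concatenated exon sequence.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute sequence
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])  # 1-based, inclusive
        exons[match.group(1)].append((chrom, start, end, strand))

    transcripts = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda exon: exon[1])  # genomic order
        seq = "".join(genome[c][s - 1:e] for c, s, e, _ in parts)
        if parts[0][3] == "-":
            # Minus-strand transcripts read in the opposite direction.
            seq = reverse_complement(seq)
        transcripts[transcript_id] = seq
    return transcripts
```

Holding a whole human genome in a dict of strings is memory-hungry, so for real data an indexed FASTA reader would probably be a better fit than the plain dict assumed here.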
I just downloaded hg38_CDS_all.fasta, which should contain the data I want, since it doesn't include intronic sequence.
However, many entries, such as the ones below, are too short to be transcripts. Any idea why this happens? Thanks in advance!
>ENST00000434970.2 cds chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T-cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]
CCTTCCTAC
>ENST00000448914.1 cds chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T-cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000415118.1 cds chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T-cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
>ENST00000631435.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1
GGGACAGGGGGC
>ENST00000632684.1 cds chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1
GGGACAGGGGGC
>ENST00000454908.1 cds chromosome:GRCh38:14:105919502:105919518:-1 gene:ENSG00000236170.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD1-1 description:immunoglobulin heavy diversity 1-1 [Source:HGNC Symbol;Acc:HGNC:5482]
GGTACAACTGGAACGAC
>ENST00000390567.1 cds chromosome:GRCh38:14:105881034:105881053:-1 gene:ENSG00000211907.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD1-26 description:immunoglobulin heavy diversity 1-26 [Source:HGNC Symbol;Acc:HGNC:5485]
GGTATAGTGGGAGCTACTAC
>ENST00000603326.1 cds chromosome:GRCh38:15:20004797:20004815:-1 gene:ENSG00000271317.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4OR15-4A description:immunoglobulin heavy diversity 4/OR15-4A (non-functional) [Source:HGNC Symbol;Acc:HGNC:5506]
TGACTATGGTGCTAACTAC
>ENST00000414852.1 cds chromosome:GRCh38:14:105913222:105913237:-1 gene:ENSG00000233655.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4-4 description:immunoglobulin heavy diversity 4-4 [Source:HGNC Symbol;Acc:HGNC:5505]
TGACTACAGTAACTAC
>ENST00000454691.1 cds chromosome:GRCh38:14:105910410:105910427:-1 gene:ENSG00000228131.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD6-6 description:immunoglobulin heavy diversity 6-6 [Source:HGNC Symbol;Acc:HGNC:5517]
GAGTATAGCAGCTCGTCC
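Every short entry listed above carries the transcript_biotype TR_D_gene or IG_D_gene in its header, so one way to check whether the short records are confined to particular biotypes is to tabulate sequence lengths per biotype. The sketch below is only an illustration under assumptions: the function names are invented for this example, and it relies on headers containing a transcript_biotype:VALUE field as in the records shown.

```python
import re
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def lengths_by_biotype(lines):
    """Group sequence lengths by the transcript_biotype found in each header."""
    lengths = defaultdict(list)
    for header, seq in read_fasta(lines):
        match = re.search(r"transcript_biotype:(\S+)", header)
        biotype = match.group(1) if match else "unknown"
        lengths[biotype].append(len(seq))
    return dict(lengths)
```

Calling lengths_by_biotype on an open handle to the downloaded file and printing the minimum and maximum length per biotype should show at a glance which categories the very short sequences fall into.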