I have a master.list.gtf (generated from Cufflinks on RNAseq data) that I wish to convert to .fasta. So far I have tried using the gffread function in Cufflinks and the getfasta function from bedtools:
# gffread command
./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta
# bedtools command
bedtools getfasta -fi path/to/GRCh38.p13.fna -bed /path/to/master.list.gtf
However, when I run these commands I get the error:
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
Presumably this is because the chromosome IDs in the .fna file and .gtf don't match up. The .fasta reference file I am using begins like this:
>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
whereas my gtf file is formatted like so:
#!genome-build GRCh38.p2
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.17
#!genebuild-last-updated 2015-01
chr1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
How can I edit my gtf file so that its chromosome names match those in the reference?
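One way to do the rename directly is with a two-column mapping table and awk. This is only a minimal sketch: the file names (demo.gtf, chrom_map.tsv, demo.renamed.gtf) are made up for illustration, the table has to be filled in by hand for every chromosome (only the chr1 / CM000663.2 pair is known from the headers shown above), and it assumes the GTF coordinates are valid on the FASTA build being used.

```shell
# Demo input standing in for master.list.gtf (one header line, one feature line).
workdir=$(mktemp -d)
printf '#!genome-build GRCh38.p2\n' > "$workdir/demo.gtf"
printf 'chr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n' >> "$workdir/demo.gtf"

# Hypothetical mapping table: GTF name <TAB> FASTA name, one pair per line.
printf 'chr1\tCM000663.2\n' > "$workdir/chrom_map.tsv"

# Replace column 1 via the mapping; "#" header lines and unmapped names pass through unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { map[$1] = $2; next }
     /^#/ { print; next }
     ($1 in map) { $1 = map[$1] }
     { print }' "$workdir/chrom_map.tsv" "$workdir/demo.gtf" > "$workdir/demo.renamed.gtf"

# Show the sequence names now used by the feature lines.
renamed_names=$(grep -v '^#' "$workdir/demo.renamed.gtf" | cut -f1 | sort -u)
echo "$renamed_names"
```

Running grep '^>' on the reference FASTA lists the names that the second column of the table needs to contain. Names missing from the table are left untouched, so they would still trigger the same "not found" warning.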
Or you could get the reference from Ensembl so it matches the GTF?
I managed to get the reference genome that was used to generate the gtf file and converted the gtf to fasta successfully. Thanks!
Please paste the exact files that you have used (for other users) - thanks. I will then move this to an answer. Please also paste the commands that you used - again, thanks.
I managed to get the ref genome from the graduate student who generated the gtf file. However I believe you can also download the fasta reference (release 79) from here: ftp://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/dna. I had problems connecting to the ftp server due to issues with my school wifi.
The command I used was:
./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta
An index of the reference genome will be made if there isn't one already.
Why don't you simply download the matching fasta file for your GTF, which should be this one from GENCODE:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22/GRCh38.primary_assembly.genome.fa.gz
The .gtf file was previously generated by a PhD student running Cufflinks on RNAseq data from my lab. He has told me that the .gtf contains novel transcripts (transcripts with no 'transcript biotype' annotation in Ensembl). I assume that the fasta sequences for these novel transcripts cannot be found in the matching fasta file from GENCODE?
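For what it's worth, the GENCODE file linked above is a genome fasta, not a transcript fasta. As far as I understand, gffread -w builds each transcript by cutting the exon coordinates out of the genome passed with -g and joining them, so a transcript does not need to be annotated anywhere for its sequence to be extracted. The toy sketch below illustrates the idea with awk. The file names and the NOVEL.1 transcript are invented, it handles plus-strand exons of a single transcript only, and it is not how gffread is actually implemented.

```shell
workdir=$(mktemp -d)
# Toy genome (hypothetical): one 20-base sequence named chr1.
printf '>chr1\nAAAACCCCGGGGTTTTACGT\n' > "$workdir/toy.fa"
# Toy unannotated transcript with two plus-strand exons (GTF coordinates are 1-based, inclusive).
printf 'chr1\tCufflinks\texon\t1\t4\t.\t+\t.\ttranscript_id "NOVEL.1";\n' > "$workdir/toy.gtf"
printf 'chr1\tCufflinks\texon\t9\t12\t.\t+\t.\ttranscript_id "NOVEL.1";\n' >> "$workdir/toy.gtf"

# First file: load the genome, keyed by the header text after ">".
# Second file: append each exon's slice of the genome, in file order.
novel_seq=$(awk 'BEGIN { FS = "\t" }
     NR == FNR && /^>/ { name = substr($0, 2); next }
     NR == FNR { seq[name] = seq[name] $0; next }
     $3 == "exon" { out = out substr(seq[$1], $4, $5 - $4 + 1) }
     END { print out }' "$workdir/toy.fa" "$workdir/toy.gtf")
echo "$novel_seq"
```

The only requirement is that the sequence names and coordinates in the gtf refer to the same assembly as the genome fasta.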