I have Illumina mapped paired end reads (RNA seq) on human transcriptome (ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz) with tophat2 Now I have to analyze the results with HTseq. But I can't find the correponding GFF of the Homo_sapiens.GRCh38.cds.all.fa The headers of the fasta sequences seems to include all usefull informations :
>ENST00000631427.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142362570:142363134:1 gene:ENSG00000282543.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:AC234635.3 description:T cell receptor beta variable 4-3 [Source:UniProtKB/Swiss-Prot;Acc:A0A589]
>ENST00000632148.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142370836:142371348:1 gene:ENSG00000282353.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:AC234635.1 description:T cell receptor beta variable 6-3 [Source:UniProtKB/Swiss-Prot;Acc:P0DPF7]
>ENST00000631392.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142374511:142375050:1 gene:ENSG00000282506.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV7-2 description:T cell receptor beta variable 7-2 [Source:HGNC Symbol;Acc:HGNC:12236]
>ENST00000455382.2 cds chromosome:GRCh38:7:142300924:142301432:1 gene:ENSG00000226660.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV2 description:T cell receptor beta variable 2 [Source:HGNC Symbol;Acc:HGNC:12195]
>ENST00000390387.3 cds chromosome:GRCh38:7:142308542:142309048:1 gene:ENSG00000237702.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV3-1 description:T cell receptor beta variable 3-1 [Source:HGNC Symbol;Acc:HGNC:12212]
>ENST00000390357.3 cds chromosome:GRCh38:7:142313184:142313666:1 gene:ENSG00000211710.3 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV4-1 description:T cell receptor beta variable 4-1 [Source:HGNC Symbol;Acc:HGNC:12215]
Any advices on how to parse these fasta headers to buikd the GFF ? Existing tool ?
Thanks !
That sounds like a very cumbersome approach. Why not simply look for the correct GFF file, that must be available for human. You might not have a GFF file for (only) the CDS though, but you don't need that, one that covers everything on the genome will do fine
Actually, I have serched, but If I look the Homo_sapiens.GRCh38.96.chr.gff3 some scaffold identifiers are missing : for example "CHR_HSCHR7_2_CTG6" If I look the Homo_sapiens.GRCh38.96.abinitio.gff3 thses identifiers are present but the CDS identifiers are diffrent: For example "GENSCAN00000017672" instead of "ENST00000631427"
:/
Human is not really my field of research but have you looked here as well? :
https://www.gencodegenes.org/human/
Thanks i.sudbery and Devon Ryan. :) I thinks I'm going to start over the mapping on the primary assembly of the human genome. And use STAR or Salmon
Use
salmon
followed bytximport
. Computationally inexpensive while using current GC bias correction models and accounting for different transcript lengths when aggregating tx counts to the gene level.use STAR for genome and Salmon for transcriptome and not mix them up ;)