Hi all
When do RNA-Seq analysis, after I have finished the alignment by Tophat and get the accepted_hits.bam
. Now I want to looked at if it is has differential expression at the corresponding area (We called it 'pseudogene' which haven't been annotated by reference genome). How could I do it?
I think the best way is to add this pseudogene to reference genome GTF file. And then I could play HTSeq-count or Cuffdiff to look at the differential expression of this gene .
Here is the effort I made so far . I have add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file.
$ head genes_hello.gtf
29 protein_coding exon 28820825 28834944 . + . exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";**
1 protein_coding UTR 11193 11209 . + . gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1 protein_coding exon 11193 11261 . + . exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1 protein_coding transcript 11193 15975 . + . gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1 protein_coding CDS 11210 11261 . + 0 exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; protein_id "ENSECAP00000010287"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
However, if I run the Cuffdiff directly, the output (i.e. gene_exp.diff
) didn't contain any my pseudo gene information. And if I ran HTSeq-count, it gave me the error:
$ htseq-count -s yes aligned_sorted.sam genes_hello.gtf>gene_count.txt
Error occured when processing GFF file (line 1 of file genes.gtf):
need more than 1 value to unpack
[Exception type: ValueError, raised in __init__.py:207]
It seems that I didn't add the pseudogene properly. Does anyone know how to do it?
Thank you!
hi,
A GTF contains for each gene a hierarchy of information as: gene, transcript, exon. Probably you need to enter two more lines stating as "gene" and a "transcript" in Col 3 of GTF for your "pseudo-gene".
Cheers, I remember I have added not only exon but also CDS,UTR, sth. but I failed aw well.
So I only keep the exon in the gtf file and post this question.
But the problem must be the GTF file format.