Question

Tophat: transcriptome building from GTF

0

Entering edit mode

11.0 years ago

Bharat Iyengar ▴ 330

When I am using tophat to just build transcriptome from GTF file using

tophat --bowtie1 -G annotations.gtf --transcriptome-index=transcriptome genome

does tophat selectively pick up the lines corresponding to feature=transcript from a full GTF

or

is it necessary to create a new GTF for only the transcripts?

Tophat • 8.5k views

ADD COMMENT • link updated 9.4 years ago by super ▴ 60 • written 11.0 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

GTF files don't normally have a transcript feature, just exon, CDS, start_codon, and stop_codon. Are you sure you're not talking about a GFF file?

ADD REPLY • link 11.0 years ago by Devon Ryan 105k

0

Entering edit mode

Atleast GENCODE GTFs have it. The one that I am using right now is from ENSEMBL. It also has the following features:

CDS, start_codon, transcript, gene, exon, Selenocysteine, UTR, stop_codon

BTW I thought GFF and GTF are synonymous because it is always written GFF/GTF.

ADD REPLY • link 11.0 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

9.4 years ago

super ▴ 60

Not more.

My code and results:

$ head genes.gtf

29       protein_coding  exon    28820825   28834944   .       +       .       exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";
1       protein_coding  UTR     11193   11209   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  exon    11193   11261   .       +       .       exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1       protein_coding  transcript      11193   15975   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";

$ htseq-count -s yes aligned_sorted.sam genes.gtf>gene_count.txt

Error occured when processing GFF file (line 1 of file genes.gtf):
  need more than 1 value to unpack
  [Exception type: ValueError, raised in __init__.py:207]
Error occured when processing GFF file (line 1 of file genes.gtf):
  need more than 1 value to unpack
  [Exception type: ValueError, raised in __init__.py:207]

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by super ▴ 60

0

Entering edit mode

No clue then, perhaps you'd have better luck with featureCounts.

ADD REPLY • link 9.4 years ago by Devon Ryan 105k

0

Entering edit mode

Hi D, Thank you all the same!

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by super ▴ 60

0

Entering edit mode

Hi Devon

I have made my own GTF file

$head OvisERV.gtf
chr1    unknown exon    1340070 1340291 .       +       .       gene_id "Bos-Zeta-II"; gene_name "Bos-Zeta-II"; transcript_id "Bos-Zeta-II"; tss_id "Bos-Zeta-II";
chr1    unknown exon    1340525 1341376 .       +       .       gene_id "Bos-Zeta-II"; gene_name "Bos-Zeta-II"; transcript_id "Bos-Zeta-II"; tss_id "Bos-Zeta-II";
chr1    unknown exon    2803531 2804298 .       -       .       gene_id "Capra-g1"; gene_name "Capra-g1"; transcript_id "Capra-g1"; tss_id "Capra-g1";

we get some result, I think it means that there are 3 reads mapped in this area. Am I right?

$samtools view aligned.bam  "1:2803531-2804298"

ATF14:09301:08500       16      1       2804042 1       21M1D72M        *       0       0       GCTGAGGCCAGTCCCTGAACAATTTACTCACAAATGCAGGTCCATTATCAGAGCCTATAGATGCTGGCAGCCCAAACCTAGGTATAATCTCTT        ?<<;=7=7<:::4<;;;59822+44489996+667::;5;;4;:4::>;;;><6<<;<<<?=>><6;<;;3::3::4;;<7<<>>7?;<;736   AS:i:149        XS:i:149        XN:i:0  XM:i:5  XO:i:1       XG:i:1  NM:i:6  MD:Z:17C3^G0T24G15A1C28 YT:Z:UU
ATF14:01518:01473       16      1       2804295 11      1S30M1D108M1D10M        *       0       0       CCCCCTGATATCTTACACCTGTGTGCTGTCCTTTCCCGGTCTCATCATCTGACAGGCCTTGCAGGTGCGGACTATATCCTGGATGGTTTTCTTCTGTTGGGGAAACCACAGCTGCGCTGTTTGTAGGAGTGTCAAAGTCTTTTCTCTCC        -<<<==<<@==>7@@?<6<=<<;566<<:55,66,;;6<<989;;;;;;;::::5<5=6=<=<6<<<<6@@==<;;<7:;598942*6664188994;0>>>4>=6<<<<;<<;=<<<<3<<<>=7<:78;99199::6+6666:982;        AS:i:250        XS:i:240        XN:i:0  XM:i:5  XO:i:2  XG:i:2  NM:i:7  MD:Z:4C15C9^T20G63A1C21^T10 YT:Z:UU
ATF14:04367:02811       16      1       2804295 11      174M    *       0       0       CCCCTGATATCTTACACCTGTGTGCTGTCCTCTTCCCGGTCTCATCATCTGACAGGCCTTGCAGGTGCGGACTATATCCTGGATGGTTTTCTTCTGTTGGGGAAACCACAGCTGCGCTGTTTGTAGGAGTGTCAAAGTCTTTTTCTCTCCCAAATGGGTGGTTTGATGTAAATG       -66888::<;;5<<<<6>><<<;?=<:4.448462;:6888877:99::9776859394;:<A7;9<;7;<<<::872<96636.:.::;85877738.:<6085188::97:::93370:::;;5:9689:9289462*6662155/*/2*778/99;5:1:;;<9:93<?::       AS:i:307        XS:i:296        XN:i:0  XM:i:7  XO:i:0  XG:i:0  NM:i:7       MD:Z:4C15C10T19G63A1C42T13      YT:Z:UU

But why I get noting mapped when I ran htseq-count? aligned.bam is the BAM file which represents the sample mapped to sheep genome by Tophat.

Thank you!

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by super ▴ 60

0

Entering edit mode

One file has chromosomes named "1", the other "chr1". featureCounts will typically deal with that, but htseq-count won't.

Aside from that, whether the first read is included will depend on the MAPQ filtering settings. Further, have a look at the strand settings and compare that to the strand of the feature.

ADD REPLY • link 9.3 years ago by Devon Ryan 105k

Ram · Accepted Answer · 2014-05-27

3

Entering edit mode

11.0 years ago

Luyi Tian ▴ 120

I read the GTF parsing code in Tophat before. It reads exon , cds and transcript info. they add a lot of statements to maintain strong compatibility to different file format and deal with special cases, which made it not a enjoyable experience to read the source code. :P

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 11.0 years ago by Luyi Tian ▴ 120

0

Entering edit mode

I've moved this to an answer since it's much more informative than mine, given that you've actually slogged through the source code! Thanks doing so and mentioning what's actually in there!

ADD REPLY • link 11.0 years ago by Devon Ryan 105k

Ram · Accepted Answer · 2014-05-26

1

Entering edit mode

11.0 years ago

Devon Ryan 105k

Ah, the ones from UCSC and Ensembl that I have locally don't have those, though one can add pretty much anything in that field. No, GFF and GTF aren't the same (they're similar). GFF, in fact, has multiple variants itself. How tophat parses the annotation file isn't documented anywhere to my knowledge, so unless you want to bother going through the source code, just run the gtf_to_fasta command (that's how it creates the transcriptome reference) and have a look at it. At least in the case of the annotation files I've used, it looks at exon features.

ADD COMMENT • link 11.0 years ago by Devon Ryan 105k

0

Entering edit mode

i'll have a look. I can try removing the lines with certain features(transcript/exon) and see if it still builds it. But I too guess that it parses the exons. And since it worked for your GTF with no feature called transcript, it surely picks up exons only. Would you convert your comment to an answer so that I can accept it?

ADD REPLY • link 11.0 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

Sure, done. BTW, please post back if you do play around with things and let us know what happens with the transcript features. It's always nice when one can at least piece together some documentation with google!

ADD REPLY • link 11.0 years ago by Devon Ryan 105k

1

Entering edit mode

I removed the lines corresponding to exon and transcript and compared the fasta generated from original GTF with the reduced GTF. The files are same; It still seems to work!! Can't understand how!! Perhaps it is parsing the CDS and UTR features as an alternative. Would need to look at source code (haven't worked with c++ in a long time :P)

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 11.0 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

Ah the joys of undocumented software. At least it's open source :)

ADD REPLY • link 11.0 years ago by Devon Ryan 105k

0

Entering edit mode

Hi D，

If I have finished the Tophat step (get the accpeted_hits.bam), and I have looked at the corresponding area by the command samtools view input.bam "Chr1:10000-20000" and I found there are some reads at this area. Then if I want to look at the expression some part of genome (we called it pseudo gene, however, this pseudo gene could be real gene, but which is not annotated by Reference genome e.g. Ensembl), how could I do it?

Could I directly add this pseudo gene to the GTF/GFF file and then execute the Cuffdiff/edgeR? or I need to re-do the mapping step based on new GTF/GFF file?
Which feature I need pay attention? I add 'exon', 'CDS' and 'transcript', is that all?

Thank you!

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by super ▴ 60

0

Entering edit mode

It depends on whether your region of interest is similar to an annotated gene. If it is, then tophat2 would show fewer reads aligning than reality. If your region has low similarity then just rerun cufflinks/cuffdiff and/or featureCounts/edgeR.
If you're using a GTF file then only 'exon' matters. For a GFF you'll need 'gene' and 'transcript' too.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Devon Ryan 105k

0

Entering edit mode

Thank you D.

My steps:

STEP 1 Tophat :Align sample1.fq sample2.fq ... sample6.fq to Horse reference genome (Ensembl) and get the corresponding accpeted_hits.bam file;

STEP 2 Samtools: Look at the Chr21:100000-200000 of bam file and found there are some reads at the area;

STEP 3 Add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file

$ head genes_hello.gtf
21 protein_coding exon 100000 200000 . + . exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";
1 protein_coding UTR 11193 11209 . + . gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
1 protein_coding exon 11193 11261 . + . exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013"
......

STEP 4 Then run Cuffdiff directly based on this genes_hello.gtf

But the Cuffdiff output (DE result) didn't contain any my pseudo gene information.

Why? What's wrong?

Is that because I haven't made a proper GTF file format?

Thank you!

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by super ▴ 60

0

Entering edit mode

Hard to say, cuffdiff is a bit of a black box. What happens with featureCounts?

ADD REPLY • link 9.5 years ago by Devon Ryan 105k

0

Entering edit mode

I use htseq-count and get the error

need more than 1 value to unpack [Exception type: ValueError, raised in __init__.py:207]'

I think it may due to the improper annotation format.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by super ▴ 60

0

Entering edit mode

There wasn't any more to the error message than that?

ADD REPLY • link 9.4 years ago by Devon Ryan 105k