What is wrong with my custom GTF ??
2
0
Entering edit mode
9.4 years ago
Joel TM ▴ 60

Good day, I've been trying to make the tuxedo pipeline for RNAseq data analysis work with my custom GTF but it haven't succeeded yet. I have tried many things, read many many posts...still nothing. I want to see differential expression in lncRNA. Some of them are known, some are unknown. I've had success with HTSeq --> DESeq, EdgeR...but I want to corroborate my results with cuffdiff.

GTF with known lncRNA only looks like that, (it should work but doesn't):

chr1   unknown   exon         12567300   12567451   0   +   0   gene_id "SNORA59A"; gene_name "SNORA59A"; transcript_id "NR_003025_1"; tss_id "TSS3103";
chr1   unknown   exon         24171572   24172345   0   -   0   gene_id "FUCA1"; gene_name "FUCA1"; p_id "P16785"; transcript_id "NM_000147"; tss_id "TSS3487";
chr1   unknown   stop_codon   24172205   24172207   0   -   0   gene_id "FUCA1"; gene_name "FUCA1"; p_id "P16785"; transcript_id "NM_000147"; tss_id "TSS3487";
chr1   unknown   CDS          24172208   24172345   0   -   0   gene_id "FUCA1"; gene_name "FUCA1"; p_id "P16785"; transcript_id "NM_000147"; tss_id "TSS3487";
..
..
..

But tophat returns me with:

format:          fastq
        quality scale:   phred33 (default)
[2015-07-07 08:59:30] Reading known junctions from GTF file
        Warning: TopHat did not find any junctions in GTF file
[2015-07-07 08:59:30] Preparing reads
         left reads: min. length=20, max. length=50, 28359825 kept reads (14 discarded)
        right reads: min. length=20, max. length=50, 28359834 kept reads (5 discarded)
[2015-07-07 09:05:45] Building transcriptome data files..
[2015-07-07 09:06:01] Building Bowtie index from lncRNA_toUse.fa
        [FAILED]
Error: Couldn't build bowtie index with err = 1

My first goal is to fix this. Then I will have to try with my unknown lncRNA as well, which the only format I get is:

chr1    unknown    exon    47562325    47644943    0    +    0    gene_id "CYP4A22-AS1"; gene_name "CYP4A22-AS1";

It is worth noting that with the "official" Genes.gtf, the pipeline works. I am puzzled, and thank you very much for any insight.

Have a great day,

J.

RNAseq GTF • 6.3k views
ADD COMMENT
0
Entering edit mode

Just as a total guess, does it have something to do with lncRNA_toUse.fa? (as in, does it use the same labels as the gtf file)

Might be the same problem as this: https://biostar.usegalaxy.org/p/10046/

ADD REPLY
0
Entering edit mode

Tophat creates a log file and stores the exact commands that it uses. Try to run the last one and you'll (presumably) get the actual error that bowtie2 produces.

ADD REPLY
2
Entering edit mode
9.4 years ago
Joel TM ▴ 60

I thank you both for the time you took reading my issue. As it so happens often with bioinformatics, I simply re-ran my original extracting command providing me with a virtually identical .gtf file, yet, it is now working. I was looking so deep into the issue that I didn't think of the very, very simple things.

I wish you the best !

J.

ADD COMMENT
2
Entering edit mode

FWIW this is called "programming by coincidence" and, at least in principle, it is a bad idea to continue on without understanding what went wrong in the first place.

On the other hand as you say this is bioinformatics - so many things go haywire all the time - there is rarely time to investigate every issue

ADD REPLY
0
Entering edit mode

Well My issue is only partially resolved. As soon as I add data for the lncRNAs I am interested in but are absent from the reference gtf, it crashes even if that data is from the UCSC Genome Browser table.... I will keep updating this post but there's gotta be a way to make this work with the tuxedo pipeline...otherwise it's just not very flexible...

ADD REPLY
0
Entering edit mode

Again, look in the logs for the commands being run and run the last one there manually. You'll then likely get a more informative error message.

ADD REPLY
0
Entering edit mode

I wasn't aware of this "programming by coincidence" but now I realize that is a thing a saw a few times ago.

ADD REPLY
0
Entering edit mode

I would not go as far as saying I do bioinformatics in blind luck; having said that, I only started working full time in the field this year. I am still filling my toolbox, and the learning process is never ending but oh so rewarding. I come here because of the community. I hope that seeking help in the form of an open question on Biostars, when I have scoured the web and forums without finding my answer, isn't frowned upon. In the end, my problem was ridicule, but honest. And now I know. It might even serve someone else one day.

ADD REPLY
1
Entering edit mode
9.4 years ago
Joel TM ▴ 60

[FINAL SOLUTION]:

I think all is well now. Here is what I did.

  1. I used " sed " to extract the referenced functional lncRNAs from the genes.gtf
  2. I utilized lncRNAdb + UCSC Table Browser to extract the not so well documented functional lncRNAs I am interested in, making sure the output was in the GTF format.
  3. I created independent .txt files for each of those lncRNA. I added for each of those the field gene_name "xxxxx"; (time consuming)
  4. I used "cat" to merge everything back together and voila. Basically I did everything from the command line instead of adding a quick line here and there with my .gtf open.

I now have 0 warning, and no problem. I work exclusively in linux and never encountered this issue before (it was just a matter of time I am sure). May be it had something to do with encoding or end of line character...? I do not know.

Thanks to all again, it was a rare scenario where the .log didn't help me ...

have a great day,

J.

ADD COMMENT
0
Entering edit mode

Hi Joel,

I am wondering if it would be possible for you to run me through the steps in making a custom GTF file. I have RNASeq data for Mouse neurons with GFP expression (within a gene). I have GFP sequence and insertion sites where this was incorporated in the gene. I am trying to incorporate this sequence in the reference genome to quantify the expression of this GFP incorporated gene. I believe after I insert this sequence in the reference genome (either by inserting it in the gene of interest or copying the whole gene with inserted GFP as new chromosome), I will have to update the GTF file somehow. What do you think is a best approach to get it to work? Can you help me with this and share your experience?

Thanks,
Muhammad

ADD REPLY
0
Entering edit mode

[N.B., I'm obviously not Joel]

The simplest method would be to add the GFP sequence to the genomic fasta file and add a single line to the GTF file like:

GFP     ensembl exon    0       2000    .       +       .       gene_id "GFP"; transcript_id "GFP"; exon_number "1"; gene_name "GFP"; transcript_name "GFP";

You'll need to change the name to match whatever you use and also the size (I used to know the length of EGFP, but I can't recall it at the moment). If you would prefer to include the entire fusion gene sequence then you can also do that, but you then must hard mask the sequence for the original gene in the genome (otherwise, most of the reads you're interested will be multimappers and will then get ignored!). For that you can use "bedtools maskfasta". This method would probably produce the best results.

ADD REPLY
0
Entering edit mode

Hi Devon,

Thanks so much for your prompt response. I was waiting for more information from my collaborator before I tried this. As you suggested I hard-masked the original gene. Should I add entire fusion gene sequence as new chromosome? or at the end of same chromosome (chr6)? I have never worked on such projects before therefore I am confused. Since I have hard-masked the original gene, should I edit/delete the annotation in GTF for chr6? and edit it to be a new chromosome? lets say "gfp" or "24"??

I also have some questions about the reporter sequence. I found out from the paper that there are other elements flanking the reporter gene (ZsGreen) such as (Rosa26sor + CAG-prm + loxSTOPlox + ZsGreen + WPRE + etc) from paper "A robust and high-throughput Cre reporting and characterization system for the whole mouse brain 10.1038/nn.2467" and my original gene is Rosa26sor. I have ZsGreen and WPRE sequence only and from the paper I know that reporter gene was inserted between exon 1 and 2 but didn't mention insertion-nucleotide positions. After contacting the Author, she sent me Genbank file of whole vector again, I dont know the positions. What do you suggest I do about the exact positions for the reporter gene? I would love to email you the details and all these sequences. Let me email you the files. I really appreciate your inputs and support.

ADD REPLY
0
Entering edit mode

Hey Devon,

I have a similar problem as Mnoon. When I try including the entire fusion gene sequence, not only the reads that I'm interested in but a lot of other reads as well end up being multimappers. Could you explain what you mean by hard masking the sequence? Should I be removing the entries for the original gene in the gtf file? Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2815 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6