DEXSeq with cufflinks gtf file
2
0
Entering edit mode
10.1 years ago
Assa Yeroslaviz ★ 1.9k

Hi,

I would like to run DEXSeq with the results we are getting from cufflinks. I would like to use for that the gtf file created after running cuffmerge.

The problem I have is, that in this gtf file there are some exons with no strand information.

cut -f7 cufflinksAnalysis/cuffmerge/merged.gtf | sort | uniq -c
 126224 -
   3105 .
 124468 +

Interestingly, these genes have only one exon. None of them have more than that.

awk '{if($7=="."){print $0}}' cufflinksAnalysis/cuffmerge/merged.gtf | grep -v 'exon_number "1"' | wc -l
0
awk '{if($7=="+"){print $0}}' cufflinksAnalysis/cuffmerge/merged.gtf| grep -v 'exon_number "1"' | wc -l
104447

As these genes have only one exon, there can't be any differentially exon usage or alternative splicing events.

I know I can modify the dexseq_prepare_annotation.py script from the DEXSeq package, so that it would allow to work with non-stranded information, but I am not sure if these is a good idea, as we are also looking into some genes which have a certain overlap on the genomic sequence and without the strand information we will loose this genes/exons.

awk '{if($7!="."){print $0}}' cufflinksAnalysis/cuffmerge/merged.gtf > cufflinks_Dmel_DEXSeq_strandInfoIncluded.gff

My question is - can I just extract the exons without the strand information (the ones with ".") from the merged.gtf file and than run it with the DEXSeq package/scripts, or will it alter the results in any way?

Thanks
Assa

cuffmerge DEXSeq cufflinks gtf • 3.3k views
ADD COMMENT
5
Entering edit mode
10.1 years ago
komal.rathi ★ 4.1k

You can safely remove "any" single-exonic genes because you won't get differential exon usage for them anyway.

ADD COMMENT
0
Entering edit mode

I've moved this to an answer, because it nicely puts things back into perspective :)

ADD REPLY
1
Entering edit mode
10.1 years ago
Sam ★ 4.8k

If I remember correctly, the DEXSeq flow are as follow:

  1. Provide a GTF file and then break it down to exons units
  2. Count the number of reads per exome (DEXSeq should be using HTSeq as the counting algorithm)
  3. Run DEXSeq

If you extract the exons without the strain information and then merge them back together, some of the reads which will originally classified as ambiguous case might be counted twice. So that might have a slight impact to the results. However, that shouldn't really be a big problem as long as you are only trying to compare between each exons in difference conditions instead of comparing the expression of different exons. Maybe someone else have better idea as to how to deal with the data.

ADD COMMENT
0
Entering edit mode

The problem is mostly that some of the DEXseq methods (e.g. for visualization) need a strand to work. Since single-exon genes can't show differential exon usage (by definition), it makes sense to just do what komal.rathi suggested and just exclude such genes.

As an aside, these tend to just be repeat regions anyway (at least in my experience), so for me they've always just constituted noise.

ADD REPLY
0
Entering edit mode

And in another perspective, the multiple testing problem is already huge when working with per exone situation so it might not worth the effort to retain that many regions anyway?

ADD REPLY
0
Entering edit mode

thanks for all the help

ADD REPLY
0
Entering edit mode

I don't know if it's in the released version yet or not, but since DEXSeq is using DESeq2 internally now, there should be a switch at some point to doing independent filtering before producing the results (I though Alejandro Reyes mentioned that somewhere, though I'd need to look to be sure). This would remove the issue of "too many regions". Granted, on could just do this one's self...

ADD REPLY

Login before adding your answer.

Traffic: 1548 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6