Hi,
I have a GFF file containing MCF-7 cell transcript data. There is also a fastq file if that can be helpful.
chr1 PacBio transcript 27567 29338 . - . gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1 PacBio exon 27567 29338 . - . gene_id "PB2015.1"; transcript_id "PB2015.1.1";
These are the first two lines.
You can see that it lists both transcripts and exons. I want to filter out every transcript that has fewer than two exons, i.e. keep only multi-exon transcripts.
Additional Info:
I have attempted doing this with UCSC Table Browser and Galaxy. They both end up throwing errors.
Galaxy Error: An error occurred with this dataset: Traceback (most recent call last): File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/gff/gff_filter_by_feature_count.py", line 182, in <module> __main__() File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/gff/gff_filter_by_feature_c
File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/util/gff_util.py", line 191, in __next__ self.seed_interval = GenomicIntervalReader.next(self) RuntimeError: maximum recursion depth exceeded while calling a Python object
Filter 18: MCF7 hg19.gff
Using feature name exon
With following condition >1
Table browser doesn't list exon as a possible filter option when I upload this dataset as a custom track.
I am very new to this. Does anyone have any suggestions for me here? I can use R pretty fluently and I have a little bit of python ability. I also have bedtools set up but I don't know how to use it very well.
Please point me in the right direction!
Thanks, Alex.
will yield a list of the ids of all multi-exon transcripts. Is that what you're looking for? (I noticed the line format you provided above is gtf not gff, btw.)
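For anyone who prefers Python, here is a minimal sketch of the same kind of filter. It is not the script referred to above, just one way to do it under some assumptions: the file is GTF-style with a `transcript_id "..."` attribute in the ninth column, exons are records whose third column is exactly `exon`, and the function names are made up for illustration. It counts only `exon` records, so the `transcript` line that accompanies each transcript does not inflate the count.

```python
import re
from collections import Counter

# GTF-style attribute: transcript_id "PB2015.1.1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def multi_exon_transcript_ids(gtf_lines, min_exons=2):
    """Return the set of transcript_ids that have at least min_exons exon records."""
    exon_counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        # Split into at most 9 fields on any whitespace, so both tab- and
        # space-delimited copies of the file work; field 9 keeps its spaces.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exon_counts[match.group(1)] += 1
    return {tid for tid, n in exon_counts.items() if n >= min_exons}


def filter_gtf(gtf_lines, keep_ids):
    """Yield the records (transcript and exon lines) whose transcript_id is in keep_ids."""
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) in keep_ids:
            yield line
```

To use it, read the file into a list of lines, pass that list to `multi_exon_transcript_ids`, then write out whatever `filter_gtf` yields for the same list. Reading into a list first matters because the lines are traversed twice: once to count exons and once to filter.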
I just typed out a nice comment and accidentally pressed cancel instead of add comment. Lost everything. Super pissed.
Summary: Unfortunately, your script identified all transcripts as multi-exon. Thanks for your efforts. I think I can figure this out in R now though.
Thanks for pointing out this is GFFv2/GTF format. Now I am able to import and export in R without corruption.
Here is some toy data: 2 transcripts to keep and 2 to eliminate, in case you can figure out what went wrong in your script. I will update if I can get this sorted in R. Galaxy still errors out even though I told it the file was in GTF format.
Hmm, when I run this on your toy data, it returns only the three real multi-exon transcripts, i.e. PB2015.3.1, PB2015.4.1, PB2015.4.2.
Sorry I wasn't able to make your script work. I can take another look at it in a bit. But I still have good news! I managed to sort out the issue in R! I verified that the single-exon transcripts were eliminated by comparing the two tracks in IGV.
This honestly wouldn't have been possible without your noticing the format.
Well, that is good that you got a solution and that's what counts. I really need to get my R sorted these days...