Question

Using galaxy tool htseq_count for lncRNA in wheat reference genome

5.0 years ago

o.delaney • 0

We are trying to run htseq_count on Galaxy with a lncRNA GFF file; however, we have encountered several difficulties. Specifically, there are many lines for what appears to be the same coding sequence; for instance, the first nine rows all start at position 61723. This means that htseq_count does not know which of these nine rows to match a particular read with. Furthermore, in the group column each entry starts with ID=STRG... rather than the traditional gene_id=..., which also confounds our approach: htseq_count cannot recognise which lines in the GFF file actually belong to a single feature.

How do I circumvent these problems? Are there other tools on Galaxy I should use first to clean the GFF file (see image), or do I need to use special settings or some other trick? Let me know if there is any other information you need, or if I should share my Galaxy history with you to clarify things.

[Image: screenshot from Galaxy of the GFF file]

RNA-Seq alignment • 1.3k views

updated 4.8 years ago by colindaven 7.0k • written 5.0 years ago by o.delaney • 0

score 0 · Answer 1 · 2020-03-17

That's how GFF3 files are structured:

e.g.

geneA / path1 in your example
|
--transcript1 / mrna1 in your example, parent geneA
     |
    ---exon1, parent transcript1
--transcript2 / mrna1 in your example, parent geneA
     |
    ---exon1, parent transcript2
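
To make the hierarchy concrete, here is a minimal Python sketch that follows the Parent= links in column 9 and groups every row under its top-level feature. The sample rows, the `chr1A` and `src` values, and all function names are made up for illustration; they are not taken from your file.

```python
from collections import defaultdict

# Hypothetical GFF3 rows (tab-separated), loosely modelled on the tree above.
SAMPLE_GFF3 = "\n".join([
    "chr1A\tsrc\tgene\t61723\t62500\t.\t+\t.\tID=geneA",
    "chr1A\tsrc\tmRNA\t61723\t62500\t.\t+\t.\tID=transcript1;Parent=geneA",
    "chr1A\tsrc\texon\t61723\t62000\t.\t+\t.\tID=t1.exon1;Parent=transcript1",
    "chr1A\tsrc\tmRNA\t61723\t62500\t.\t+\t.\tID=transcript2;Parent=geneA",
    "chr1A\tsrc\texon\t61723\t62100\t.\t+\t.\tID=t2.exon1;Parent=transcript2",
])


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def top_level_parent(feature_id, parent_of):
    """Follow Parent links upwards until a feature without a parent is reached."""
    seen = set()  # guards against accidental cycles in a malformed file
    while feature_id in parent_of and feature_id not in seen:
        seen.add(feature_id)
        feature_id = parent_of[feature_id]
    return feature_id


def group_rows_by_gene(gff3_text):
    """Map each top-level feature ID to the (type, ID) of every row under it."""
    rows = []
    parent_of = {}
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature row
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID")
        if feature_id is None:
            continue  # this sketch only handles rows that carry an ID
        if "Parent" in attrs:
            # Simplification: if several parents are listed, use the first.
            parent_of[feature_id] = attrs["Parent"].split(",")[0]
        rows.append((cols[2], feature_id))

    groups = defaultdict(list)
    for feature_type, feature_id in rows:
        groups[top_level_parent(feature_id, parent_of)].append((feature_type, feature_id))
    return dict(groups)


if __name__ == "__main__":
    for gene, members in group_rows_by_gene(SAMPLE_GFF3).items():
        print(gene, len(members), "rows")
```

In this toy input all five rows start at 61723 yet resolve to the single feature geneA, which is probably what is going on with your nine rows.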

If you're not happy, try

a) featureCounts
b) just taking the gene features to start playing (use grep etc.)
c) mapping the transcripts yourself, using GMAP or MAKER.
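
For option b), a rough Python equivalent of the grep approach might look like the sketch below. It keeps header lines plus the rows whose third column matches one feature type. The function name and sample rows are hypothetical, and whether "gene" is the right type for your file is an assumption you would need to check against column 3. Separately, htseq-count has `--type` and `--idattr` options for choosing the feature type and the attribute used as the feature ID, which may help with the ID=STRG... issue; I have not tested that on your data.

```python
def keep_feature_type(gff3_lines, feature_type="gene"):
    """Yield header lines and rows whose third column equals feature_type."""
    for line in gff3_lines:
        if line.startswith("#"):
            yield line  # keep GFF3 header/comment lines as they are
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == feature_type:
            yield line


# Hypothetical rows for illustration only.
SAMPLE_ROWS = [
    "##gff-version 3\n",
    "chr1A\tsrc\tgene\t61723\t62500\t.\t+\t.\tID=geneA\n",
    "chr1A\tsrc\tmRNA\t61723\t62500\t.\t+\t.\tID=transcript1;Parent=geneA\n",
    "chr1A\tsrc\texon\t61723\t62000\t.\t+\t.\tID=t1.exon1;Parent=transcript1\n",
]

if __name__ == "__main__":
    for row in keep_feature_type(SAMPLE_ROWS, feature_type="gene"):
        print(row, end="")
```

The same function also works on an open file handle, since it only iterates over lines.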

It's not perfect or easy, but hey, welcome to bioinformatics.