Using galaxy tool htseq_count for lncRNA in wheat reference genome
5.0 years ago
o.delaney • 0

We are trying to run htseq_count with a lncRNA GFF file; however, we have encountered several difficulties. Specifically, there are many lines for what appears to be the same coding sequence: for instance, the first nine rows all start at position 61723. This means that htseq_count does not know which of these nine rows to match a particular read with. Furthermore, in the group column each entry starts with ID=STRG... rather than the traditional gene_id=..., which also confounds our approach because htseq_count cannot recognise which lines in the GFF file all belong to a single feature.

How do I circumvent these problems? Are there other tools on Galaxy I should use first to clean the GFF file (see image), or do I need to use special settings or some other trick? Let me know if you need any other information, or if I should share my Galaxy history with you to clarify things.

[Image: screenshot from Galaxy of the GFF file]

RNA-Seq alignment • 1.3k views
4.8 years ago

That's how GFF3 files are structured, e.g.:

geneA (path1 in your example)
|
|--transcript1 (mrna1 in your example), parent geneA
|    |
|    |--exon1, parent transcript1
|
|--transcript2 (mrna1 in your example), parent geneA
     |
     |--exon1, parent transcript2
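
To make the hierarchy concrete, here is a minimal Python sketch that follows the `ID=` / `Parent=` attributes in column 9 back to the top-level feature. The rows and the IDs (`STRG.1`, `STRG.1.1`, ...) are invented for illustration and are not taken from your file. The sketch also assumes each feature has a single parent, which GFF3 does not strictly require.

```python
# Hypothetical GFF3 rows (tab-separated, 9 columns); IDs and coordinates are
# made up for illustration, apart from the start position quoted in the question.
EXAMPLE_ROWS = [
    "chr1\texample\tgene\t61723\t63000\t.\t+\t.\tID=STRG.1",
    "chr1\texample\tmRNA\t61723\t63000\t.\t+\t.\tID=STRG.1.1;Parent=STRG.1",
    "chr1\texample\texon\t61723\t62000\t.\t+\t.\tID=STRG.1.1.exon1;Parent=STRG.1.1",
    "chr1\texample\tmRNA\t61723\t62800\t.\t+\t.\tID=STRG.1.2;Parent=STRG.1",
    "chr1\texample\texon\t61723\t62800\t.\t+\t.\tID=STRG.1.2.exon1;Parent=STRG.1.2",
]


def parse_attributes(column9):
    """Split a 'key=value;key=value' attribute string into a dict."""
    pairs = (f.split("=", 1) for f in column9.strip().split(";") if "=" in f)
    return {key.strip(): value.strip() for key, value in pairs}


def build_parent_map(rows):
    """Map each feature ID to its parent ID (None for top-level features)."""
    parent_of = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = row.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            # Simplification: only the first parent is followed.
            parent = attrs.get("Parent", "").split(",")[0]
            parent_of[attrs["ID"]] = parent or None
    return parent_of


def top_level_id(feature_id, parent_of):
    """Walk up the Parent chain until a feature without a parent is reached."""
    seen = set()
    while parent_of.get(feature_id) and feature_id not in seen:
        seen.add(feature_id)
        feature_id = parent_of[feature_id]
    return feature_id


parent_of = build_parent_map(EXAMPLE_ROWS)
for fid in parent_of:
    print(fid, "->", top_level_id(fid, parent_of))
```

Rows that share a start position are therefore not duplicates: they are the gene, its transcripts, and their exons, and all of them resolve to one top-level ID.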

If you're not happy, try

a) featureCounts
b) just taking the gene features to start playing with (use grep etc.)
c) mapping the transcripts yourself, using GMAP or MAKER.

It's not perfect or easy, but hey, welcome to bioinformatics.
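
For option (b), a rough Python equivalent of the grep approach might look like the sketch below. It keeps comment lines plus every row whose third column is `gene`. The example rows are again hypothetical, and your file may use a different feature type name in column 3, so check that first.

```python
# Hypothetical GFF3 content; replace with the lines read from your own file.
EXAMPLE_GFF3 = [
    "##gff-version 3",
    "chr1\texample\tgene\t61723\t63000\t.\t+\t.\tID=STRG.1",
    "chr1\texample\tmRNA\t61723\t63000\t.\t+\t.\tID=STRG.1.1;Parent=STRG.1",
    "chr1\texample\texon\t61723\t62000\t.\t+\t.\tID=STRG.1.1.exon1;Parent=STRG.1.1",
]


def keep_feature_type(lines, wanted="gene"):
    """Return comment lines plus the rows whose type (column 3) equals `wanted`."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # preserve header and directive lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == wanted:
            kept.append(line)
    return kept


genes_only = keep_feature_type(EXAMPLE_GFF3)
for line in genes_only:
    print(line)
```

With a gene-only file, the counting step then has to be told to count on that feature type and to read the `ID` attribute instead of `gene_id`. On the htseq-count command line those are the `--type` and `--idattr` options, and the Galaxy wrapper should expose equivalent fields.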
