Hi all,
I'm having an issue with duplicate gene names/IDs, which are preventing me from loading my count matrix into edgeR.
I have a GFF3 file which I converted to GTF using:
gffread my.gff3 -T -o my.gtf
I am able to produce a count matrix from this GTF file with HTSeq. However, when I try to read the file for edgeR (I have tried both on the command line and through the Galaxy website), it cannot be read because of duplicate gene names.
This is the specific error I get in R, if that helps (I get the same error with row.names=1):
> x <- read.delim("counts.txt",row.names="Contig")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
Looking at the GTF file, there are multiple lines with the same gene name/ID/transcript ID etc., but this is because they are different features of the same gene (the phase and strand position of each exon/part of the CDS).
So it seems that either I need to tell HTSeq to differentiate identical gene names based on position/feature, or I need to convert GFF3->GTF in such a way that each gene name is unique, or so that a single line holds all the information needed for each gene.
Does anyone know which is the best way to do this, and how?
(I have a feeling I will need to do this by changing some settings in HTSeq, so if anyone knows how to do this in Galaxy that would be amazing)
Many thanks,
Chloe
Hi Chloe,
I think there are several possible solutions. The best one is to find a GFF3 or GTF file with unique identifiers, such as Ensembl accessions.
Another (less ideal) option is to read your table into R without the row.names argument and then assign the row names manually from your "Contig" column. Note that R also rejects duplicated row names at that step, so the repeated IDs would first have to be made unique (for example with make.unique()). I also expect problems if read.table encounters mixed character and numeric values.
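Before choosing between these options, it may help to see exactly which IDs are repeated in the count file. The following is only a minimal sketch, assuming counts.txt is tab-delimited, has a single header line, and carries the gene ID in its first column (the "Contig" column). The function name list_duplicate_ids is made up for illustration and is not part of HTSeq, edgeR or Galaxy.

```shell
# Hypothetical helper (not part of any tool mentioned in this thread).
# Assumes: tab-delimited file, one header line, gene ID in column 1.
# Prints each ID that occurs more than once, followed by its occurrence count.
list_duplicate_ids() {
    awk -F'\t' '
        NR > 1 { seen[$1]++ }            # skip the header, tally each ID
        END {
            for (id in seen)
                if (seen[id] > 1)        # keep only the repeated IDs
                    print id "\t" seen[id]
        }
    ' "$1" | sort                        # sort for a stable, readable listing
}
```

Running `list_duplicate_ids counts.txt` would list the repeated IDs, one per line. If nothing is printed, the first column is already unique and the cause of the error is probably elsewhere, for example a different column being used for the row names.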