8.7 years ago
Michelle M.
Hi there,
So I'm using an Ensembl GTF file (GRCh37) for RNA-seq analysis and am wondering about the patches.
I know what the annotation patches are and why they're there, but should I exclude them when generating my count matrix in HTSeq or Cufflinks? That is, if I leave them in, won't I get multi-reads mapping to both the patch and the original region, thereby skewing the true counts?
Thanks for your input, much appreciated.
Cheers,
M
I went through a similar conundrum. While I am not exactly answering your question, I can share this with you: my libraries are pretty large and I couldn't believe how long the calculations were taking, so I will be removing the patches and restarting the analysis. I feel more comfortable about this decision since I came across (this morning) a line from the STAR aligner manual: "Generally, patches and alternative haplotypes should not be included in the genome", which suggests using only the primary assembly.
https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf (page 5)
You do raise a valid point, though, and I would be very curious to see a definitive answer.
Thanks Joel, that helps a lot. I'll be interested to see if anyone can confirm this, but in the meantime I think I'll be removing the patches from the file.
I just came across this, which was helpful: http://seqanswers.com/forums/archive/index.php/t-4459.html
Cheers
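For anyone wanting to do the same, here is a minimal sketch of one way to drop patch records from a GTF before counting. It assumes Ensembl-style sequence names without a "chr" prefix (1-22, X, Y, MT) and keeps only records on those sequences; the function names and file paths are hypothetical. Note that this filter also drops unplaced scaffolds, which are part of the primary assembly, so add their names to the set if you want to keep them.

```python
# Assumption: Ensembl-style seqnames with no "chr" prefix.
PRIMARY_SEQNAMES = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def filter_primary(lines, keep=PRIMARY_SEQNAMES):
    """Yield comment/header lines and records whose seqname (column 1) is in `keep`."""
    for line in lines:
        if line.startswith("#"):
            # Pass header/comment lines through unchanged.
            yield line
            continue
        # GTF is tab-delimited; the first field is the sequence name.
        seqname = line.split("\t", 1)[0]
        if seqname in keep:
            yield line


def filter_gtf_file(in_path, out_path, keep=PRIMARY_SEQNAMES):
    """Write a copy of the GTF at `in_path` containing only records on `keep`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_primary(src, keep))


# Hypothetical usage:
# filter_gtf_file("annotation.gtf", "annotation.primary.gtf")
```

The reference used for alignment should be filtered the same way (or downloaded as the primary assembly), so that the genome and the annotation refer to the same set of sequences.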