Hello,
I'm trying to find overlapping genes for my CNV calls. I downloaded the gene annotations (hg18 (Mar2006, NCBI build 36)) from UCSC:
[knownGene.txt.gz]
[kgXref.txt.gz]
and did the same for the refGene annotation, as explained on the PennCNV website.
But when I run the 'scan_region.pl' command an error occurs:
C:\penncnv>scan_region.pl sample.rawcnv hg18_refGene.txt -refgene -reflink hg18_refLink.txt > sample.cnv.rg18
Error: invalid record in template-location-file hg18_refGene.txt (expecting 16 or 10 tab-delimited fields in refGene file): <1410,2804,5917067 N525,1506,525,15824069132,140691R_02,, 873 7974, 215506,5254,2,1,, 218281,,
238422,, 23-1,6,525,05784525,1392, 6913282406918345,, 87372251586LIS995, 37974586,1,- 85544155 8, 0 CEP68,2,30,15,2061314048,88390,33,0,21066480,21066480,21066480488390,33,,,291384439717 -8,,2106335,883909781,,2913XR1 9717 4695,210664805392OC1924750493576081593549121593>
at C:\penncnv\scan_region.pl line 540 main::scanUCSCGene('sample.rawcnv', 'hg18_refGene.txt', 0, 'refgene', undef, undef) called at C:\penncnv\scan_region.pl line 108
Something seems to be broken in the annotation file. How can I avoid or fix this? I'm a biologist, not a computer scientist, so please be kind. ;)
Thank you
Can you show how hg18_refGene.txt looks?
It's a tab-delimited text file. When I open it in Excel there are 16 columns, but from line 900 onward the format is destroyed. So I suspect the problem is the extraction of the .gz archive.
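For anyone who wants to check this without Excel, here is a minimal Python sketch that reports lines whose tab-delimited field count is not 16 or 10 (the two counts named in the error message above). The function name `find_bad_lines` and its parameters are made up for illustration; this is not part of PennCNV.

```python
def find_bad_lines(path, expected=(16, 10), limit=5):
    """Return up to `limit` (line_number, field_count) pairs for malformed rows.

    `expected` holds the field counts taken from the scan_region.pl error
    message; the function itself is a hypothetical helper, not a PennCNV tool.
    """
    bad = []
    # Read as bytes so a corrupted (binary) region cannot raise a decoding error.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\r\n")
            if not line:
                continue  # skip blank lines
            count = len(line.split(b"\t"))
            if count not in expected:
                bad.append((number, count))
                if len(bad) >= limit:
                    break
    return bad

# Example call, using the file name from the question:
# for number, count in find_bad_lines("hg18_refGene.txt"):
#     print(f"line {number}: {count} fields")
```

If the first reported line sits far into the file and everything after it looks like noise, a damaged extraction is a more likely cause than a bad download of the annotation itself.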
Update: Yes, it was an extraction problem with PowerArchiver. Extracting the archive with WinRAR works!
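If no reliable archive tool is at hand, the .gz file can also be unpacked with Python's standard `gzip` module. This is only a sketch of an alternative, not what was used in this thread; the helper name `gunzip` and the input name `refGene.txt.gz` are assumptions (the download for refGene is not named above).

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst` (hypothetical helper).

    Both files are opened in binary mode so that no line-ending or
    character-set conversion is applied on Windows.
    """
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)

# Example call; "refGene.txt.gz" is an assumed name for the downloaded file:
# gunzip("refGene.txt.gz", "hg18_refGene.txt")
```

Streaming with `shutil.copyfileobj` keeps memory use low, which helps with large annotation tables.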
You should put that in the answer and then accept it, in case someone else has the same problem.