The sequencing center we work with recently started reporting variants on the random chromosomes, so we now end up with whole-exome variants that have been mapped to one of the chr_random sequences. First, I would like a good explanation of what these chr_random sequences are. Second, can anyone explain how to approach the analysis of variants on them? Should I just disregard them? I fear it would be hard to find many downstream annotations for these positions in public databases like dbSNP. I am sorry that the question is not more specific, but I think I basically need a "chr_random and exome sequencing 101" type of answer, or please direct me to a good read on this...
To be honest, I've never gone that far, since my current work ends with reporting variants. I can tell you that we haven't (yet) found variants in genes described on those contigs, surely because the probes were not designed to cover them, and since we're doing exome sequencing they were quickly filtered out. Nothing meaningful to date, I'm afraid. In case we find any, our intention is to process them the same way we process the rest, although we foresee that annotation on those contigs will be very limited.
Michael's link is where you can read a little about chr_random, chrUn and chr_hap, although it doesn't really help you decide what to do with variants found on those special contigs. Without going into further explanation here, all I can tell you is that my group has decided to disregard only the chr_hap sequences, even at the mapping step, due to their exposure to natural selection (variation found on them could be spurious, lots of pseudogenes are present, ...), and to keep the rest of the contigs (chr_random and chrUn), since they are left off the assembled chromosomes only for algorithmic reasons.
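To make that policy concrete, here is a minimal sketch of how one might sort sequence names into these groups. It assumes UCSC-style names as used in hg18/hg19 (for example chr1_gl000191_random, chrUn_gl000211, chr6_cox_hap1); the function names classify_contig and keep_for_analysis are made up for illustration and are not part of any tool.

```python
import re

def classify_contig(name):
    """Classify a UCSC-style sequence name (hg18/hg19 naming assumed)."""
    if name.endswith("_random"):
        return "unlocalized"      # chromosome known, exact position unknown
    if name.startswith("chrUn"):
        return "unplaced"         # chromosome of origin unknown
    if re.search(r"_hap\d+$", name):
        return "alt_haplotype"    # alternative haplotype sequence
    if re.fullmatch(r"chr(\d{1,2}|X|Y|M)", name):
        return "primary"          # assembled chromosome
    return "other"

def keep_for_analysis(name):
    """Policy described above: drop only chr_hap, keep everything else."""
    return classify_contig(name) != "alt_haplotype"

names = ["chr1", "chr1_gl000191_random", "chrUn_gl000211", "chr6_cox_hap1"]
kept = [n for n in names if keep_for_analysis(n)]
```

With other reference builds the naming differs, so the patterns would need adjusting.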
Jorge, can you elaborate on how you treat a variant you identify on a chrN_random? Specifically, what if you want to see whether the variant has been found before or what the frequency of the allele is (i.e. in dbSNP and/or 1000 Genomes), or what about deleteriousness prediction with SIFT or PolyPhen? Did you ever get a reasonably meaningful result from these, or do you know a paper that argues so?
As far as I know, these are genome contigs whose exact position is not certain. Many factors affect genome assembly, so the consortium did not integrate some contigs into the chromosomes and just labeled them chrUn_ (chromosome of origin unknown) or chr1_random (known to come from chr1). For hg19, patches are released when the consortium integrates a contig into the genome (in hg19 the contigs keep their own coordinates, so an integration patch doesn't affect the coordinates of sequence that is already placed).
For more information, you may check this:
http://www.ncbi.nlm.nih.gov/projects...initions.shtml
From what I checked, some chrUn contigs also carry variants of rRNAs and the like. So I think you'd better exclude the chr_random contigs first: their annotation just duplicates what is already annotated, so the results may be false positives. Actually, perhaps we should map those reads, as variants, to the annotation records of the reference chromosomes.
Unplaced and unlocalized contigs are not patches. And generally you would not want to integrate the real patches. Patches contain long flanking sequences identical to the reference genome. Integrating them leads to the loss of true mappings. Also, if you see rRNAs in unplaced/unlocalized contigs, GRC believes they are true alternative copies in one individual.
I also work on NGS data analysis, although my work is more focused on TF binding and yours on SNP analysis.
We mostly discard chr_random in order to avoid any ambiguity in our data analysis.
To check how chr_random is skewing your data, just perform two analyses on the same dataset, one with the chr_random regions and one without, load both into a genome browser, and you will see the difference.
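Besides looking at it in a browser, the two runs can also be compared programmatically. The following is only a sketch: it assumes both analyses produced plain-text VCF records (tab-separated, CHROM/POS/ID/REF/ALT in the first five columns), and the helper names variant_keys and compare_callsets are made up for this example.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys

def compare_callsets(with_contigs, without_contigs):
    """Compare calls from a run that included chr_random with one that did not."""
    a = variant_keys(with_contigs)
    b = variant_keys(without_contigs)
    return {
        "shared": a & b,
        "only_with_contigs": a - b,
        # calls that appear only when the contigs are left out of the reference
        "only_without_contigs": b - a,
    }
```

The "only_without_contigs" set is the one worth inspecting first, since those calls exist only when the contigs are absent from the reference.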
Seeing the difference is exactly the reason why chr_random should be included. Most people do not care about the SNPs/signals in unlocalized/unplaced contigs, but we do care about false SNPs/signals caused by reads that come from these contigs but are wrongly mapped to chromosomal regions.
lh3 is right. Including these random contigs in the pipeline certainly increases the mapping time, but it definitely improves the mapping (and downstream) results by absorbing reads that would otherwise map to the wrong place and so lower variant calling power and quality.
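One rough way to see how many reads the extra contigs absorb is to total the mapped-read counts per sequence. The sketch below assumes the four-column, tab-separated text that samtools idxstats prints (sequence name, length, mapped reads, unmapped reads) and UCSC-style chromosome names; the function name summarize_idxstats is hypothetical.

```python
import re

PRIMARY = re.compile(r"chr(\d{1,2}|X|Y|M)")  # assumed UCSC-style names

def summarize_idxstats(text):
    """Sum mapped reads on assembled chromosomes vs. all other sequences."""
    totals = {"primary": 0, "non_primary": 0}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")[:4]
        if name == "*":
            continue  # summary row for reads with no reference sequence
        group = "primary" if PRIMARY.fullmatch(name) else "non_primary"
        totals[group] += int(mapped)
    return totals

example = (
    "chr1\t249250621\t1000\t5\n"
    "chr1_gl000191_random\t106433\t40\t0\n"
    "chrUn_gl000211\t166566\t10\t0\n"
    "*\t0\t0\t25\n"
)
totals = summarize_idxstats(example)
```

The reads counted under "non_primary" are the ones that would have had to go somewhere else, or nowhere, had the contigs been left out of the reference.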
Do you mean this?
That's what I felt too, thanks for sharing your experience Jorge.