I am going to be working with a lot of Illumina next generation sequencing data, but I will be doing something I suspect is rather unusual: Within the data, there will be a vast majority of sequences that will map fairly well to my reference genome. However, those are not the sequences I am interested in: I want to retrieve all the reads that do not map well to my reference genome.
Update: To clarify, within my dataset I have a vast majority of the reads that will match to my reference genome. However, a small portion of the dataset does (should) not match at all (i.e. I also exclude SNPs, short indels, etc.). I want to reduce my dataset to those reads that do not correspond to the conserved parts of the genome in order to do de novo assembly on them, without the overhead of assembling the whole genome from scratch. End of Update.
- Is this something commonly done?
- Are the usual tools capable of doing that for me or will I need to implement something myself?
- Any suggestions on the best method(s) to proceed?
Just to be sure, when you bitwise AND with "4", you'll get those reads that are unmapped. If you are interested in reads that
do not map WELL to your reference
, they could also mean reads that areNOT uniquely mapped
. Usually, softwares indicate this with an optional flag. In case of BWA, it would beXT:A:
and it would beXT:A:U
in case of uniquely mapped and you would be interested in all other values for that. In case of tophat/bowtie, it would beNH:i:
optional flag, whereNH:i:1
denotes a read with exactly 1 hit. So you would be interested in all other numbers forNH:i:
.Thanks for the complement of information. I'll clarify in the question what I'm looking for!