Question

Ngs: Filtering Out Reads That Have A Hit...

2

Entering edit mode

12.6 years ago

Nicojo ★ 1.1k

I am going to be working with a lot of Illumina next generation sequencing data, but I will be doing something I suspect is rather unusual: Within the data, there will be a vast majority of sequences that will map fairly well to my reference genome. However, those are not the sequences I am interested in: I want to retrieve all the reads that do not map well to my reference genome.

Update: To clarify, within my dataset I have a vast majority of the reads that will match to my reference genome. However, a small portion of the dataset does (should) not match at all (i.e. I also exclude SNPs, short indels, etc.). I want to reduce my dataset to those reads that do not correspond to the conserved parts of the genome in order to do de novo assembly on them, without the overhead of assembling the whole genome from scratch. End of Update.

Is this something commonly done?
Are the usual tools capable of doing that for me or will I need to implement something myself?
Any suggestions on the best method(s) to proceed?

next-gen • 3.0k views

ADD COMMENT • link updated 12.6 years ago by Ashutosh Pandey 12k • written 12.6 years ago by Nicojo ★ 1.1k

1

Entering edit mode

Just to be sure, when you bitwise AND with "4", you'll get those reads that are unmapped. If you are interested in reads that do not map WELL to your reference, they could also mean reads that are NOT uniquely mapped. Usually, softwares indicate this with an optional flag. In case of BWA, it would be XT:A: and it would be XT:A:U in case of uniquely mapped and you would be interested in all other values for that. In case of tophat/bowtie, it would be NH:i: optional flag, where NH:i:1 denotes a read with exactly 1 hit. So you would be interested in all other numbers for NH:i:.

ADD REPLY • link 12.6 years ago by Arun 2.4k

1

Entering edit mode

Thanks for the complement of information. I'll clarify in the question what I'm looking for!

ADD REPLY • link 12.6 years ago by Nicojo ★ 1.1k

score 5 · Answer 1 · 2012-05-21

5

Entering edit mode

12.6 years ago

Leonor Palmeira 3.9k

This can be done fairly easily with classical NGS alignment tools like bwa. In bwa, once you have run the whole thing, just look for the reads that were flagged "4" in the .sam file output (i.e. no match).

ADD COMMENT • link 12.6 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

Thanks for your quick answer ;)

ADD REPLY • link 12.6 years ago by Nicojo ★ 1.1k

0

Entering edit mode

No problem, glad I could help.

ADD REPLY • link 12.6 years ago by Leonor Palmeira 3.9k

score 1 · Answer 2 · 2012-05-22

1

Entering edit mode

12.6 years ago

Ashutosh Pandey 12k

Normally for most of the aligners there is some option where you can output the unaligned reads into a different file. Otherwise look for the reads that were flagged "4" as Leonor suggested. In case of paired end reads "5", "13" and some other flags may apply too. (See http://picard.sourceforge.net/explain-flags.html) to select the read pairs or read pair that is unaligned.

ADD COMMENT • link 12.6 years ago by Ashutosh Pandey 12k

1

Entering edit mode

It's Leonor, please.

ADD REPLY • link 12.6 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

Edited. Please accept my sincere apology.

ADD REPLY • link 12.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks: I'll look for that option to output the unaligned reads!

ADD REPLY • link 12.6 years ago by Nicojo ★ 1.1k

0

Entering edit mode

tophat does it. I am not sure if you have to use the --keep-tmp option though. bwa provides both aligned and unaligned reads in the same file.

ADD REPLY • link 12.6 years ago by Arun 2.4k

0

Entering edit mode

And you can always look for 4 and bitwise AND with 4, even if the read is paired.

ADD REPLY • link 12.6 years ago by Arun 2.4k

score 0 · Answer 3 · 2012-05-21

0

Entering edit mode

12.6 years ago

swbarnes2 14k

With bwa, you can parse by the NM tag. NM:A:0 means that the read mapped perfectly. NM:A:3, for instnace, would mean three mismatches.

ADD COMMENT • link 12.6 years ago by swbarnes2 14k