I have parsed out a fastq file and have a text file like
read_id read_count
aaaa 10
bbbb 1000
ccccc 10000
dddd 1
and so on. It is clear that some read_ids which have very large or very small read_counts are outliers. Can you suggest a scheme I could use in R/python/pandas with which I can systematically filter for reads which lie within a range of values (based on mean or median of the read_counts)?
Thanks
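For illustration, one robust, biology-agnostic version of the scheme I have in mind is a median/MAD filter in pandas (column names taken from the example above; the cutoff `k = 3` is an arbitrary starting point, and since read counts often span orders of magnitude you may want to apply the same filter to log-transformed counts instead):

```python
import pandas as pd

# Example counts in the same shape as the parsed fastq summary
df = pd.DataFrame({
    "read_id": ["aaaa", "bbbb", "ccccc", "dddd"],
    "read_count": [10, 1000, 10000, 1],
})

# Median and median absolute deviation (MAD) are robust to the
# extreme counts that would distort a mean/SD-based cutoff
median = df["read_count"].median()
mad = (df["read_count"] - median).abs().median()

# Keep reads whose count lies within k MADs of the median;
# k = 3 is a common but arbitrary choice
k = 3
filtered = df[(df["read_count"] - median).abs() <= k * mad]
print(filtered)
```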
Is the read ID the ID from the sequencer or is this some sort of ID given to a specific sequence? In other words, how were these numbers generated and what's the biological context? We'd need answers to those questions to give you a reliable answer.
The reads are aligned with bowtie to a library of oligos whose IDs are given in the read_id column. The read_count values illustrate the large range of values. Example: sequence dddd is covered only 1 time, but sequence ccccc is covered 10000 times.
Right, but what then do the oligos represent? These could be assembled contigs, in which case the raw count is less interesting than the median depth. Alternatively, these could be small RNAs, in which case the high counts are biologically meaningful. One could conceive of additional possibilities. You still need to give us more details, since the answer depends on the biological context of the data.