filtering reads based on read counts
1
0
Entering edit mode
9.6 years ago

I have parsed out a fastq file and have a text file like

read_id      read-count
aaaa         10
bbbb         1000
ccccc        10000
dddd         1

and so on. It is clear that some read_ids which have very large or very small read_counts are outliers. Can you suggest a scheme I could use in R/python/pandas with which I can systematically filter for reads which lie within a range of values (based on mean or median of the read_counts)?

Thanks

next-gen R • 1.9k views
ADD COMMENT
1
Entering edit mode

Is the read ID the ID from the sequencer or is this some sort of ID given to a specific sequence? In other words, how were these numbers generated and what's the biological context? We'd need answers to those questions to give you a reliable answer.

ADD REPLY
0
Entering edit mode

The reads are aligned with bowtie to a library of oligos which have IDs given in the read_id column. The read_count values are an example of the large range of values. Example: sequence dddd is covered on 1 time but sequence ccccc is covered 10000 times.

ADD REPLY
1
Entering edit mode

Right, but what then do the oligos represent? These could be assembled contigs, in which case the raw count is less interesting that the median depth. Alternatively, these could be small RNAs, in which case the high counts are biologically meaningful. One could conceive of additional possibilities. You still need to give us more details, since the answer depends on the biological context of the data.

ADD REPLY
0
Entering edit mode
9.6 years ago
arnstrm ★ 1.9k

You need to be doing digital normalization of the reads. I've used Khmer before and it does this job very nicely.

See this article about Digital Normalization: http://arxiv.org/abs/1203.4802

EDIT: from you latest comment, it looks like you want to normalize RNAseq kind of data (counts table). There are plenty of RNA seq R packages that you can use it for this purpose. The method that I suggested work directly on the raw fastq file to remove overly redundant reads from the input file.

ADD COMMENT

Login before adding your answer.

Traffic: 2419 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6