I have parsed out a fastq file and have a text file like
read_id read_count
aaaa 10
bbbb 1000
ccccc 10000
dddd 1
and so on. It is clear that some read_ids which have very large or very small read_counts are outliers. Can you suggest a scheme I could use in R/python/pandas with which I can systematically filter for reads which lie within a range of values (based on mean or median of the read_counts)?
Thanks
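For illustration, one robust, biology-agnostic version of the scheme I have in mind is a median/MAD filter in pandas (column names taken from the example above; the cutoff `k = 3` is an arbitrary starting point, and since read counts often span orders of magnitude you may want to apply the same filter to log-transformed counts instead):

```python
import pandas as pd

# Example counts in the same shape as the parsed fastq summary
df = pd.DataFrame({
    "read_id": ["aaaa", "bbbb", "ccccc", "dddd"],
    "read_count": [10, 1000, 10000, 1],
})

# Median and median absolute deviation (MAD) are robust to the
# extreme counts that would distort a mean/SD-based cutoff
median = df["read_count"].median()
mad = (df["read_count"] - median).abs().median()

# Keep reads whose count lies within k MADs of the median;
# k = 3 is a common but arbitrary choice
k = 3
filtered = df[(df["read_count"] - median).abs() <= k * mad]
print(filtered)
```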
Is the read ID the ID from the sequencer or is this some sort of ID given to a specific sequence? In other words, how were these numbers generated and what's the biological context? We'd need answers to those questions to give you a reliable answer.
The reads are aligned with bowtie to a library of oligos whose IDs are given in the read_id column. The read_count values illustrate the large range of values. Example: sequence dddd is covered only 1 time, but sequence ccccc is covered 10000 times.
Right, but what then do the oligos represent? These could be assembled contigs, in which case the raw count is less interesting than the median depth. Alternatively, these could be small RNAs, in which case the high counts are biologically meaningful. One could conceive of additional possibilities. You still need to give us more details, since the answer depends on the biological context of the data.