Hi there,
I'm working with parasite transcriptomes and (of course) my reads are mixed with host (bird) and parasite (malaria). I order to keep the maximum amount of parasite RNA I have aligned my reads with a close related parasite species and keep these reads. After that, I have aligned all the reads with a close related bird species genome and only keep reads that do not map with it. Anyway, I still think that I have quite a lot reads from the bird (from the second analyses). I know that avian malaria parasites have very high AT content so, in order to only keep parasite reads I would like to filter by AT content. Maybe it's a very basic question buy I would like to know how to do this filtering. Probably it is with the command "grep" but not sure how to do it. Any suggestions? Thanks in advance
Ps: mapping was performed with hisat2 and how I have one .fq file containing all the reads that did not map with the close related bird species (the one I want to filter) and another .fq file containing reads that did map with the close related parasite species (I don't need to filter this one)
I'm not sure filtering by nucleotide content alone is necessarily a good idea, since the reads are going to be kind of short so you will might still get reads overlapping in the 2 species.
It's worth a go though, so you might want to check out this thread:
https://stackoverflow.com/questions/4451003/binning-sequence-reads-by-gc-content
While you could in theory do this with bash commands and grep work, it would be quite cumbersome, so you're probably best just appealing to a python solution or something (assuming speed doesn't become an issue).
I don't think this is that simple, if you wanted to do it by yourself using bash scripting you would need to do something like the following:
Not the hardest thing in the world as well, but requires some work and speed of the process might be an issue. You could also try to look for some sort of already existant tool that does this.