Question

basic command in bash

0

Entering edit mode

6.0 years ago

luzglongoria ▴ 50

Hi there,

I'm working with parasite transcriptomes and (of course) my reads are mixed with host (bird) and parasite (malaria). I order to keep the maximum amount of parasite RNA I have aligned my reads with a close related parasite species and keep these reads. After that, I have aligned all the reads with a close related bird species genome and only keep reads that do not map with it. Anyway, I still think that I have quite a lot reads from the bird (from the second analyses). I know that avian malaria parasites have very high AT content so, in order to only keep parasite reads I would like to filter by AT content. Maybe it's a very basic question buy I would like to know how to do this filtering. Probably it is with the command "grep" but not sure how to do it. Any suggestions? Thanks in advance

Ps: mapping was performed with hisat2 and how I have one .fq file containing all the reads that did not map with the close related bird species (the one I want to filter) and another .fq file containing reads that did map with the close related parasite species (I don't need to filter this one)

RNA-Seq filter grep • 1.0k views

ADD COMMENT • link 6.0 years ago by luzglongoria ▴ 50

0

Entering edit mode

I'm not sure filtering by nucleotide content alone is necessarily a good idea, since the reads are going to be kind of short so you will might still get reads overlapping in the 2 species.

It's worth a go though, so you might want to check out this thread:

https://stackoverflow.com/questions/4451003/binning-sequence-reads-by-gc-content

While you could in theory do this with bash commands and grep work, it would be quite cumbersome, so you're probably best just appealing to a python solution or something (assuming speed doesn't become an issue).

ADD REPLY • link 6.0 years ago by Joe 21k

0

Entering edit mode

I don't think this is that simple, if you wanted to do it by yourself using bash scripting you would need to do something like the following:

Print the read header and its corresponding AT% in a new file.
Given an AT% threshold or range, keep only the headers of the reads you want
Grep for these headers on the original reads file and their corresponding info (sequence, separator and quality if its fastq format).

Not the hardest thing in the world as well, but requires some work and speed of the process might be an issue. You could also try to look for some sort of already existant tool that does this.

ADD REPLY • link 6.0 years ago by n,n ▴ 370