I have a fastq file from human RNA-seq data, which did not have a good mapping rate (9%) when I mapped it to the genome. So I wanted to look at some of the over-represented sequences in the reads, as that might give me some idea about what is going on. So, I wanted to know what are some good tools that can be used to find these over-represented sequences.
Initially, I used FASTQC to find them, but I read in the Biostars Handbook course that it's results are not very reliable, and they suggested Jellyfish. But the problem with Jellyfish is that you need to specify the length of the k-mer that you want to find, which is undesirable for me as I don't know what length sequences are over-represented in my data.
So, what is a good tool that can help me find over-represented sequences in my reads, without me having to specify the precise length of the sequence to find?
Did you try fastqc ?
yes, but as I mentioned, I read in the Biostars Handbook course that it's results are not very reliable
Use
clumpify
from BBMap to see if you have a problem with PCR/optical duplicates ( A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files )Going after sub-sequence size regions that are over-represented may be a futile exercise in terms of figuring out why you have low alignments. What have you discovered (using blast for example) about the rest 91% reads that do not map?
Thanks for your comment. I tried to blast a couple of the unmapped reads in BLAST, and the top hits were uncultured prokaryote gene, and a Streptomyces gene respectively. Many others did not return a hit when doing a BLAST search in the 'highly similar sequences' category (which is the default). What does that suggest?
You could potentially have a contamination problem. For that I suggest that you use
bbsplit.sh
from BBMap suite with human genome and bin the reads into two pools (ones that map to human and rest that don't). Depending on how many reads end up in human bin you can decide if it is worth spending time on trying to figure out what the rest of the sequences are or to drop this sample.why do you have 2 files in1 and in2 ? Are they the mapped and unmapped reads (fastq converted from the bam)?
If you have paired end reads. If you have single end reads then use
in=
andoutu=
instead. These are original data files.use a loop ?
Yes, that's a good idea, but with Jellyfish, it's not easy to get the most over-represented 'sequence' (based on the minimal experience I have with it). It allows you to dump the mer_counts to a fasta file where you can see the k-mers, but they are randomly distributed across the fasta file (not sorted). But yes, I can sort them. I was wondering if there is a tool to find over-represented sequences just like FASTQC does, but I guess I can work with jellyfish for now. Thanks!
You can try
kmercountexact.sh
from BBMap. Make sure you have enough memory available.I tried it, but it's output is just a list of depth vs count, like -
#Depth Count
1 50165628
2 7292674
3 2951163
4 1563013
5 964075
6 652937
7 474538
8 360501
9 285766
10 232038
...
So, I don't know what exact 'sequences' are over-represented?Try this:
kmercountexact.sh in=your.fq.gz out=stdout fastadump=f | sort -k2,2rn - > sorted_kmer.out
Default value for k-mer is 31.You should get: