Question

Finding over-represented sequences in fastq file

0

Entering edit mode

7.4 years ago

c_u ▴ 530

I have a fastq file from human RNA-seq data, which did not have a good mapping rate (9%) when I mapped it to the genome. So I wanted to look at some of the over-represented sequences in the reads, as that might give me some idea about what is going on. So, I wanted to know what are some good tools that can be used to find these over-represented sequences.

Initially, I used FASTQC to find them, but I read in the Biostars Handbook course that it's results are not very reliable, and they suggested Jellyfish. But the problem with Jellyfish is that you need to specify the length of the k-mer that you want to find, which is undesirable for me as I don't know what length sequences are over-represented in my data.

So, what is a good tool that can help me find over-represented sequences in my reads, without me having to specify the precise length of the sequence to find?

RNA-Seq fastq • 4.8k views

ADD COMMENT • link 7.4 years ago by c_u ▴ 530

0

Entering edit mode

Did you try fastqc ?

ADD REPLY • link 7.4 years ago by cpad0112 21k

0

Entering edit mode

yes, but as I mentioned, I read in the Biostars Handbook course that it's results are not very reliable

ADD REPLY • link 7.4 years ago by c_u ▴ 530

0

Entering edit mode

Use clumpify from BBMap to see if you have a problem with PCR/optical duplicates ( A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files )

Going after sub-sequence size regions that are over-represented may be a futile exercise in terms of figuring out why you have low alignments. What have you discovered (using blast for example) about the rest 91% reads that do not map?

ADD REPLY • link 7.4 years ago by GenoMax 150k

0

Entering edit mode

Thanks for your comment. I tried to blast a couple of the unmapped reads in BLAST, and the top hits were uncultured prokaryote gene, and a Streptomyces gene respectively. Many others did not return a hit when doing a BLAST search in the 'highly similar sequences' category (which is the default). What does that suggest?

ADD REPLY • link 7.4 years ago by c_u ▴ 530

1

Entering edit mode

You could potentially have a contamination problem. For that I suggest that you use bbsplit.sh from BBMap suite with human genome and bin the reads into two pools (ones that map to human and rest that don't). Depending on how many reads end up in human bin you can decide if it is worth spending time on trying to figure out what the rest of the sequences are or to drop this sample.

bbsplit.sh in1=reads1.fq.gz in2=reads2.fq.gz ref=human.fa basename=out_%_#.fq outu1=human_R1.fq.gz outu2=human_R2.fq.gz ambig2=toss &> bbsplit.log

ADD REPLY • link 7.4 years ago by GenoMax 150k

0

Entering edit mode

why do you have 2 files in1 and in2 ? Are they the mapped and unmapped reads (fastq converted from the bam)?

ADD REPLY • link 7.4 years ago by c_u ▴ 530

0

Entering edit mode

If you have paired end reads. If you have single end reads then use in= and outu= instead. These are original data files.

ADD REPLY • link 7.4 years ago by GenoMax 150k

0

Entering edit mode

which is undesirable for me as I don't know what length sequences are over-represented in my data.

use a loop ?

ADD REPLY • link 7.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Yes, that's a good idea, but with Jellyfish, it's not easy to get the most over-represented 'sequence' (based on the minimal experience I have with it). It allows you to dump the mer_counts to a fasta file where you can see the k-mers, but they are randomly distributed across the fasta file (not sorted). But yes, I can sort them. I was wondering if there is a tool to find over-represented sequences just like FASTQC does, but I guess I can work with jellyfish for now. Thanks!

ADD REPLY • link 7.4 years ago by c_u ▴ 530

1

Entering edit mode

You can try kmercountexact.sh from BBMap. Make sure you have enough memory available.

 kmercountexact.sh

Written by Brian Bushnell
Last modified October 18, 2017

Description:  Counts the number of unique kmers in a file.
Generates a kmer frequency histogram and genome size estimate (in peaks output),
and prints a file containing all kmers and their counts.
Supports K=1 to infinity, though not all values are allowed.
SEE ALSO: bbnorm.sh/khist.sh loglog.sh, and kmercountmulti.sh.

Usage:   kmercountexact.sh in=<file> khist=<file> peaks=<file> out=<file>

Input may be fasta or fastq, compressed or uncompressed.
Output may be stdout or a file.  out, khist, and peaks are optional.

ADD REPLY • link 7.4 years ago by GenoMax 150k

0

Entering edit mode

I tried it, but it's output is just a list of depth vs count, like - #Depth Count 1 50165628 2 7292674 3 2951163 4 1563013 5 964075 6 652937 7 474538 8 360501 9 285766 10 232038 ... So, I don't know what exact 'sequences' are over-represented?

ADD REPLY • link 7.4 years ago by c_u ▴ 530

0

Entering edit mode

Try this: kmercountexact.sh in=your.fq.gz out=stdout fastadump=f | sort -k2,2rn - > sorted_kmer.out Default value for k-mer is 31.

You should get:

GTGGTCTCCACTCCCGCCTTGACGGGGCTGC 1016
GTCTCCACTCCCGCCTTGACGGGGCTGCTAT 1006
GGTGGTCTCCACTCCCGCCTTGACGGGGCTG 1005
TCTCCACTCCCGCCTTGACGGGGCTGCTATC 1004
TGGTGGTCTCCACTCCCGCCTTGACGGGGCT 1003
TGGTCTCCACTCCCGCCTTGACGGGGCTGCT 1002
CTCCACTCCCGCCTTGACGGGGCTGCTATCT 999

ADD REPLY • link 7.4 years ago by GenoMax 150k