Question

Is there a tool that will tell me how "complex" a FASTQ file is (e.g., the number of unique k-mers)?

3.7 years ago

O.rka ▴ 750

I have one sample that is WAY different from all the others and I want to find out why. The first thing I want to do is see whether it is more or less complex than the other samples.

Are there any tools to do this?

genomics rnaseq fastq • 1.8k views

kmergenie for k-mers. How about GC% calculation or nucleotide distribution plots?
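For the GC% idea, a minimal sketch using only gzip and awk, so nothing is decompressed to disk. The function name `fastq_gc_percent` is made up for illustration, and the code assumes plain four-line FASTQ records (no wrapped sequence lines):

```shell
# Print the overall GC percentage of a gzipped FASTQ file.
# Assumption: each record is exactly 4 lines, so sequences sit on lines 2, 6, 10, ...
# Note: N bases count toward the denominator.
fastq_gc_percent() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {
            seq = toupper($0)
            total += length(seq)
            # gsub returns the number of replacements, i.e. the number of G/C bases
            gc += gsub(/[GC]/, "", seq)
        }
        END {
            if (total > 0) printf "%.2f\n", 100 * gc / total
            else print "NA"
        }'
}
```

Running it over every sample, e.g. `for f in *.fastq.gz; do printf '%s\t' "$f"; fastq_gc_percent "$f"; done`, should make an outlier in base composition easy to spot.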

ADD REPLY • link 3.7 years ago by cpad0112 21k

I was looking into this, but it doesn't work on FASTQ, nor does it work with stdin. I have a bunch of gzipped FASTQ files and I really don't have the storage to decompress them and convert to FASTA. I know I could run it one file at a time and delete as I go, but that seems like a roundabout way. There's gotta be another option. I've tried installing gerbil but I can't get the dependencies to work. I use conda for all of my environments, and most of the newer tools don't have conda recipes, so getting their dependencies to work is not trivial.
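If only a rough distinct k-mer count is needed, one option that reads the .gz directly and never writes decompressed data is a small awk pipeline. This is a sketch under stated assumptions, not a replacement for a real k-mer counter: `count_unique_kmers` is a hypothetical name, k-mers are not canonicalised (a k-mer and its reverse complement count separately), every distinct k-mer is held in memory, and four-line FASTQ records are assumed:

```shell
# Usage: count_unique_kmers reads.fastq.gz [k] [max_reads]
# Prints: distinct_kmers <TAB> total_kmers <TAB> distinct/total
# k defaults to 21; max_reads defaults to 0, meaning "use all reads".
count_unique_kmers() {
    gzip -dc "$1" | awk -v k="${2:-21}" -v max="${3:-0}" '
        NR % 4 == 2 {
            # Stop early once the optional read cap is reached
            if (max > 0 && ++reads > max) exit
            seq = toupper($0)
            n = length(seq) - k + 1
            for (i = 1; i <= n; i++) {
                kmer = substr(seq, i, k)
                if (kmer ~ /N/) continue      # skip k-mers with ambiguous bases
                total++
                if (!(kmer in seen)) { seen[kmer] = 1; distinct++ }
            }
        }
        END {
            if (total > 0) printf "%d\t%d\t%.4f\n", distinct, total, distinct / total
            else print "0\t0\tNA"
        }'
}
```

To keep memory bounded on large files, cap the number of reads with the third argument, e.g. `count_unique_kmers sample1.fastq.gz 21 1000000`, and use the same cap for every sample so the distinct/total ratios are comparable.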

ADD REPLY • link 3.7 years ago by O.rka ▴ 750

Copy/pasted from kmergenie readme (http://kmergenie.bx.psu.edu/README):

reads_file is either a single FASTA, FASTQ, FASTA.gz, FASTQ.gz file or a list of file names, one per line. For example:

KmerGenie also takes a list of files. GC% can be calculated for each file, or for a batch of files, with tools such as seqkit. Sourmash can be run on fastq.gz files (https://sourmash.readthedocs.io/en/latest/tutorial-basic.html).

ADD REPLY • link 3.7 years ago by cpad0112 21k

Weird?! I had an error when I ran it. I’ll give it another try. Maybe it was an outdated version. Thank you

ADD REPLY • link 3.7 years ago by O.rka ▴ 750

Look at the compression of the FASTQ? The less complex the FASTQ is, the more it will be compressed. :-P

gunzip -c sample1.fastq.gz | paste - - - - | cut -f 2 | sort -T . | gzip --best | wc -c
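The raw byte count from that pipeline depends on how many reads the sample has, so it can help to divide by the uncompressed size. A sketch of that normalisation, assuming four-line FASTQ records; the function name `seq_compression_ratio` is hypothetical, and `awk 'NR % 4 == 2'` is used here as a stand-in for the `paste | cut` step:

```shell
# Print compressed_bytes / raw_bytes for the sorted sequence lines of a
# gzipped FASTQ file. Lower values suggest more redundant (less complex) reads.
# The body runs in a subshell so the helper variables do not leak.
seq_compression_ratio() (
    # Uncompressed size of the sequence lines (bases plus newlines)
    raw=$(gzip -dc "$1" | awk 'NR % 4 == 2' | wc -c)
    # Size after sorting (which groups identical reads) and compressing
    packed=$(gzip -dc "$1" | awk 'NR % 4 == 2' | LC_ALL=C sort -T . | gzip --best | wc -c)
    awk -v r="$raw" -v p="$packed" 'BEGIN {
        if (r > 0) printf "%.4f\n", p / r
        else print "NA"
    }'
)
```

Comparing `seq_compression_ratio sample1.fastq.gz` across samples then gives a number that is less sensitive to sequencing depth, although very small files are still dominated by the fixed gzip overhead.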

ADD REPLY • link 3.7 years ago by Pierre Lindenbaum 166k

Sourmash can compare multiple genomes or read sets using FracMinHash and build a tree. It's a kind of clustering, but it may not tell you why one sample is different from the others.

For RNA-seq, why not continue with the downstream analysis to see whether the difference comes from abnormal expression of certain transcripts?

ADD REPLY • link 3.7 years ago by shenwei356 8.7k