2.8 years ago
O.rka
I have one sample that is WAY different than all the others and I want to find out why. The first thing I want to do is to see if it is more or less complex than the other samples.
Are there any tools to do this?
kmergenie for k-mers. How about GC% calculation or nucleotide distribution plots?
I was looking into this but it doesn't work on fastq nor does it work with stdin. I have a bunch of fastq files that are gzipped and I really don't have the storage to decompress and convert to fasta. I know I could run it one at a time and delete as I go but that seems like a roundabout way. There's gotta be another option. I've tried installing gerbil but I can't get the dependencies to work. I use conda for all of my environments and most of the newer tools don't have conda recipes so getting their dependencies to work is not trivial.
Copy/pasted from kmergenie readme (http://kmergenie.bx.psu.edu/README):
Kmergenie also takes a list (of files). GC% can be calculated by tools such as seqkit for each file or bunch of files. Sourmash can be run on fastq.gz files (https://sourmash.readthedocs.io/en/latest/tutorial-basic.html).
Weird?! I had an error when I ran it. I'll give it another try. Maybe it was an outdated version. Thank you.
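For the GC% and nucleotide distribution question, here is a minimal sketch that streams the gzipped FASTQ directly, so nothing has to be decompressed to disk. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function names are made up for illustration, and a dedicated tool such as seqkit is likely the more robust choice.

```python
import gzip
from collections import Counter


def fastq_sequences(path):
    """Yield the sequence line of each record in a gzipped 4-line FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # line 2 of every 4-line record is the sequence
                yield line.strip().upper()


def gc_and_position_counts(seqs):
    """Return (GC fraction, per-position base counts) for an iterable of reads.

    Only A, C, G and T count toward the GC denominator, so N bases are ignored
    there, but every character is tallied in the per-position counts.
    """
    gc = total = 0
    per_position = []  # per_position[i] is a Counter of bases seen at position i
    for seq in seqs:
        for pos, base in enumerate(seq):
            if pos == len(per_position):
                per_position.append(Counter())
            per_position[pos][base] += 1
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(b) for b in "ACGT")
    return (gc / total if total else 0.0), per_position
```

Something like `gc_and_position_counts(fastq_sequences("sample.fastq.gz"))` (hypothetical file name) would give one GC fraction per sample plus the counts needed for a per-position nucleotide plot.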
Look at the compression of the FASTQ? The less complex the FASTQ is, the more it will be compressed :-P
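The compression idea can be sketched roughly like this. Comparing the `.gz` file sizes directly mixes in read headers and quality strings, so this sketch compresses only the sequence lines and caps the number of reads so samples are compared on the same amount of input. The function names and the `max_reads` default are assumptions for illustration, and the ratio is only a crude proxy for complexity.

```python
import gzip
import zlib


def read_sequences(path):
    """Yield sequence lines from a gzipped 4-line FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip().upper()


def compression_ratio(seqs, max_reads=100_000):
    """Return compressed size / raw size for up to max_reads sequences.

    A lower ratio means the reads are more redundant (less complex).
    """
    # zip stops after max_reads items without reading the rest of the file
    raw = "\n".join(s for _, s in zip(range(max_reads), seqs)).encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)
```

Running `compression_ratio(read_sequences(path))` on each sample and looking for the outlier would be the quick check. Read length and read order can also shift the ratio, so it is best treated as a first look rather than a measurement.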
Sourmash can compare multiple genomes/reads using FracMinHash and build a tree. It's a kind of clustering, but it may not tell you why one sample is different from the others.
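To illustrate the idea behind FracMinHash, here is a toy sketch, not sourmash's implementation: it hashes canonical k-mers and keeps only those whose hash falls in the lowest 1/`scaled` fraction of the hash space, then compares samples by Jaccard similarity of the kept hashes. The hash function (BLAKE2b truncated to 64 bits), the defaults, and all function names are assumptions chosen for illustration, so the output is not compatible with sourmash signatures.

```python
import hashlib

COMP = str.maketrans("ACGT", "TGCA")


def fracminhash(seq, k=21, scaled=1000):
    """Return the set of k-mer hashes below (2**64 / scaled) for one sequence."""
    threshold = 2**64 // scaled
    kept = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # skip k-mers with ambiguous bases
        # Canonical form: a k-mer and its reverse complement hash identically
        canonical = min(kmer, kmer.translate(COMP)[::-1])
        digest = hashlib.blake2b(canonical.encode("ascii"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h < threshold:
            kept.add(h)
    return kept


def sketch_reads(seqs, k=21, scaled=1000):
    """Union the per-read sketches into one sketch per sample."""
    sketch = set()
    for seq in seqs:
        sketch |= fracminhash(seq, k, scaled)
    return sketch


def jaccard(a, b):
    """Jaccard similarity of two sketches (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A sample whose sketch has a low Jaccard similarity to every other sample is the odd one out. The size of each sketch is also a rough indicator of how many distinct k-mers the sample contains.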
For RNA-seq, why not continue with the downstream analysis to see whether the difference comes from abnormal expression of certain transcripts?