So my question comes in two parts:
First of all is what I'm trying to do within reason given the tools I am using?
I am investigating the shuffling effects of a recombinase on a known reporter sequence which subsequently generates libraries of unique sequences. By simulating all of the possible outcomes through a python script I am given 384 unique permutations that could potentially arise from the shuffling of the cassette. I am attempting therefore to map my next-gen sequencing reads to this library of simulated sequences as a reference using bbmap(bbsplit.sh) to bin the reads according to their alignment to the respective references. I have simulated references as fasta and fastq files but am running into the following error:
bbsplit.sh ref=Rci4_shuffled_library.fasta in=Rci4_test_reads.fasta basename=o%.fastq outu1=cleaned.fastq
java -ea -Xmx8353m -Xms8353m -cp /home/********/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 ref=Rci4_shuffled_library.fasta in=Rci4_test_reads.fasta basename=o%.fastq outu1=cleaned.fastq
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ref=Rci4_shuffled_library.fasta, in=Rci4_test_reads.fasta, basename=o%.fastq, outu1=cleaned.fastq]
Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=Rci4_test_reads.fasta, basename=o%.fastq, outu1=cleaned.fastq, ref_Rci4_shuffled_library=Rci4_shuffled_library.fasta]
Merged reference file ref/genome/1/merged_ref_9223372035509296106.fa.gz already exists; skipping merge.
Ref merge time: 0.036 seconds.
Exception in thread "main" java.lang.AssertionError: Invalid input file: 'Rci4_test_reads.fasta'
at align2.AbstractMapper.preparse0(AbstractMapper.java:822)
at align2.AbstractMapper.<init>(AbstractMapper.java:54)
at align2.BBMap.<init>(BBMap.java:42)
at align2.BBMap.main(BBMap.java:30)
at align2.BBSplitter.main(BBSplitter.java:48)
So...
1) Is what I'm trying to do achievable with bbmap?
2) If so, why is it failing to recognise my input files? I have tried both the fasta and fastq format for my input files but am hit with the same error message. I have also tried increasing my heap size using -Xmx% but no matter what I select I am told it is an invalid maximum heap size. I tried to check my settings using:
java -XshowSettings:vm
which returns an (estimated) max heap size of 3.10G which is above the inputs I have tried.
P.S. I'm new to linux so please excuse my coding illiteracy. I'm running Ubuntu on a Windows PC if that helps too.
If you wish to bin the reads using
bbsplit.sh
then you will need to provide those references as individual fasta files. Idea of usingbbsplit
is you provide multiplereference genomes
and then get reads binned according to how they map. e.g.bbsplit.sh in=reads.fq ref=seq1.fa,seq2.fa..seq384.fa
You could simply align the data to your 384 read reference using plain
bbmap.sh
. That would give you an aligned data file. You could then runsamtools idxstats
to get number of reads aligned to each of your reference.Choose one of the two methods above depending on your ultimate goal.
So as for the first method, I'm trying to avoid manually inputting individual files as this is laborious and inefficient, especially if I am to be working with multiple variations of similar but distinct shuffled libraries. If there's no way to do this with bbsplit that's fair enough. I was originally using minimap2 instead but bbsplit caught my attention for being able to handle multiple reference sequences and bin them. I was under the impression this was possible by inputting a single database.
For the second option I may use minimap2 instead for the alignment which I do know accepts reference databases as single files and will try using idxstats to get read numbers. I should have mentioned, my ultimate goal is to output the reads aligned to each reference as individual files (per reference) that could then be directly input into a visualisation script which calls from a database to determine the block order in the shuffled cassete reads. So something like:
Which could easily be turned into a heatmap or something similar to visually represent the block positions and relative frequencies.
not sure but it might be it fails because you enter
fasta
format and requestfastq
as output ?if you say both types of input fail, can you post an extract of both here? first dozen lines of each should do.