Hello everyone, I'm aligning a Nanopore sequence fastq file using bbmap (I'm trying to use bbmap instead of minimap2 because with the latter I have losses of aligned sequences). I'm encountering this error for many reads and I can't figure out how to solve it. Probably my sequences are too long, but even using the mapPacBio.sh script, I get a similar error. What can I do? Below are my commands and an example of the error to keep it concise. Thanks to all
./bbmap.sh in=/home/input.fastq out=/home/input_1.sam ref=/home/reference_pi4ka/PI4KA.fasta maxindel=5800
Exception in thread "Thread-4" java.lang.AssertionError: Read 0, length 1597,
exceeds the limit of 600 You can map the reads in chunks by reformatting to fasta,
then mapping with the setting 'fastareadlen=600
bbmap.sh is designed for Illumina (or similar-length) reads; as GenoMax stated, mapPacBio.sh is the better option here. However, in general, you should use a long-read aligner like minimap2 for very long reads over ~6kbp if you want the long-range data.
The error you are mentioning should never occur in mapPacBio so it would help if you could give the command line and error message for that. Also, I don't understand what the problem is with minimap2; perhaps you could elaborate? Are not enough reads getting assigned to anything?
Thanks for the explanation Brian. In my initial experiment, I aligned my cDNA sequence from ONT MinION using minimap2. Then I built the bam file. I noticed that the alignment focused only on a specific region of the reference with a 10X coverage and even lower, that is, the IGV display cut off some sequences that were present anyway, as if it lost some sequences in alignment. In fact, with samtools flagstats I only noticed about 70% alignment. So I thought about extracting the reads with maximum length to try to recover something in the visualization but in reality this new fastq generated 0% alignment with minimap2. Surely there is something in the input sequences but I don't understand what. I hope it was clear.
It is possible that
minimap2
is soft-clipping parts of read that are not aligning. There is an option in IGV to show the clipped parts. You could turn that on to confirm. This is going to go back to the quality of your data. It is not likely thatminimap2
is making an error.Thanks GenoMax, I 've select the "Alignments" tab of preferences and check "Show soft-clipped bases" from IGV and this is my result
I tried to extract the maximum length read from the fastq file and noticed the presence of a bacterial contaminant in this sequence. Can this have an impact? How can I remove the contaminant in order to realign the clean sequence?
Based on the header example you posted in your other thread it looks like these are barcoded samples. Do you know if the barcodes were trimmed from the data?
What we are seeing above could be remnants of adapter/barcodes. You could use something like Remove Soft Clipped Bases to remove those softclipped bases.
What is your aim in this experiment?
Roughly speaking, the aim was to identify the presence of a deletion in a transcript related to the PI4KA gene, where we know the deletion is present in a patient, to support DNA analysis. In my alignment, I would have expected the reads to align at the beginning of the sequence, then encounter a gap for the deletion, and finally continue in the visualization with IGV. I'm not sure if I've been clear. I've used both Biostar from the link you sent me and Chopper to try to clean up any remaining residues, including the contaminant, but I get the same result as before, this time without clipped bases, but the alignment remains localized in the central part of the reference approximately.
Perhaps I could try extracting the unmapped reads and see on IGV? Or reads fail o unclassified?
Based on the data that is not happening. So perhaps you need to investigate why that is so. Is this direct RNA sequencing or how is the amplicon being produced?
I went back through the analysis and saw that the amplification bands weren't exactly perfect but I could see something. Maybe a primer went wrong? I don't know, we're evaluating. Sequencing was performed on the amplification product