Duplicate of: https://github.com/lindenb/jvarkit/issues/190
Hi can anyone help me resolve the issue I'm having with sam2tsv. It is a nifty piece of software but I have been encountering issues with it regarding the numbering of nucleotides it shows for the reference sequence.
Here's what sam2tsv tells me:
The nucleotide string marked in red CTGGCCGAGCTAG
is the read reporting the mutation T>A (line #469).
But the reference sequence listed by sam2tsv (green box) doesn't match the read sequence at all. In fact it is correct sequence from the reference fasta, but it is right-shifted by 16
bases. With other reference files, this number is different, e.g. 20.
In fact, if I search the sequence ATGGAGACCCGCT
in my reference sequence it spans 356-368
. In contrast sam2tsv lists these residues in the range 339-352
(as seen in the green box). Off by exactly
To summarize: 1) The position listed in the green box (reference), corresponds to the sequence in the red box (read). 2) The reference sequence listed in the green box, is off by 16 bases.
Information:
PacBio HiFi CCS reads aligned using pbmm2
to custom reference (cDNA).
Files:
sam containing the read in question
Command used:
java -jar '~/Git/Others_Cloned/jvarkit/dist/sam2tsv.jar' -R ../../../reference/pBG-ERBB2/ERBB2.fa 1.ERBB2_Library.bam
reported at https://github.com/lindenb/jvarkit/issues/190
you should post the BAM file (or the section of it that contains your alignment) that's how the issue can be better investigated
Here is the sam corresponding to this read. As these are pacbio reads spanning the entire cDNA, on using
samtools view <region>
I still get the entire file.I can give more reads as "control", if required.
Link to sam file also in the post.
please use
samtools view -h
the read is missing the SAM header.I'm really sorry Pierre! I sent the output of grep instead of the sam file with headers. Here is the sam file with first 4 reads. All have the same problem Read #1 is the one for which I attached the screenshots about.