Question

sam2tsv listing incorrect reference sequence & positions

0

Entering edit mode

3.8 years ago

Deep Chakroborty ▴ 10

Duplicate of: https://github.com/lindenb/jvarkit/issues/190

Hi can anyone help me resolve the issue I'm having with sam2tsv. It is a nifty piece of software but I have been encountering issues with it regarding the numbering of nucleotides it shows for the reference sequence.

Here's what sam2tsv tells me:

output of sam2tsv

The nucleotide string marked in red CTGGCCGAGCTAG is the read reporting the mutation T>A (line #469). But the reference sequence listed by sam2tsv (green box) doesn't match the read sequence at all. In fact it is correct sequence from the reference fasta, but it is right-shifted by 16 bases. With other reference files, this number is different, e.g. 20.

In fact, if I search the sequence ATGGAGACCCGCT in my reference sequence it spans 356-368. In contrast sam2tsv lists these residues in the range 339-352 (as seen in the green box). Off by exactly

searching reference

To summarize: 1) The position listed in the green box (reference), corresponds to the sequence in the red box (read). 2) The reference sequence listed in the green box, is off by 16 bases.

Information: PacBio HiFi CCS reads aligned using pbmm2 to custom reference (cDNA).

Files:

Reference.fa.

relevant sam2tsv output

sam containing the read in question

Command used:

java -jar '~/Git/Others_Cloned/jvarkit/dist/sam2tsv.jar' -R ../../../reference/pBG-ERBB2/ERBB2.fa 1.ERBB2_Library.bam

sam2tsv SAM sequence alignment • 1.8k views

ADD COMMENT • link 3.8 years ago by Deep Chakroborty ▴ 10

2

Entering edit mode

reported at https://github.com/lindenb/jvarkit/issues/190

ADD REPLY • link 3.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

you should post the BAM file (or the section of it that contains your alignment) that's how the issue can be better investigated

ADD REPLY • link 3.8 years ago by Istvan Albert 102k

0

Entering edit mode

Here is the sam corresponding to this read. As these are pacbio reads spanning the entire cDNA, on using samtools view <region> I still get the entire file.

I can give more reads as "control", if required.

Link to sam file also in the post.

ADD REPLY • link 3.8 years ago by Deep Chakroborty ▴ 10

0

Entering edit mode

please use samtools view -h the read is missing the SAM header.

ADD REPLY • link 3.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I'm really sorry Pierre! I sent the output of grep instead of the sam file with headers. Here is the sam file with first 4 reads. All have the same problem Read #1 is the one for which I attached the screenshots about.

ADD REPLY • link 3.8 years ago by Deep Chakroborty ▴ 10

score 2 · Accepted Answer · 2021-10-13

The problem is Fixed. It was unrelated to sam2tsv. It was due to a faulty dictionary built with picard using older version of java (in my case openjdk8).

Steps taken to mitigate the issue:

Install JDK11 and add it to system path as explained in this issue for jvarkit.
Reboot
Copy the reference.fa in a new place.
CreateDictionary (with picard) and index (with samtools faidx) the reference.fa
Clone the jvarkit repo again (mostly to clean the /dist) and re-build the tool with ./gradlew sam2tsv
Run the analysis again. It worked...

Curiously I tried with the "old" reference-dictionary-index triad... And got the same incorrect results. Then, I re-created the dictionary for the "old" reference with picard and now it works exactly as expected!

Concluded root cause: The dictionary built (with picard) using the older version of java was the culprit.