How to find the unique list of mismatches of reads with the reference genome
2
0
Entering edit mode
7.2 years ago
Sama ▴ 60

Hi,

I have Illumina reads and a reference genome. I aligned reads to the genome and I have .bam and .sam files. I want the list of the mismatches of the reads with the reference genome but if one mismatch happened in multiple reads, I want it to be reported only once. For example if two reads have the mismatch of A->G on position 1000 of the reference genome, I want something like this: G,1000,2

I want the output file to be something like this:

Mismatch,Position,Frequency
G,1000,2
A,3278,7
G,78732,5
T,89783,8
C,87494,8
A,13732,5
...

Is there any tool available for doing that?

next-gen alignment genome • 5.4k views
ADD COMMENT
0
Entering edit mode

Hi Pierre Lindenbaum,

I read all your documentation and all other posts on biostars! sam2tsv's a nice tool and serves my need. However, I have trouble understanding what each alphabet stands for in Read-Qual column and one important question is is that what does 'M' exactly mean? Match or a mismatch? and if it represents Mismatches, then the first line in the output has T's in both ref and alt. Shouldn't it be a match? I apologize if my questions are too naive. I thinking I'm missing out on a detail here. I would greatly appreciate your help on this.

$ java -jar dist/sam2tsv.jar -A  \
    -r samtools-0.1.18/examples/toy.fa 
      samtools-0.1.18/examples/toy.sam
r001    163 ref 0   T   .   7   T   M
r001    163 ref 1   T   .   8   T   M
r001    163 ref 2   A   .   9   A   M

Thanks in advance!

ADD REPLY
0
Entering edit mode

You should use ADD COMMENT/ADD REPLY to add this comment under @Pierre's answer.

ADD REPLY
3
Entering edit mode
7.2 years ago
pfs ▴ 280

It looks like you need to implement a variant calling step and then do some data parsing to get your desired out put format. See http://www.htslib.org/workflow/

ADD COMMENT
3
Entering edit mode
7.2 years ago

using sam2tsv: http://lindenb.github.io/jvarkit/Sam2Tsv.html


   $ java -jar dist/sam2tsv.jar -R ref.fa in.bam | awk '$8!=$5 && ($9=="M" || $9=="X")' | cut -f 3,4,5 | sort | uniq -c


      2 ref2    0   A
      1 ref2    0   C
      1 ref2    0   G
      2 ref2    0   T
      3 ref2    10  A
      2 ref2    10  T
      3 ref2    11  A
      1 ref2    11  C
      1 ref2    11  G
      1 ref2    12  A
      2 ref2    12  C
      2 ref2    12  T
      4 ref2    13  A
      2 ref2    13  C
      3 ref2    14  A
      1 ref2    14  C
      1 ref2    14  G
      1 ref2    14  T
      5 ref2    15  A
      1 ref2    15  T
      1 ref2    16  A
      1 ref2    16  C
      2 ref2    16  G
      2 ref2    16  T
      4 ref2    17  A
      1 ref2    17  C
      1 ref2    17  T
      4 ref2    18  A
      2 ref2    18  G
      3 ref2    19  A
      1 ref2    19  C
      1 ref2    19  G
      1 ref2    19  T
ADD COMMENT
0
Entering edit mode

Thank you for your reply. I have a question. What are first and third columns?

ADD REPLY
1
Entering edit mode

in the final output ? occurence/chromosome/position/alt

ADD REPLY
0
Entering edit mode

This is 5 first lines of running your command on my data:

3242056 Escherichia 0   A
3435134 Escherichia 0   C
3169976 Escherichia 0   G
 85 Escherichia 0   N
2952800 Escherichia 0   T

It is weird if the frequency of A in first position is 3242056.

Also the third column numbers range is 0-100 but the reference genome is much longer than 100.

ADD REPLY
0
Entering edit mode

Also the third column numbers range is 0-100 but the reference genome is much longer than 100.

I wonder if the offsets of the columns in my 'cut' are the correct one, may be I've used the position in the read. Can you please check (i'm away from my tools) with

 java -jar dist/sam2tsv.jar -R ref.fa in.bam | more
ADD REPLY
0
Entering edit mode

Hi Pierre,

I ran into the following error, which seems to be caused by java version.

$ module load java/jdk1.7.0_25 $ java -jar ./dist/sam2tsv.jar -R ref.fa chr1.bam Exception in thread "main" java.lang.UnsupportedClassVersionError: com/github/lindenb/jvarkit/tools/sam2tsv/Sam2Tsv : Unsupported major.minor version 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:792) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

$ module unload java/jdk1.7.0_25 $ module load java/jdk1.8.0_60 $ java -jar ./dist/sam2tsv.jar -R ref.fa chr1.bam [SEVERE][Sam2Tsv]scan error: java.lang.ArrayIndexOutOfBoundsException: 0 at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.printAln(Sam2Tsv.java:326) at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.scan(Sam2Tsv.java:468) at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.doWork(Sam2Tsv.java:504) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1145) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1303) at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.main(Sam2Tsv.java:528) [SEVERE][Sam2Tsv]0 java.lang.RuntimeException: 0 at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.scan(Sam2Tsv.java:476) at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.doWork(Sam2Tsv.java:504) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1145) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1303) at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.main(Sam2Tsv.java:528) Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.printAln(Sam2Tsv.java:326) at com.github.lindenb.jvarkit.tools.sam2tsv.Sam2Tsv.scan(Sam2Tsv.java:468) ... 4 more

READ_NAME FLAG CHROM READ_POS BASE QUAL REF_POS REF OP

Can you tell me how to fix it?

Thanks a lot in advance.

ADD REPLY
0
Entering edit mode

can you please try to validate your bam with picard: https://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile

otherwise, submit a new issue with a minimal failing bam : https://github.com/lindenb/jvarkit/issues/

ADD REPLY
1
Entering edit mode

Hi Pierre Lindenbaum,

I read all your documentation and all other posts on biostars! sam2tsv's a nice tool and serves my need. However, I have trouble understanding what each alphabet stands for in Read-Qual column and one important question is is that what does 'M' exactly mean? Match or a mismatch? and if it represents Mismatches, then the first line in the output has T's in both ref and alt. Shouldn't it be a match? I apologize if my questions are too naive. I thinking I'm missing out on a detail here. I would greatly appreciate your help on this.

Thanks!

ADD REPLY
0
Entering edit mode

according to ValidateSamFile, the following errors were detected:

' A record is missing a read group'
'ERROR: Read groups is empty'
'Maximum output of [100] errors reached.'

However, i found when i truncated bam/sam with the interval i am interested in, i.e., chr1:101-103, and loaded java/jdk-10.0.1, it works well.

this is a little wired.

ADD REPLY

Login before adding your answer.

Traffic: 1834 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6