Question

How to extract reads with a known variant form a bam file

0

Entering edit mode

7.0 years ago

kumajis • 0

I want to extract reads with a known variant form a bam file, I have tried like below:

$samtools -q 30 view  <in bam> <position> | grep -i <pattern sequence(like AATTCG)> # reference pattern is AA[C]TCG

However maybe it would be some problem like if a SNP in the pattern or recurrent pattern in reads

So if any one know some better way to do the job, maybe mpileup function could help but I don not know how to do it.

Thanks

SNP BAM • 7.7k views

ADD COMMENT • link updated 7.0 years ago by dariober 15k • written 7.0 years ago by kumajis • 0

0

Entering edit mode

see also: how to extract allele specific reads from bamfiles?

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

score 6 · Answer 1 · 2017-11-16

6

Entering edit mode

7.0 years ago

Pierre Lindenbaum 164k

using http://lindenb.github.io/jvarkit/SamJdk.html

with the following script (searching for a base 'T' on 'rotavirus' at 1044.

final String contig= "rotavirus";
final int mutpos = 1044;
final char mutbase='T';
if(record.getReadUnmappedFlag()) return false;
if(!record.getContig().equals(contig)) return false;
if(record.getEnd() < mutpos) return false;
if(record.getStart() > mutpos) return false;
int readpos = record.getReadPositionAtReferencePosition(mutpos);
if(readpos<1) return false;
readpos--;
final byte[]    bases= record.getReadBases();
if(bases[readpos]==mutbase) return true;
return false;

usage:

 java -jar dist/samjdk.jar -f script.js input.bam

output:

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:rotavirus    LN:1074
@RG ID:S2   SM:S2
@PG ID:bwa  PN:bwa  VN:0.7.12-r1044 CL:../bwa/bwa mem -R @RG\tID:S2\tSM:S2 ref.fa S2_01_R1.fq.gz S2_01_R2.fq.gz
@PG ID:bwa.1    PN:bwa  VN:0.7.12-r1044 CL:../bwa/bwa mem -R @RG\tID:S2\tSM:S2 ref.fa S2_02_R1.fq.gz S2_02_R2.fq.gz
@PG ID:bwa.2    PN:bwa  VN:0.7.12-r1044 CL:../bwa/bwa mem -R @RG\tID:S2\tSM:S2 ref.fa S2_03_R1.fq.gz S2_03_R2.fq.gz
rotavirus_598_1046_7:0:0_4:0:0_7    83  rotavirus   977 60  70M =   598 -449    TATAAATATTTACCATCTACACATGACCCGCTATGAGCACAATAGTTAAAAGCTAACACTGTCAAAATCC  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  MD:Z:2A7A18T37A2    RG:Z:S2 NM:i:4  AS:i:54 XS:i:0
rotavirus_486_1047_6:0:0_1:0:0_1cc  83  rotavirus   978 60  70M =   486 -562    AAAAATATTAACCATCTACACATGACCCTCTATGAGCACAATAGTTAAAAGCTAACACTGTCAAAATCCT  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  MD:Z:66A3   RG:Z:S2 NM:i:1  AS:i:66 XS:i:0
rotavirus_492_1049_13:0:0_3:0:0_309 147 rotavirus   984 60  4S66M   =   494 -556    TAAGATTAACCATCTACACATGACCCTCTATGAGCACAATAGTTAAAAGCTAACACTGTCAAAATCCTAA  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  MD:Z:60A5   RG:Z:S2 NM:i:1  AS:i:61 XS:i:0
rotavirus_549_1057_2:0:0_7:0:0_27c  147 rotavirus   988 60  68M2S   =   549 -507    ACCATCTACACAGGACACTCTATGAGCACAATACTTTAAAGCTAACTCTGTCAAAATCCTAAATGGCTTT  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++  MD:Z:12T3C16G2A9A9A11   RG:Z:S2 NM:i:6  AS:i:38 XS:i:0

ADD COMMENT • link 7.0 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Hi Pierre,

Thank you for the script. It has helped me a lot to select reads that contain a substitution mutation in my cDNA sequencing data. However, I would like to do the same thing for selecting reads that have indel mutations (both 1 bp and multiple bp), and splice site mutations. The splice site mutations are located intronically, and thus not always present in my reads, so a selection for these would have to be based on splicing (such as exon skipping). Do you know a way to do this using your program?

Thank you!

ADD REPLY • link 5.0 years ago by lena.duchateau ▴ 10

1

Entering edit mode

Hi Pierre, Thank you very much for making this tool. Every time I need a not-so-standard tool there is one piece of software in jvarkit that does specifically what I want. This time I am interested in selecting reads supporting one allele, but I'd like to know how to select several positions specifying one base each time. I've tried this script.js, based on a simple modification of your script.js:

final String contig = "scaffold_3";
final int[] mutpos = {9379104, 9379106, 9379131, 9384562, 9414473, 9416683, 9416773, 9417122, 9417182, 9417260};
final char[] mutbase = {'T', 'G', 'A', 'T', 'T', 'A', 'A', 'G', 'C', 'C' };
int arrayLength = mutbase.length;
if(record.getReadUnmappedFlag()) return false;
if(!record.getContig().equals(contig)) return false;
for (int i = 0; i < arrayLength; i++){
  if(record.getEnd() < mutpos[i]) return false;
  if(record.getStart() > mutpos[i]) return false;
  int readpos = record.getReadPositionAtReferencePosition(mutpos[i]);
  if(readpos<1) return false;
  readpos--;
  final byte[]    bases= record.getReadBases();
  if(bases[readpos]==mutbase[i]) return true;
  return false;
}
return false;

However, this ends up giving me less entries than using a single position.

Please note that this is my very first attempt to "code" anything in java. This does run, which I did not expect, but the results are not what I want. Is there a way to ask samjdk to return a read whenever a mutpos matches its corresponding mutbase?. I don't care if I got duplicated reads, I can sort and uniq them later (I'm testing some hundreds of SNPs, so the sort should be rather fast).

Of course a different option would be to run this once per SNP in a loop (that is, generate a script.js per SNP), but I'd rather run all at once.

Thank you again for writing jvarkit.

EDIT: Apparently the issue came from the return statement, that ends the for loop and finishes the script. I've modified the script and it works, I'm just leaving it here for anyone who comes in the future (maybe even myself!)

final String contig = "Your_Contig";
final int[] mutpos = {Pos1, Pos2, Pos3, PosN};
final char[] mutbase = {'BaseForPos1', 'BaseForPos2', 'BaseForPos3', 'BaseForPosN' };
int arrayLength = mutbase.length;
if(record.getReadUnmappedFlag())  return false;
if(!record.getContig().equals(contig)) return false;
ArrayList<htsjdk.samtools.SAMRecord> ret = new ArrayList<htsjdk.samtools.SAMRecord>();
for (int i=0; i < arrayLength; i++){
  if(record.getEnd() < mutpos[i]) continue;
  if(record.getStart() > mutpos[i]) continue;
  int readpos = record.getReadPositionAtReferencePosition(mutpos[i]);
  if(readpos<1) continue;
  readpos--;
  final byte[]    bases= record.getReadBases();
  if(bases[readpos]==mutbase[i]) {
    ret.add(record);
  }
  continue;
}
return ret;

I hope this is useful for someone else.

ADD REPLY • link 4.3 years ago by Carles Borreda ▴ 10

0

Entering edit mode

Hello Pierre,

Your script is very useful in my research. And is there a way to dissect with multiple-base pattern in a specified region from BAM file? (Like extract alignment with chr19:100000-100003 as ATCG in reverse strand). Another question is I noticed that in the manual of samjdk there is a specification: --pair [20171110] PAIR-MODE .The signature of java function is public Object apply(final List<SAMRecord> records). This function must return true to accept the whole list, false to reject eveything, or another List<SAMRecord>.Input MUST be sorted on query name using picard SortSam (not samtools sort https://github.com/samtools/hts-specs/issues/5 ). Does that mean when I using pairend BAM files as input, I should use picard to sort them first?

Thank you!

ADD REPLY • link 5.7 years ago by Yijun Tian ▴ 20

0

Entering edit mode

And is there a way to dissect with multiple-base pattern in a specified region from BAM file?

yes, test all the positions

Does that mean when I using pairend BAM files as input, I should use picard to sort them first?

yes

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre,

I have tried SortSam my BAM file by queryname with :

java -jar ~/picard/build/libs/picard.jar SortSam I=22RV1-DHT_CTCF.bam O=22RV1-DHT_CTCF.psort.bam SORT_ORDER=queryname

And when add --pair to the script you provide in samjdk and run as :

 java -jar dist/samjdk.jar --pair -f hg38-rs12611084-A.js *.psort.bam > output.bam

The program warning:

[WARN][InMemoryCompiler]/SamJdkCustom1838586069.java:18: error: cannot find symbol if(record.getReadUnmappedFlag()) return false; ^symbol: variable record location: class SamJdkCustom1838586069 /SamJdkCustom1838586069.java:19: error: cannot find symbol if(!record.getContig().equals(contig)) return false; ^ symbol: variable record location: class SamJdkCustom1838586069 /SamJdkCustom1838586069.java:20: error: cannot find symbol if(record.getEnd() < mutpos) return false; ^ symbol: variable record location: class SamJdkCustom1838586069 /SamJdkCustom1838586069.java:21: error: cannot find symbol if(record.getStart() > mutpos) return false; ^ symbol: variable record location: class SamJdkCustom1838586069 /SamJdkCustom1838586069.java:22: error: cannot find symbol int readpos = record.getReadPositionAtReferencePosition(mutpos); ^ symbol: variable record location: class SamJdkCustom1838586069 /SamJdkCustom1838586069.java:25: error: cannot find symbol final byte[] bases= record.getReadBases(); ^ symbol: variable record location: class SamJdkCustom1838586069 6 errors [SEVERE][InMemoryCompiler]compile.call() failed for custom class SamJdkCustom1838586069 java.lang.RuntimeException: compile.call() failed for custom class SamJdkCustom1838586069 at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:230) at com.github.lindenb.jvarkit.tools.samjs.SamJdk.doWork(SamJdk.java:566) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1145) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1303) at com.github.lindenb.jvarkit.tools.samjs.SamJdk.main(SamJdk.java:785) [SEVERE][SamJdk]Cannot compile custom class SamJdkCustom1838586069 java.lang.RuntimeException: Cannot compile custom class SamJdkCustom1838586069 at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:234) at com.github.lindenb.jvarkit.tools.samjs.SamJdk.doWork(SamJdk.java:566) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1145) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1303) at com.github.lindenb.jvarkit.tools.samjs.SamJdk.main(SamJdk.java:785) Caused by: java.lang.RuntimeException: compile.call() failed for custom class SamJdkCustom1838586069 at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:230) ... 4 more [INFO][Launcher]samjdk Exited with failure (-1)

I am sorry that I know very little about JAVA. Can you help me modify above script to let it run with pairend alignment?

Thank you!

ADD REPLY • link 5.7 years ago by Yijun Tian ▴ 20

1

Entering edit mode

1) you're not using the latest version of the tool

2) in pair mode, the signature of the function that's need to be implemented contains a variable named records not record.

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre,

Based on your answer. I've tried my best to write below script. Do you think it's right to using it for extracting the alignment with SNPx?

final String contig= "chr17";
final int mutpos = 38256322;
final char mutbase='T';
return records.stream().anyMatch(R->R.getReadUnmappedFlag() == false && R.getContig()== contig && R.getEnd() >= mutpos&& R.getStart() <= mutpos && R.getReadBases()[R.getReadPositionAtReferencePosition(mutpos)]==mutbase);

Thank you!

I just realized that I can simply run the original script for single end data and extract the read pairs from the original BAM file by grep the reads name. That feels more comfortable. Thank you!

Yijun

ADD REPLY • link 5.7 years ago by Yijun Tian ▴ 20

score 0 · Answer 2 · 2017-11-16

The mpileup function is the best tool for this kind of job..try using the latest version though (1.6) as they have a new switch (qname) that allows you to get the read names of the piled up reads as well..the usage is pretty simple, many examples are present on the web..you can Google it out.. otherwise, the main parameters to consider is base quality, mapping quality, depth at the particular position and also the region, which is one base in your case..all these options will be present in the mpileup helpfile itself.

score 0 · Answer 3 · 2017-11-16

0

Entering edit mode

7.0 years ago

chen ★ 2.5k

You can try MutScan, input the original FASTQ file and a VCF file that contains your variants.

ADD COMMENT • link 7.0 years ago by chen ★ 2.5k

score 0 · Answer 4 · 2017-11-16

0

Entering edit mode

7.0 years ago

cpad0112 21k

search for sequence motifs in bam: qmotif (https://sourceforge.net/p/adamajava/wiki/qMotif%201.0/) and output is bam with matching reads and xml file with matching information.

ADD COMMENT • link 7.0 years ago by cpad0112 21k

score 0 · Answer 5 · 2017-11-16

One more answer... If you are interested in visually inspecting these reads, the program I've written, ASCIIGenome, has a command, filterVariantReads, to filter for reads that are variant at a given position or interval.

So, first load genome and data:

ASCIIGenome -fa genome.fa aln.bam

Go to region of interest and filter for variant reads, say your SNP is at position chr7:150

goto chr7:100
filterVariantReads -r 150

You can then save the selected reads to file with the print command.