Getting reads that end at a digest site from a sam/bam file
2
0
Entering edit mode
7.8 years ago
Jautis ▴ 580

Hi, I have bisulfite converted sequencing reads and, from my bam file, would like to select reads that end at a Msp1 digest site (C^CGG). The way I would think to go about doing that is making a bed file of CCGG sites in the genome (from the fasta file) and then selecting reads from the bam file that end at those sites. However, I'm not sure how to do either part of this.

Can anyone help me out or do you have better ideas? Thanks!

bisulfite digest • 1.8k views
ADD COMMENT
0
Entering edit mode
7.8 years ago

You presumably want to create a BED (not BAM) file of the cut sites and find reads that overlap. Of course you'll get reads that span over a site, but that'll be a minor portion.

ADD COMMENT
0
Entering edit mode

Yes, that would be correct- I mistyped in the initial question and have edited it.

How would I go about constructing that bed file? As for getting the reads that span that site, would I do that by reformatting the bam file into a bed-like format (adding the length of the read to the inital site so I check the end) or is there a tool I could use on the bam file itself?

ADD REPLY
0
Entering edit mode

For generating the BED file, you can use biopython, which can determine cut sites in DNA.

For finding the reads, just use samtools with the BED file and the BAM file. If you want anything more exact, then write a little script in pysam, though that'd be largely overkill.

ADD REPLY
0
Entering edit mode

I don't think samtools does what I want it to here. I want reads that end at an Msp1 digest site, not those that include a digest site anywhere within the read.

Using biopython works perfectly for the first part, thank you!

ADD REPLY
0
Entering edit mode

I know, but you'll probably find that 99% of the overlapping alignments end on a cut site, so you can save yourself the added time and ignore the 1% extra. If the extra precision is important then you can code something with pysam.

ADD REPLY
0
Entering edit mode
7.8 years ago

using awk

 curl -s "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20151021_3.5kb_jumping/SRS000213.bwabacktrack_GRCh38Primary.20151001.YRI.jumping_library.bam" | samtools view -h | awk -F '\t' '($0 ~ /^@/ || $10 ~ /CCGG$/)'

Using samjs

curl -s "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20151021_3.5kb_jumping/SRS000213.bwabacktrack_GRCh38Primary.20151001.YRI.jumping_library.bam" | java -jar ~/src/jvarkit-git/dist/samjs.jar -e 'record.getReadString().endsWith("CCGG")'
(...)
SL-HAC:C7FRWANXX150920:C7FRWANXX:5:2211:11368:69901 161 1   11741   0   25M 12  129355703   0   ACTGATGATTTTGCTGCATGGCCGG   FF
ADD COMMENT
0
Entering edit mode

Hi Pierre, because the library was digested I don't expect the reads to actually end with the full Msp1 cut site and would like to retain a read that only had the C or CCG. Additionally, the library is bisulfite converted, so just looking for CCGG at the ends of a read won't work. Rather, I believe I need to (i) generate a list of CCGG sites and (ii) select for reads that end within the cut site, but I'm not sure how to do either task

ADD REPLY

Login before adding your answer.

Traffic: 2235 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6