Extracting nucleotide immediately prior to mapped read

1

9.6 years ago

graeme.thorn ▴ 110

Hi,

I have paired-end RNA-seq reads mapped to a genome (in a BAM file), and I want to extract the genomic nucleotide immediately before each forward-strand alignment, essentially to test whether an RNase digest has worked correctly.

Because the data are in BAM format, extracting each read's position (chromosome and location) is straightforward: run samtools view and parse the relevant fields from each output line. However, I was wondering whether there is a quicker way than brute force (i.e. extracting the positions, sorting them, then reading from the genome FASTA file).

RNA-Seq sequence • 3.4k views


1

Your time spent properly processing the SAM format will probably outweigh indexed FASTA access overhead.

written 9.6 years ago by Matt Shirley 10k

4

9.6 years ago

Pierre Lindenbaum 166k

Using my tools samslop and bioalcidae:

$ java -jar dist/samslop.jar -c -m 1 -M 1 -r ref.fa  S1.bam |\
java -jar dist/bioalcidae.jar -F SAM -e  'while(iter.hasNext()) {var read=iter.next(); if(read.getReadUnmappedFlag()) continue; out.println(read.getReferenceName()+":"+read.getAlignmentStart()+"-"+read.getAlignmentEnd()+" "+read.getReadString().substr(0,1)+"/"+read.getReadString().substr(read.getReadLength()-1));}' | uniq | head

rotavirus:1-71 G/A
rotavirus:1-72 G/A
rotavirus:1-53 G/A
rotavirus:1-72 G/A
rotavirus:1-65 G/A
rotavirus:1-68 G/A
rotavirus:2-73 G/T
rotavirus:3-74 C/A
rotavirus:3-69 C/T
rotavirus:3-74 C/A


2

9.6 years ago

Brian Bushnell 20k

I think the easiest approach would be to just add an "N" to both ends of the reads before you map them. Then you can see the reference base in the MD tag.
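The padding idea can be sketched in a few lines of Python. This is only an illustration: `pad_record` and `first_ref_base_from_md` are hypothetical helper names, the padded bases are given the lowest Phred+33 quality (`!`), and it assumes the aligner aligns reads end to end rather than soft-clipping the padded bases, in which case the MD tag would start with `0` followed by the reference base under the leading "N".

```python
import re


def pad_record(name, seq, qual, pad_qual="!"):
    """Add an N (with a low-quality score) to both ends of one FASTQ record."""
    return name, "N" + seq + "N", pad_qual + qual + pad_qual


def first_ref_base_from_md(md):
    """Return the reference base under the first read base if the MD tag
    reports a mismatch there (e.g. '0G70' -> 'G'), otherwise None."""
    match = re.match(r"0([A-Z])", md)
    return match.group(1) if match else None


# Hypothetical record and MD value, for illustration only.
print(pad_record("read1", "ACGT", "IIII"))
print(first_ref_base_from_md("0G70"))
```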


2

9.6 years ago

Matt Shirley 10k

Here is some skeleton code that can help you.

	from simplesam import Reader
	from pyfaidx import Fasta

	with Reader(open('library.bam', 'r')) as sam_file, Fasta('hg38.fa', as_raw=True) as hg38:
	    for read in sam_file:
	        if read.mapped:
	            # might also want to handle read.reverse here
	            prior_pos = read.pos - 2  # read.pos is 1-based; the FASTA index is 0-based
	            if prior_pos < 0:
	                continue  # read starts at the first base, so nothing precedes it
	            prior_base = hg38[read.rname][prior_pos]

Depending on the question you are asking, you might find the Counter in the collections module useful for summarizing the results.
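As a rough sketch of that summary step, the preceding bases could be tallied like this. The function name `summarize_prior_bases` and the example input are made up for illustration; in practice the list would be filled with the bases collected while looping over the reads.

```python
from collections import Counter


def summarize_prior_bases(bases):
    """Tally preceding bases (case-insensitively) and return
    (base, count, fraction) tuples, most common first."""
    counts = Counter(base.upper() for base in bases)
    total = sum(counts.values())
    return [(base, n, n / total) for base, n in counts.most_common()]


# Hypothetical input: one preceding base per mapped read.
example = ["G", "G", "g", "C", "G", "A"]
for base, n, fraction in summarize_prior_bases(example):
    print(f"{base}\t{n}\t{fraction:.2f}")
```

If the digest worked, one would expect the tally to be dominated by the base(s) the RNase cuts after.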


0

9.6 years ago

Sean Davis 27k

At least R, Python, and Perl have indexed FASTA capabilities that let you read a "slice" from a FASTA file or files. You could also look at using the Biostrings package in Bioconductor. Finally, if you have the UCSC Kent tools, you can convert your FASTA to .2bit format and then use those tools to get bases from the genome very quickly.
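To show why indexed access is cheap, here is a minimal pure-Python sketch of how a single base can be read through a `.fai` index (the tab-separated file written by `samtools faidx`, with columns name, length, offset, bases per line, bytes per line). The function names are hypothetical, and real libraries such as pyfaidx or Biostrings handle more edge cases; this only illustrates the seek arithmetic.

```python
def load_fai(fai_path):
    """Read a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
            index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def fai_byte_offset(pos0, offset, linebases, linewidth):
    """Byte offset in the FASTA file of the 0-based sequence position pos0."""
    return offset + (pos0 // linebases) * linewidth + (pos0 % linebases)


def fetch_base(fasta_path, index, chrom, pos1):
    """Return the base at 1-based position pos1, or None if it is out of range."""
    length, offset, linebases, linewidth = index[chrom]
    if pos1 < 1 or pos1 > length:
        return None
    with open(fasta_path, "rb") as handle:
        handle.seek(fai_byte_offset(pos1 - 1, offset, linebases, linewidth))
        return handle.read(1).decode("ascii")
```

For a read whose SAM POS is `p`, the preceding base would be `fetch_base(fasta, index, chrom, p - 1)`, which returns None when the read starts at the first base of the sequence.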

