Extracting nucleotide immediately prior to mapped read
4
1
Entering edit mode
8.9 years ago
graeme.thorn ▴ 100

Hi,

I have mapped (paired-end) RNA-seq reads (so BAM file information) and I want to extract the nucleotide in the genome immediately prior to the (forward) mapping (essentially to test whether a RNAse digest has worked correctly).

As it is in BAM format, then extracting the position of the read (chromosome and location) is straightforward (just samtools view it, and process the line output for the relevant field), but I was wondering if there was a quicker way than brute-force (i.e. extracting, sorting, then reading from the genome FASTA file).

RNA-Seq sequence • 2.7k views
ADD COMMENT
1
Entering edit mode

Your time spent properly processing the SAM format will probably outweigh indexed FASTA access overhead.

ADD REPLY
4
Entering edit mode
8.9 years ago

Using my tool samslop and bioalcidae:

$ java -jar dist/samslop.jar -c -m 1 -M 1 -r ref.fa  S1.bam |\
java -jar dist/bioalcidae.jar -F SAM -e  'while(iter.hasNext()) {var read=iter.next(); if(read.getReadUnmappedFlag()) continue; out.println(read.getReferenceName()+":"+read.getAlignmentStart()+"-"+read.getAlignmentEnd()+" "+read.getReadString().substr(0,1)+"/"+read.getReadString().substr(read.getReadLength()-1));}' | uniq | head

rotavirus:1-71 G/A
rotavirus:1-72 G/A
rotavirus:1-53 G/A
rotavirus:1-72 G/A
rotavirus:1-65 G/A
rotavirus:1-68 G/A
rotavirus:2-73 G/T
rotavirus:3-74 C/A
rotavirus:3-69 C/T
rotavirus:3-74 C/A
ADD COMMENT
2
Entering edit mode
8.9 years ago

I think the easiest approach would be to just add an "N" to both ends of the reads before you map them. Then you can see the reference base in the MD tag.

ADD COMMENT
2
Entering edit mode
8.9 years ago

Here is some skeleton code that can help you.

Depending on the question you are asking, you might find the Counter in the collections module useful for summarizing the results.

ADD COMMENT
0
Entering edit mode
8.9 years ago

At least R, python, and perl have indexed fasta capabilities that allow you to read a "slice" from a fasta file or files. You could also look at using the Biostrings package in Bioconductor. Finally, if you have the UCSC kent tools, you can convert your fasta to .2bit format and then use the kent tools to very quickly get bases from the genome.

ADD COMMENT

Login before adding your answer.

Traffic: 2168 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6