Question

How can I extract part of the DNA sequences that align to a specific region in the reference from a .bam file?

0

Entering edit mode

3.5 years ago

LDT ▴ 340

Dear all,

I have a bam file from which I want to extract the exact part of DNA sequences that map within the following region 54-78 to my reference .

I have tried so far unsuccessfully bedtools, samtools, custom scripts and I was wondering if there is a quick way to do it with RSamtools or any other tool in R.

samtools command:

samtools faidx Reference_barcodes.fasta "Reference_barcodes:54-78" | grep -v '^>' | tr -d '\n' && echo " REF" && java -Xmx200g -jar /data/biosoftware/sam4weblogo/jvarkit/dist/sam4weblogo.jar  --interval "Reference_barcodes:54-78" -F tabular test_4_seqs_sorted.bam

bedmap --echo --fraction-map 1 <(bam2bed < test_4_seqs_sorted.bam) <(echo -e "Reference_barcodes\t54\t78") > answer.bed

Here is a link to my data which is tiny, only 4 sequences, https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=hAa0W3 and this is the desired result that I want to take out

TACGGTTATATTGACAGACCGAGGG
TAATTCGAAACCATAAGGCTCGCAA
TACGGTTATATTGACAGACCGAGGG
TAATTCGAAACCATAAGGCTCGCAA

I am at a desperate point so any serious help and guidance is more than appreciated and I am happy to tip for it if possibly though "Buy me a coffee" or patreon format

bedtools alignment samtools bam sequences • 1.9k views

ADD COMMENT • link 3.4 years ago by LDT ▴ 340

1

Entering edit mode

What you are trying to do is not trivial, and I don't think you'll find a simple command to do that (but maybe I'm wrong). If I had to do that, I would write my own script using some library for parsing bam files. You'll need to consider the alignment start position and the CIGAR string, so you can extract matches within the required range. Sorry this is not really an answer, but more of a general direction. Good luck.

ADD REPLY • link 3.5 years ago by liorglic ★ 1.5k

0

Entering edit mode

3.5 years ago

LauferVA 4.8k

Ms., or Mr., LDT - please see: https://pangenome.github.io/

VCF files separate genetic variants from the surrounding sequence in which they are found.

However, these are not the only way to represent and view genomes. Genome graphs ( <---google this) do not do this. It is not a direct answer to your question, but I am guessing that you will love these as much as I do. If, after looking them over, you decide they don't help, let me know and I will addend this answer.

ADD COMMENT • link 3.5 years ago by LauferVA 4.8k

0

Entering edit mode

Thank you Vincet! I had a look but I still have not found a solution. Thank you for your time

ADD REPLY • link 3.4 years ago by LDT ▴ 340

0

Entering edit mode

OK. ill write one.

what genome build ? 1-based or 0-based?

ADD REPLY • link 3.4 years ago by LauferVA 4.8k

0

Entering edit mode

Dear Vincent, sorry for my late reply! Forgive me for my ignorance what does genome build mean? 0 or 1

I did a google but I did not get it. Here is an example of my bam file https://1drv.ms/u/s!ArOzUuixE-mggfskZPX_Y0cpyQH4RA?e=hAa0W3 which comes after mapping amplicons to a reference. The organism I am working with is bacteria

ADD REPLY • link 3.4 years ago by LDT ▴ 340

score 2 · Accepted Answer · 2022-03-10

2

Entering edit mode

3.4 years ago

LDT ▴ 340

After so long time and many battles I found a solution that works for me Thanks to the Bioconductor people and especially @hpages

library(GenomicAlignments)
stack <- stackStringsFromBam("test_4_seqs_sorted.bam", param=GRanges("Reference_barcodes:54-78"))

ADD COMMENT • link 3.4 years ago by LDT ▴ 340