Extract Reads From A Bam File That Fall Within A Given Region
4
32
Entering edit mode
12.4 years ago
abi ▴ 390

Hi all

This question could be considered as a follow up of this discussion. http://www.biostars.org/post/show/3407/how-to-extract-reads-from-bam-that-overlap-with-specific-regions/#3414 What I need is to extract reads from bam file that fall only within a given region (not overlap the given region), the region being in the form of a gff file or bed file. Overlapping reads could be extracted by several methods (as in the discussion mentioned or BEDTools). The idea is to try to be pretty sure of excluding 5´ UTRs in the process of detecting intergenic transcripts. I saw a tool in BamUtil (http://genome.sph.umich.edu/wiki/BamUtil) called "writeRegion" which would pretty much do what I want. Somehow could not get this running for my dataset. Was wondering if you guys might have an "R" or some other solution for this. Thanks in advance Abi

bam • 172k views
ADD COMMENT
0
Entering edit mode

Heys, is there any option where you add a fasta file containing your regions of interest??

ADD REPLY
1
Entering edit mode

No, since fasta contains no positional information.

ADD REPLY
0
Entering edit mode

Thanks! Then how can I obtain the positional information from a fasta file?

ADD REPLY
1
Entering edit mode

Map to a reference genome. The process is called alignment. Tools are e.g. bowtie2 or bwa mem. If you need more guidance then please open a new questions, describing the problem and providing details what exactly you want to do and which data you have.

ADD REPLY
35
Entering edit mode
12.4 years ago

You can extract mappings of a sam/bam file by reference and region with samtools. For example:

samtools view input.bam "Chr10:18000-45500" > output.bam

That would output all reads in Chr10 between 18000-45500 bp.

ADD COMMENT
0
Entering edit mode

Thanks Dk for your answer, in your example the output will be all reads starting anywhere between 18000-45500. What I want is to get out only the reads which also end in this region. For example starting at 18000 or higher and ending at 45500 or lower. Abi

ADD REPLY
0
Entering edit mode

That is confusing. Would you please explain little bit more, which reads do you want in output?

ADD REPLY
0
Entering edit mode

I am sorry if I am confusing you guys. Lets say I have a 4 reads with the following start and end coordinates aligned on to the chromosome.

start coordinates of the reads: 900, 1000, 1100, 1900

end coordinates of the reads: 1200, 1500,1700, 2300

My region of interest is 1000:2000

I want my output to be in this case the second and third reads starting and ending within my region of interest.

The first and last reads should be excluded because even though they overlap my region of interest, they extend away from my region (1000:2000). I hope my question is clearer now. Abi

ADD REPLY
0
Entering edit mode

do you have spliced reads as well?

ADD REPLY
0
Entering edit mode

he wants the read with overlaps "any" at the left-end and "within" at the right end. If you consider the position 18000-45000, he needs all reads with overlap of at least 1 base to 18000, but all reads should end less than or equal to 45000.

ADD REPLY
0
Entering edit mode

Yep Arun thats what I want and BTW, I do not have spliced reads!

ADD REPLY
2
Entering edit mode

abi, this should be possible with R and IRanges. For loading a bam file, refer to this post. If you have questions about using IRanges, let me know. I'll post an example.

ADD REPLY
0
Entering edit mode

hi kao , can u pls tell me how to index this bam file because wen I tried to make index it showed error like: Bam header is absent. Also can u suggest any tool r the way to merge the overlapping region in bam file and make into single fasta file.

ADD REPLY
0
Entering edit mode

Use "index" from samtools, described here.

samtools index test.bam
ADD REPLY
0
Entering edit mode

How can you randomly extract reads from two bam files to simulate sample contaminations? I am trying to simulate sample contamination for different level of dilution for NGS samples. Suppose I have two bam files for SampleA and SampleB. I want to generate 5 contaminated samples at dilution of 10%, 20%, 30%,40% and 50% of those two samples. I understand that I should extract reads from one of the two bam files at the given dilution percentage and reassign to the other bam file, but I don't know exactly how to do this. Could you please explain me the procedure? Thanks

ADD REPLY
0
Entering edit mode

After getting the reads , which falls in Chr10 between 18000-45500 bp, how can i sequence of that gene

ADD REPLY
0
Entering edit mode

Please elaborate on what, exactly, you want? You simple want the sequence for this region chr10:18000-45500?; and for which genome version?

ADD REPLY
0
Entering edit mode

yes i want the sequence of this region. i have 2 rna-sample, and in both samples i want to know the size of the transcript , which is present in chr10:18000-45500 and i have mouse genome and gff from ensembl-99

ADD REPLY
0
Entering edit mode

WARNING THIS WILL CREATE HEADERLESS FILE POSSIBLY LEADING TO DOWNSTREAM ERRORS: _e.g._

$ samtools sort -n HG002_Chr22.bam -o HG002_Chr22_sort.bam
[E::sam_parse1] no SQ lines present in the header
samtools sort: truncated file. Aborting


$ bedtools bamtofastq -i HG002_Chr22.bam -fq HG002_Chr22_R1.fq -fq2 HG002_Chr22_R2.fq
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 1
ADD REPLY
3
Entering edit mode

The solution is 8 years old. In today’s syntax it would

samtools view -o subset.bam in.bam region(...)

As -o autodetects the .bam suffix the header is automatically included. Alternatively, if you want to pipe it and -o is not to be set then use -b triggering bam output and implying -h (so header is in).

ADD REPLY
0
Entering edit mode

Hi, May I ask if the reads are paired and the two ends may come from different regions, whether this command extract reads with only one end within that region or both two ends within that region?

ADD REPLY
0
Entering edit mode

Hi Dk Could you please tell me what should I do if I have a list of ChrX:XXXXX-XXXXX index and I want gather the reads to a same output.bam? Thanks a lot.

ADD REPLY
7
Entering edit mode
7.4 years ago
KevinL ▴ 70

I think the options have changed since this reply was written if you want the default headerless sam file then use the command as per Damian Kao

samtools view input.bam "Chr10:18000-45500" > output.sam

if it's a bam output that you want you will need a '-b'

samtools view -b input.bam "Chr10:18000-45500" > output.bam
ADD COMMENT
2
Entering edit mode

Do not forget to include the -h flag whilst doing this, i.e., if you want to retain the SAM header, which will be required for many downstream applications.

-h include header in SAM output

ADD REPLY
1
Entering edit mode

Headers are automatically included when the output format is BAM since BAM requires the header to be included.

ADD REPLY
8
Entering edit mode

Without -h output.BAM has a header but IGV will not show any reads. In my case was necessary to add -h to display output.bam correctly

 samtools view -b -h input.bam "Chr10:18000-45500" > output.bam
ADD REPLY
5
Entering edit mode
12.4 years ago
abi ▴ 390

Thanks guys, Actually the answer was there in Michael´s explanation in http://www.biostars.org/post/show/3407/how-to-extract-reads-from-bam-that-overlap-with-specific-regions/#3414 In his example, countOverlaps(genes, reads, type = "within") would do what I want. If type is within, the query interval must be wholly contained within the subject interval.

Abi

ADD COMMENT
0
Entering edit mode

yes, that's right. This is what I thought you'd want to go and look up and come back with questions. Seems the usage is already shown here. Nice!

ADD REPLY
2
Entering edit mode
6.5 years ago

Via BEDOPS bedmap --fraction-map 1 to return only reads which fall entirely within intervals in intervals.bed:

$ bedmap --echo --fraction-map 1 <(bam2bed < reads.bam) intervals.bed > answer.bed

If you want to do some ad-hoc search:

$ bedmap --echo --fraction-map 1 <(bam2bed < reads.bam) <(echo -e "chrZ\t1234\t5678") > answer.bed
ADD COMMENT
0
Entering edit mode

I am trying to apply this solution to my data without any success. I want to extract reads from a bam file (alignment to 9kb reference) within position 3328-5551. I have tried second option typing JX4\t3328\t5551 whereas JX4is my reference name? Am I doing it right? Could someone please help me to understand how intervals.bed should look like and how to generate it?

ADD REPLY
0
Entering edit mode

No, you would want to use the chromosome or contig name(s) for searching, not the reference name.

If you don't know these names, here is one way to find out:

$ bam2bed < reads.bam > reads.bed
$ bedextract --list-chr reads.bed
...

The ... will be a list of chromosome or contig names. When you try an ad-hoc search for reads of interest, you would need to use that chromosome/contig name with the interval start and stop positions of interest (3328-5551).

Note that, if you convert reads to BED using the above command, you don't need to waste time converting them again. You can just pass the file directly, i.e.:

$ bedmap --echo --fraction-map 1 reads.bed intervals.bed > answer.bed
ADD REPLY
0
Entering edit mode

Thx. Alex for your response. I've tried:

bam2bed < reads.sorted.bam > reads.bed
bedextract --list-chr reads.bed

The output was JX472009. Then I used a command:

bedmap --echo --fraction-map 1 <(bam2bed < reads.sorted.bam) <(echo -e "JX472009\t3328\t5551") > answer.bed

Then I changed answer.bed to bam format. I sorted it and visualized in Tablet. As you can see I didn't mange to extract reads between 3328 - 5551. Moreover types of nucleotides are not defined. Do you know any solution for that?

tablet

ADD REPLY
0
Entering edit mode

The command you entered will return all reads which entirely cover the region JX472009:3328-5551.

In other words, a read may overhang upstream or downstream of the edges of that region.

Are you after a different answer, such as all reads contained within the bounds of JX472009:3328-5551?

You could change --fraction-map to --fraction-ref to answer that question:

bedmap --echo --fraction-ref 1 reads.bed <(echo -e "JX472009\t3328\t5551") > answer.bed

This changes the overlap requirement from the map set (the echo -e bit) to the reference set (your reads).

ADD REPLY
0
Entering edit mode

I am still with the same problem. I am doing everything according to your suggestions but in the end I am facing a defeat. I even tried different combinations: --echo --fraction-ref 1 and --fraction-map 1 and every time I get the same output (each time answer.bed has the same file size ). I also tried different echo -e "JX472009\t3328\t5551" limits, e.g. 200-9000 and I still don't see any changes. Each time the output is the same file, based on their size. I assume that when I extract reads the output file should have different size based on the applied parameters. Regardless my changes each time when I vizualize my results in tablet I am experiencing he same output as above.

Maybe something is wrong with my input. I am using sorted bam files. Any hints?

ADD REPLY
0
Entering edit mode

Can you post your bam file somewhere I can download it?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
$ bam2bed < reads.sam.bam > reads.sam.bam.bed
$ bedmap --echo-map --fraction-map 1 --multidelim '\n' <(echo -e "JX472009\t3328\t5551") reads.sam.bam.bed | sort-bed - > answer.bed

Here is a sanity check that verifies that the reads in answer.bed are contained entirely within the query region:

$ bedops --merge answer.bed                                                                                            
JX472009    3328    5551

This could also be confirmed via visualization of answer.bed.

I can't speak to nucleotides or what your visualization tool shows for IDs. I'll have to leave that to you to figure out.

One suggestion is to perhaps use awk or similar on reads.sam.bam.bed, before filtering for reads of interest.

The awk step could be used to swap columns around. If the sequence information is in another field, it might be useful to swap it into the ID field (fourth column). For example:

$ awk -v FS="\t" -v OFS="\t" '{a=$12; $12=$4; $4=a; print $0}' reads.sam.bam.bed > reads.sam.bam.nuc4.bed
$ bedmap --echo-map --fraction-map 1 --multidelim '\n' <(echo -e "JX472009\t3328\t5551") reads.sam.bam.nuc4.bed | sort-bed - > answer.bed
ADD REPLY
0
Entering edit mode

Thanks Alex. I managed to filter them :) Another problem is my bam output after answer.bed convertion to bam. Final sowftware that I use do not recognize this file as bam format. It prints:

input file is not in SAM or BAM format. net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 1, Read name TAATCACCAGCCTGACTGGCCGGGACAAGAACCAAGTGGAGGGCGAGGTCCAGATTGTGTCAACTGCTGCCCAAACTTTCCTGGCGACGTGCATCAACAGGGTATGCTGGACTGTCTACCACGGGGCCGGAACGAGGACCATAGCATCACCCAAGGGTCCCGTAATCCAGATGTATACCAATGTAGACCAAGACCTCGTGGGCTGGCCTGCTCCGCAGGGTTCCCGCTCGTTGACGCCCTGCACCTG, Zero-length read without FZ, CS or CQ tag

I think there is an isse is with bed to bam conversion because I see that sequences are in bam headers.

ADD REPLY

Login before adding your answer.

Traffic: 2689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6