How to subset large BAM files specifically/ extract specific subsets?
0
0
Entering edit mode
14 months ago
ella • 0

Dear community,

I have huge BAM files (mapped using BBTools+coordinate-sorted via SAMTOOLS), which I used for variant calling. Now I would like to manually check specific positions/ variants in the mapped data using Tablet and/or IGV. Unfortunately, the BAM files are extremely large, rendering it impossible to download them from the HPC I am using and check them on my laptop.

My reference is a huge multi-fasta file containing thousands of transcripts.

To make the files more handy, I was aiming to extract only those portions of the BAM file, that correspond to my transcripts (~chromosomes) of interest.

I tried

samtools view -bh input.sorted.bam chr1 chr2 > output.bam

as suggested in a previous post (Subset Bam File According To List Of Contigs) but no reads were showing up in IGV or Tablet, when using either the complete fasta reference or a subset as genome input.

Also found this post: How To Split A Bam File By Chromosome. I donwloaded and installed bamtools, because the

bamtools split 

option was suggested. Running bamtools split --help, I did not find any option to limit the output of the function to a specific subset of the reference contigs; splitting into all reference contigs will be an extreme amount of .bam subfiles.

After trying to find an option on my own, I would now be very thankful for any handy suggestions!

Thanks in advance,

Ella

bamtools RNAseq samtools BAM • 2.7k views
ADD COMMENT
1
Entering edit mode

What is the output of samtools view -H input.sorted.bam? My guess is that your chromosomes are simply not called chr1, chr2... but maybe 1,2, or so...because the command itself is fine.

ADD REPLY
0
Entering edit mode

Thank you for the fast answer.

The output looks like this:

@SQ     SN:Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3        LN:1307

Is the problem the complexity of the long headers in my multi-fasta file?

ADD REPLY
0
Entering edit mode

This header says that there is a single chromosome/contig in the reference you aligned against, and this is called Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3. THere is no chr1, chr2...

ADD REPLY
0
Entering edit mode

I´m sorry for my very unclear message, it contains thousands of lines with the format that I texted, each line looking like this:

@SQ    SN:'sub-fasta-header"     LN:'a number'

@SQ    SN:'sub-fasta-header"     LN:'a number'

 @SQ    SN:'sub-fasta-header"     LN:'a number'

@SQ    SN:'sub-fasta-header"     LN:'a number'

 .....

"Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3" is the header of one of the fastas in my multi-fasta reference.

Sorry, I hope I explained it better now.

ADD REPLY
1
Entering edit mode

but no reads were showing up in IGV or Tablet, when using either the complete fasta reference or a subset as genome input.

My guess is you may have created a headerless BAM after the subset. Be sure to include the header. Do you see the headers when you do samtools view -H subset.bam?

ADD REPLY
0
Entering edit mode

Thank you for the fast answer.

I think I probably did a mistake in subsetting, as I did probably not use the whole header but only a part of the header. Using samtools view -H subset.bam produces a huge output, so it seems that subsetting didn´t work. I´ll try to subset using the whole header in single quotes and check again!

ADD REPLY
0
Entering edit mode

I´m not able to create a subset. :( I tried tons of versions of samtools view, for example the following:

On the one hand, I tried to use my subsetted fasta file as input, using the -T option (How to subset bamfile based on chromosome IDs):

On the other hand, I tried to use the fasta headers of interest in a text file or give all headers in quotation marks as suggested here: Subset Bam File According To List Of Contigs

And even to extract the bam corresponding to only one of the contigs I´m interested in; using single long headers in single quotation marks (e.g. 'Pp3c26_12460V3.1 pacid=32917486 locus=Pp3c26_12460 annot-version=v3.3') via samtools view -bh bamfile.sam-fixmate-sorted-deduped.bam 'Pp3c17_14230V3.4 pacid=32906262 locus=Pp3c17_14230 annot-version=v3.3' > test.bam

Either I get some error, or I get a test.bam, which, when I run samtools view -H subset.bam, still contains all the headers, as the source bam file.

No clue what I am doing wrong...

ADD REPLY
0
Entering edit mode

Result file will still contain the original headers. Issue appears to be that you are unable to use those long names with spaces.

ADD REPLY
0
Entering edit mode

Okay, I understand. I was also not able to succeed with the BED file. I will now reduce the headers in my reference to the IDs (e.g. "Pp3c17_14230V3.4") and re-run mapping and calling with those. I hope I will finally succeed then...

ADD REPLY
1
Entering edit mode

You may be able to do this by using the tool reformat.sh from BBTools without remapping.

Can you try:

reformat.sh in=your.bam out=fixed.bam trd=t

This should truncate the reference names after the first space.

In future you can add the trd=t flag to your bbmap.sh command to shorten the header names. BBMap passed through long names as is as opposed to other aligners (most?) that truncate the names after first space.

ADD REPLY
0
Entering edit mode

I think I even used reformat.sh earlier today, but don´t remember with what arguments. I now tried what you suggested, but got an error:

java.lang.AssertionError: Non-digit character with ASCII code -52 was encountered. x=-100; char= array= `here the non-ASCII characters which I cannot post on Biostars`, start=205, stop=308
        at shared.Parse.parseInt(Parse.java:283)
        at stream.SamLine.<init>(SamLine.java:493)
        at stream.SamReadInputStream.toReadList(SamReadInputStream.java:119)
        at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:90)
        at stream.SamReadInputStream.hasMore(SamReadInputStream.java:54)
        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:668)
        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:657)

In terms of adding trd=t to the bbmap.sh, thank you for that information, cool! I started to re-run scripts now with short-header-fasta as reference anyways; do you think the one or other option (using fasta with shortened headers vs. bbmap.sh with trd=t option) would have other advantages or disadvantages?

ADD REPLY
0
Entering edit mode

do you think the one or other option (using fasta with shortened headers vs. bbmap.sh with trd=t option) would have other advantages or disadvantages?

No advantage. In case you wanted to retain the long headers you could reformat the BAM afterwards.

Not sure why you got the error above. Perhaps you are using a non-US locale.

ADD REPLY
0
Entering edit mode

It worked now, after re-mapping with short-headed FASTA! :) Thanks a lot again for your time.

ADD REPLY
0
Entering edit mode

Hi, just a little feedback: I tried reformat.sh as you suggested, as well, and it also fixed the problem :) Thanks again!

ADD REPLY
1
Entering edit mode

Now I would like to manually check specific positions/ variants in the mapped data using Tablet and/or IGV.

A better option may be to create small BAM's using the region you are interested in. You can do that by providing co-ordinates or a BED file containing regions of interest.

ADD REPLY
0
Entering edit mode

Thank you for the suggestion. I think I did something similar some years ago. Will try it! So hard to get back into the topic after a long period of only wetlab...

ADD REPLY

Login before adding your answer.

Traffic: 2268 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6