Question

Extracting reads from one barcode in a bam file

0

Entering edit mode

4.0 years ago

msrinivasan • 0

Hi, I have a single large bam file with reads from multiple samples. I want to do one of 2 things.

split the bam file into multiple files each with reads from different samples using the barcode for each sample.
If the above is difficult or will take too long, then I want to be able to extract the reads for one specific samples using the barcode into one bam file.

The first 5 lines of my bam file looks like this:

A00767:101:HWK2TDMXX:1:1101:17508:4288_CAACCTCTGATGGCCA_CTGGTCGT        163     1       145212833       255     37M     =       145212870       75      AGGTCCAGGAGGCAGAAGTGAGTCATTTGGGGAGCAG   FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF      NH:i:1  HI:i:1  AS:i:73 nM:i:0
A00767:101:HWK2TDMXX:1:1101:17508:4288_CAACCTCTGATGGCCA_CTGGTCGT        83      1       145212870       255     38M13S  =       145212833       -75     GGCAAGGCAAGTGTAAAAGGGAATTTCAGGGTAGCATTAGGTCCAGGAGGC     FFFFFFFFF:FFFFFFFF:FFFFF:FFFFFFF,FFFFFFFFFF::FFFFF:        NH:i:1  HI:i:1  AS:i:73 nM:i:0
A00767:101:HWK2TDMXX:1:1101:17616:4288_ATTATCGACATCCTAA_TTTATAAT        163     MT      1199    255     37M     =       1775    623     CTATAGAACTAGTACCGCAGGGGAAAGATGAGAGACT   FFFFFFFFFFFF,FF:F:F,:FFFFFFFFFF,FFFFF   NH:i:1     HI:i:1  AS:i:78 nM:i:2
A00767:101:HWK2TDMXX:1:1101:17616:4288_ATTATCGACATCCTAA_TTTATAAT        83      MT      1775    255     47M     =       1199    -623    GTATAACAACTCGGATAACCATTGTTAGTTAATCAGACTATAGGCAA FFFFFFFFFFFF::FFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFF    NH:i:1  HI:i:1  AS:i:78 nM:i:2
A00767:101:HWK2TDMXX:1:1101:17671:4288_TAACTAGCACCTGCTT_GCTTACAA        163     11      30648176        255     37M     =       30648289        164     GGAGGAGGAGGAGGAGAAAGAGGAAAAGGAAAAGGGA   FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF      NH:i:1  HI:i:1  AS:i:86 nM:i:0

I would greatly appreciate the help!

extract bam barcode • 6.9k views

ADD COMMENT • link 4.0 years ago by msrinivasan • 0

0

Entering edit mode

You want to split the files based on the indexes (CAACCTCTGATGGCCA_CTGGTCGT) in the read name?

ADD REPLY • link 4.0 years ago by GenoMax 150k

0

Entering edit mode

Yes. Preferably by the index sequence to the left of the underscore (Dual index). I believe the right part is the UMI.

ADD REPLY • link 4.0 years ago by msrinivasan • 0

score 0 · Answer 1 · 2021-04-01

0

Entering edit mode

4.0 years ago

GenoMax 150k

If you don't know what indexes are being used for samples you can get those from

awk -F " " '{print $1}' your.sam | awk -F "_" '{print $2}' - | sort | uniq

Then you can simply grep out the lines you need for each. You will need to reconstruct valid SAM headers if you want to use the files for downstream analysis.

This could also be done via scripting as well. Someone may be along with that answer later.

ADD COMMENT • link 4.0 years ago by GenoMax 150k

0

Entering edit mode

I know what indexes are being used. Lets say I want the reads from a particular index. I used the following command and it gave me a fairly large file (6.6GB).

samtools view analysis.Aligned.out.bam | less -S | grep TGCTAACGAGAAGTAA | uniq -c > reads_per_barcode

I think you are mentioning that I need to add header to this, right? However, I am not sure what and how I should add.

ADD REPLY • link 4.0 years ago by msrinivasan • 0

0

Entering edit mode

Save the header from your original BAM file.

samtools view -H analysis.Aligned.out.bam > header

Then cat it at beginning of separated files before converting them back to BAM.

Note: I am not sure what uniq will do in your command. It may eliminate secondary alignments so be careful with that.

ADD REPLY • link 4.0 years ago by GenoMax 150k

0

Entering edit mode

Thank you. Then, should I just do

samtools view analysis.Aligned.out.bam | less -S | grep TGCTAACGAGAAGTAA > reads_per_barcode

ADD REPLY • link 4.0 years ago by msrinivasan • 0

0

Entering edit mode

There is no need for less. We are assuming that the index sequences are unique. There is a chance that index sequence may be picked up from sequence field so it may be safer to grep with underscores at beginning and end e.g. grep "_TAACTAGCACCTGCTT_". Perhaps that is why you got more data than expected with plain grep.

ADD REPLY • link 4.0 years ago by GenoMax 150k

0

Entering edit mode

I completed the process and added the header to the beginning of the file. I then renamed the file to .bam (not sure if that was the right way to do it). However, when I am trying to do make a .bai file, I get the error - that the file is not BGZF compressed.

ADD REPLY • link 4.0 years ago by msrinivasan • 0

0

Entering edit mode

No that is not the correct way to do it. You will need to convert the file to BAM using samtools view -b your.sam -o your.bam. If you are going to need to sort and index then you can do it in a single step samtools view -b your.sam | samtools sort --write-index -o sorted.bam -.

ADD REPLY • link 4.0 years ago by GenoMax 150k

0

Entering edit mode

Thank you very much. All this worked :)

ADD REPLY • link 4.0 years ago by msrinivasan • 0

score 0 · Answer 2 · 2021-04-02

I wrote a tool to split a BAM using a regular expression on the read names. The parenthesis are used to capture one or more group. See http://lindenb.github.io/jvarkit/Biostar9462889.html

Example:

$ java -jar dist/biostar9462889.jar --manifest jeter.mf -o TMP --regex '^(RF[0-9]+)_' src/test/resources/S1.bam
WARNING: BAM index file /home/lindenb/src/jvarkit/src/test/resources/S1.bam.bai is older than BAM /home/lindenb/src/jvarkit/sr
c/test/resources/S1.bam
[INFO][Biostar9462889]Creating output for "RF01" N=1
[INFO][Biostar9462889]Creating output for "RF02" N=2
[INFO][Biostar9462889]Creating output for "RF03" N=3
[INFO][Biostar9462889]Creating output for "RF04" N=4
[INFO][Biostar9462889]Creating output for "RF05" N=5
[INFO][Biostar9462889]Creating output for "RF06" N=6
[INFO][Biostar9462889]Creating output for "RF07" N=7
[INFO][Biostar9462889]Creating output for "RF08" N=8
[INFO][Biostar9462889]Creating output for "RF09" N=9
[INFO][Biostar9462889]Creating output for "RF10" N=10
[INFO][Biostar9462889]Creating output for "RF11" N=11
[WARN][Biostar9462889]0 read(s) where lost because the regex '^(RF[0-9]+)_' failed.

$ ls TMP/*.bam
TMP/split.000001.bam  TMP/split.000004.bam  TMP/split.000007.bam  TMP/split.000010.bam
TMP/split.000002.bam  TMP/split.000005.bam  TMP/split.000008.bam  TMP/split.000011.bam
TMP/split.000003.bam  TMP/split.000006.bam  TMP/split.000009.bam

$ cat jeter.mf
RF01    TMP/split.000001.bam
RF02    TMP/split.000002.bam
RF03    TMP/split.000003.bam
RF04    TMP/split.000004.bam
RF05    TMP/split.000005.bam
RF06    TMP/split.000006.bam
RF07    TMP/split.000007.bam
RF08    TMP/split.000008.bam
RF09    TMP/split.000009.bam
RF10    TMP/split.000010.bam
RF11    TMP/split.000011.bam