Hi, I have a single large BAM file with reads from multiple samples. I want to do one of two things:
- Split the BAM file into multiple files, one per sample, using the barcode for each sample.
- If the above is difficult or will take too long, extract the reads for one specific sample, using its barcode, into a single BAM file.
The first 5 lines of my BAM file look like this:
A00767:101:HWK2TDMXX:1:1101:17508:4288_CAACCTCTGATGGCCA_CTGGTCGT 163 1 145212833 255 37M = 145212870 75 AGGTCCAGGAGGCAGAAGTGAGTCATTTGGGGAGCAG FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF NH:i:1 HI:i:1 AS:i:73 nM:i:0
A00767:101:HWK2TDMXX:1:1101:17508:4288_CAACCTCTGATGGCCA_CTGGTCGT 83 1 145212870 255 38M13S = 145212833 -75 GGCAAGGCAAGTGTAAAAGGGAATTTCAGGGTAGCATTAGGTCCAGGAGGC FFFFFFFFF:FFFFFFFF:FFFFF:FFFFFFF,FFFFFFFFFF::FFFFF: NH:i:1 HI:i:1 AS:i:73 nM:i:0
A00767:101:HWK2TDMXX:1:1101:17616:4288_ATTATCGACATCCTAA_TTTATAAT 163 MT 1199 255 37M = 1775 623 CTATAGAACTAGTACCGCAGGGGAAAGATGAGAGACT FFFFFFFFFFFF,FF:F:F,:FFFFFFFFFF,FFFFF NH:i:1 HI:i:1 AS:i:78 nM:i:2
A00767:101:HWK2TDMXX:1:1101:17616:4288_ATTATCGACATCCTAA_TTTATAAT 83 MT 1775 255 47M = 1199 -623 GTATAACAACTCGGATAACCATTGTTAGTTAATCAGACTATAGGCAA FFFFFFFFFFFF::FFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:78 nM:i:2
A00767:101:HWK2TDMXX:1:1101:17671:4288_TAACTAGCACCTGCTT_GCTTACAA 163 11 30648176 255 37M = 30648289 164 GGAGGAGGAGGAGGAGAAAGAGGAAAAGGAAAAGGGA FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF NH:i:1 HI:i:1 AS:i:86 nM:i:0
I would greatly appreciate the help!
You want to split the files based on the indexes (CAACCTCTGATGGCCA_CTGGTCGT) in the read name?
Yes. Preferably by the index sequence to the left of the underscore (dual index). I believe the right part is the UMI.
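
For the second option (pulling out one sample), here is a minimal sketch of one way to do it. It assumes every read name ends in `_<index>_<UMI>`, as in the lines posted above, and that the input is tab-separated SAM text streamed from `samtools view -h`. The names `barcode_of`, `filter_sam` and `extract_barcode.py` are made up for illustration and are not part of any existing tool.

```python
import sys


def barcode_of(read_name):
    """Return the index sequence from a read name shaped like <id>_<index>_<UMI>.

    Assumption: the last two '_'-separated fields are the index and the UMI.
    """
    parts = read_name.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"read name has no _<index>_<UMI> suffix: {read_name}")
    return parts[1]


def filter_sam(lines, wanted_barcode):
    """Yield SAM header lines plus the alignments whose index equals wanted_barcode."""
    for line in lines:
        if line.startswith("@"):
            # Header lines pass through so the output can be converted back to BAM.
            yield line
            continue
        read_name = line.split("\t", 1)[0]
        if barcode_of(read_name) == wanted_barcode:
            yield line


if __name__ == "__main__" and len(sys.argv) == 2:
    # Read SAM from stdin, write the matching records to stdout.
    sys.stdout.writelines(filter_sam(sys.stdin, sys.argv[1]))
```

Saved as `extract_barcode.py`, it could be used in a pipeline such as `samtools view -h input.bam | python extract_barcode.py CAACCTCTGATGGCCA | samtools view -b -o sample.bam -`, where `input.bam` and `sample.bam` are placeholder file names. Both mates of a pair share the same read name, so they end up in the same output. Splitting into one file per sample would follow the same idea, writing each record to an output chosen by `barcode_of`, but with many barcodes you may run into limits on the number of open files.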