Question

Filling gaps in BAM file

0

Entering edit mode

11 months ago

hemr3 ▴ 10

Hi! I have BAM files that contain fairly large numbers of gaps due to the fact they are aDNA data. The BAM files have EOFs and look like this (only a snippet shown below):

11:57001065-57004724
SN7001204_0523_AHJLV3BCXX_R_PEdi_L5727_37_1:1:1106:9922:91821c  16  11  57000970    37  113M    *   0   0   TTCTCTTTACGGCTTCCCCCTCACCATGCTGAAACATTTGCAAGCTGGAGGTTATCCCAAGACACTTTCATGAGAAGAGGCTTTGGACACAACAGTTGCACAAACAGTTTTAT   ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]   X0:i:1  X1:i:0  XT:A:U  RG:Z:L5729  XI:Z:AGCAATC    XJ:Z:CATGCTC    YI:Z:DDDDDII    YJ:Z:DDDDDII    NM:i:0  XP:i:2  MD:Z:113    FF:i:2  Z0:i:0
D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:7:1312:3759:89487   0   11  57000990    37  125M    *   0   0   TTACCATGCTGAAACATTTGCAAGCTGGAGGTTATCCCAAGACACTTTCATGAGAAGAGGCTTTGGACACAACAGTTGCACAAACAGTTTTATTTGATGAACCACAGTGACTAACAGGATCAGAA   B@BBBFGGGGCFGGGGGGGEGGGGGEGFGEGDGGGGFEGGGGGGGGGGG]]]]]]]]]]]]]]]]]]]]]]]]]]]GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCBCC   XI:Z:ATCATAA    YI:Z:A@A<A;B    XJ:Z:AATAGTA    YJ:Z:333<<>F    FF:i:2  Z0:i:0  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:1C123  RG:Z:L5832
SN7001204_0528_BHJL5LBCXX_R_PEdi_L5727_37_6:1:2214:18489:14905  0   11  57001002    37  86M *   0   0   AACATTTGCAAGTTGGAGGTTATCCCAAGACACTTTCATGAGAAGAGGCTTTGGACACAACAGTTGCACAAACAGTTTTATTTGAT  DDDDDIIIII]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]IIIIIDDDDD  XI:Z:TTAATAG    YI:Z:DDDDDII    XJ:Z:GTCCGGC    YJ:Z:DDDDDII    FF:i:2  Z0:i:0  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:12C73  RG:Z:L5728
D00829_0044_ACADL6ANXX_PEdi_Ay_VS_2:5:2213:11747:60758c 147 11  57001005    60  76M =   57000916    -165    ATTTGCAAGCTGGAGGTTATCCCAAGACACTTTCATGAGAAGAGGCTTTGGACACAACAGTTGCACAAACAGTTTT    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]W]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]][T    X0:i:1  X1:i:0  XT:A:U  RG:Z:L5831  XI:Z:CCTAACG    XJ:Z:CAGTACT    YI:Z:BBBCCGD    YJ:Z:BBBBCGG    AM:i:37 SM:i:37 NM:i:0  XP:i:2  MD:Z:76 Z0:i:0
D00829_0050_BCADRUANXX_PEdi_VS_L5848_L5851_1:2:1207:14230:77955c    16  11  57001005    37  86M *   0   0   ATTTGCAAGCTGGAGGTTATCCCAAGACACTTTCATGAGAAGAGGCTTTGGACACAACAGTTGCACAAACAGTTTTATTTGATGAA  ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]  X0:i:1  X1:i:0  XT:A:U  RG:Z:L5848  XI:Z:GCCATGC    XJ:Z:AACCAAG    YI:Z:BB@BBGE    YJ:Z:@BBABGG    NM:i:0  XP:i:2  MD:Z:86 FF:i:2  Z0:i:0

There are clearly gaps here, but I really need no gaps for the pipeline I'm trying to run (protein structure analysis).

I would like to fill these gaps with a human reference sequence (which I already have), and keep a record of how much of the BAM file gaps are filled with the reference sequence.

How would I go about doing this? Thank you!

bamtools samtools gene bam • 805 views

ADD COMMENT • link updated 11 months ago by seidel 11k • written 11 months ago by hemr3 ▴ 10

0

Entering edit mode

aDNA = ancient DNA? By "gaps" do you mean sections of a reference assembly where no reads overlap? How do you detect the gaps? I can imagine identifying gaps, grabbing tiles from the reference assembly, and making up fake reads with unique read IDs that you could merge with your existing BAM file. This would be fairly easy to do with R (GenomicRanges library, Biostrings, etc.) but could also be done a number of other ways. What tools do you prefer?

ADD REPLY • link 11 months ago by seidel 11k

0

Entering edit mode

aDNA = ancient DNA. By gaps, I think I mean just sections of the genome where there was not sufficient coverage or data to produce reliable reads for the BAM file. I'm very open to using any tools - but I do have some proficiency in R! I'm not sure how to take the 'tiles' from the reference assembly and merge them with my existing files. I do need the read IDs to fit into the gaps in the BAM file however. Thank you for your time!

ADD REPLY • link 11 months ago by hemr3 ▴ 10

0

Entering edit mode

A solution that involves making up reads to look like something else feels a bit wrong. Sounds like a so-called XY problem:

https://xyproblem.info/

that goes like this:

User wants to do X.
User doesn't know how to do X, but thinks they can fumble their way to a solution if they can just manage to do Y.
User doesn't know how to do Y either.
User asks for help with Y.
Others try to help user with Y, but are confused because Y seems like a strange problem to want to solve.
After much interaction and wasted time, it finally becomes clear that the user really wants help with X, and that Y wasn't even a suitable solution for X.

I would recommend that you state the original problem rather than coercing a nail to fit a hammer.

Perhaps what you need is a consensus sequence, or perhaps what you need is something else altogether.

Backfilling the entire genomic DNA with made-up reads, where the goal primarily is to perform protein analysis, feels like a stretch.

After all, a single indel mistake, a single splicing signal error, would propagate down and alter the entire protein sequence!

ADD REPLY • link 11 months ago by Istvan Albert 101k

0

Entering edit mode

By gaps, I think I mean just sections of the genome where there was not sufficient coverage or data to produce reliable reads for the BAM file.

I sort of want to draw a picture to verify what you have, and I worry you might be stuck in XY space as Istvan Albert points out. At the moment we have only words, so we must be precise. I want to know that (1) you have DNA reads from a sample, and (2) you have a reference genome you used to align those reads. Is this correct? The definition of a BAM file is an alignment map between reads and a reference. Your question implies there are contiguous sections of the reference where no reads from your sample map to, and that you'd like to generate reads from the reference to fill in this space. Given that you're working with ancient DNA I would guess you want to do some kind of structural analysis comparing ancient sequence to modern (i.e. your reference genome). (I hate guessing, but if so, why not map your reads to the reference to do variant analysis, and then insert the variations into the gene sequences from the reference? There are methods for altering sequences given a file of variation from the reference).

ADD REPLY • link 11 months ago by seidel 11k