4.4 years ago
nonaddldy
I am working with a BAM file from the 1000 Genomes Project (BioSample SAMN00797154, DNA ID NA07051). I extracted the reads aligned to chromosome Y, but found that their sequences look completely different even when they are aligned to the same position. How should I select the read sequences if I want to assemble them?
A00132:45:HCFCVDSXX:4:1340:13810:3020 99 chrY 2781477 60 13S137M = 2781668 342 GATTAAAAGAAGTTAAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGATTGGCGGATCACGAGGTCAGGAGATCGAGACCATCTTGGCTAACACCGCGAAACCCCGTCTCTACTAAAAATACAAAAAAAT ??????????????????????????5????????????????+??????5??55???????????+??5????+??+?????+??????????????????????????????????????+??5???????????????????????+ MC:Z:94M1D56M PG:Z:MarkDuplicates MQ:i:60 AS:i:132 XS:i:102 MD:Z:0N0N0N53G80 NM:i:4 RG:Z:NA07051_CCAAGTCT-AAGGATGA_HCFCVDSXX_L004
A00217:59:HCFNWDSXX:1:1622:18756:6449 99 chrY 2781477 42 40S110M = 2781841 514 ATGCTCAGGACTCAGACCTTAGTTATAGATTAAAAGAAGTTAAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGATGGGCGGATCACGAGGTCAGGAGATCGAGACCATCTTGGCTAACACCGCGAAACC ??????????????????5??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? SA:Z:chrY,12476915,-,107S40M3S,0,0; XA:Z:chr18,-70930457,107M43S,3; MC:Z:150M PG:Z:MarkDuplicates MQ:i:60 AS:i:110 XS:i:93 MD:Z:0N0N0N107 NM:i:3 RG:Z:NA07051_CCAAGTCT-AAGGATGA_HCFNWDSXX_L001
A00217:59:HCFNWDSXX:3:1605:10366:13401 163 chrY 2781477 60 22S128M = 2781693 367 TTAGTTATAGATTAAAAGAAGTTAAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGATGGGCGGATCACGAGGTCAGGAGATCGAGACCATCTTGGCTAACACCGCGAAACCCCGTCTCTACTAAAAATA ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????+?????????????????? XA:Z:chr18,-70930439,125M25S,3; MC:Z:68M1D82M PG:Z:MarkDuplicates MQ:i:60 AS:i:128 XS:i:110 MD:Z:0N0N0N125 NM:i:3 RG:Z:NA07051_CCAAGTCT-AAGGATGA_HCFNWDSXX_L003
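I think the reads are soft-clipped, so the clipped bases are still included in the read sequence. The reported position refers to the first aligned base, not the first base of the read, which is why the sequences start differently. In your example you can cut off the first 13 bases of read 1, the first 40 bases of read 2 and the first 22 bases of read 3.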
Thanks for your advice. Now I am puzzled about how to find out how long the clipped sequences are. Can I get this information from the file itself, without running the alignment again?
It's in the CIGAR string (the sixth column of each record): 13S, 40S and 22S.
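As a rough illustration of reading the clip lengths from the CIGAR, here is a minimal Python sketch. It uses only the standard library, and the function names `soft_clip_lengths` and `trim_soft_clips` are made up for this example rather than taken from any tool. It assumes you pass in the CIGAR and SEQ fields of a SAM record as plain strings, and it only removes soft clips (`S`) at the ends of the read; hard clips (`H`) are skipped because those bases are not present in SEQ anyway.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clip lengths for a CIGAR string."""
    # Hard clips only occur at the ends and are absent from SEQ, so drop them.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    leading = ops[0][0] if ops and ops[0][1] == "S" else 0
    trailing = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return leading, trailing


def trim_soft_clips(seq, cigar):
    """Return SEQ with the soft-clipped bases at both ends removed."""
    leading, trailing = soft_clip_lengths(cigar)
    return seq[leading:len(seq) - trailing]


if __name__ == "__main__":
    # CIGAR strings of the three reads shown in the question.
    for cigar in ("13S137M", "40S110M", "22S128M"):
        print(cigar, soft_clip_lengths(cigar))
```

If you trim the sequence this way, the base quality string (the column after SEQ) would need the same slice so the two stay the same length.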
Thanks a lot.