I wrote https://jvarkit.readthedocs.io/en/latest/Biostar9566948/
$ java -jar dist/jvarkit.jar biostar9566948 -R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam -S
@HD VN:1.6 SO:coordinate
@SQ SN:RF01 LN:3302
@SQ SN:RF02 LN:2687
@SQ SN:RF03 LN:2592
@SQ SN:RF04 LN:2362
@SQ SN:RF05 LN:1579
@SQ SN:RF06 LN:1356
@SQ SN:RF07 LN:1074
@SQ SN:RF08 LN:1059
@SQ SN:RF09 LN:1062
@SQ SN:RF10 LN:751
@SQ SN:RF11 LN:666
@RG ID:S1 SM:S1 LB:L1 CN:Nantes
@CO biostar9566948. compilation:20230621100700 githash:8a9a881b htsjdk:3.0.4 date:20230621100755. cmd:-R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam -S
RF01_1_483_2:0:0_3:0:0_41 163 RF01 1 60 1= = 414 483 G 2 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:60 XS:i:0
RF01_8_542_1:0:0_2:0:0_95 99 RF01 8 60 1= = 473 535 A 2 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:69 XS:i:0
RF01_11_507_0:0:0_1:0:0_9e 99 RF01 11 60 1= = 438 497 G 2 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:70 XS:i:0
RF01_12_501_0:0:0_2:0:0_62 99 RF01 12 60 1= = 432 490 C 2 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:70 XS:i:0
RF01_27_590_3:0:0_1:0:0_68 163 RF01 27 60 1= = 521 564 G 2 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:55 XS:i:0
RF01_44_622_1:0:0_1:0:0_3a 99 RF01 44 60 1= = 553 579 C 2 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:65 XS:i:0
$ java -jar dist/jvarkit.jar biostar9566948 -R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam
@HD VN:1.6 SO:coordinate
@SQ SN:RF01 LN:3302
@SQ SN:RF02 LN:2687
@SQ SN:RF03 LN:2592
@SQ SN:RF04 LN:2362
@SQ SN:RF05 LN:1579
@SQ SN:RF06 LN:1356
@SQ SN:RF07 LN:1074
@SQ SN:RF08 LN:1059
@SQ SN:RF09 LN:1062
@SQ SN:RF10 LN:751
@SQ SN:RF11 LN:666
@RG ID:S1 SM:S1 LB:L1 CN:Nantes
@CO biostar9566948. compilation:20230621100700 githash:8a9a881b htsjdk:3.0.4 date:20230621100859. cmd:-R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam
RF01_1_483_2:0:0_3:0:0_41 163 RF01 1 60 1=69S = 414 483 GGCTATTAAAGCTATACAATGGGGCCGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACT 2222222222222222222222222222222222222222222222222222222222222222222222 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:60 XS:i:0
RF01_8_542_1:0:0_2:0:0_95 99 RF01 8 60 1=69S = 473 535 AAAGCTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATG 2222222222222222222222222222222222222222222222222222222222222222222222 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:69 XS:i:0
RF01_11_507_0:0:0_1:0:0_9e 99 RF01 11 60 1=69S = 438 497 GCTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATCCGC 2222222222222222222222222222222222222222222222222222222222222222222222 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:70 XS:i:0
RF01_12_501_0:0:0_2:0:0_62 99 RF01 12 60 1=69S = 432 490 CTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATCCGCA 2222222222222222222222222222222222222222222222222222222222222222222222 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:70 XS:i:0
RF01_27_590_3:0:0_1:0:0_68 163 RF01 27 60 1=69S = 521 564 GTATCATCTAATCTTGTCATAATATTTATCATATATATATAACTCACAATCCGCAGTTCAAATTCCAATA 2222222222222222222222222222222222222222222222222222222222222222222222 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:55 XS:i:0
RF01_44_622_1:0:0_1:0:0_3a 99 RF01 44 60 1=69S = 553 579 CAGAATATTTATCATTTATATATAACTCAGAATCCGCAGTTCAAATTCCAATATACTATTCTTCCAATAG 2222222222222222222222222222222222222222222222222222222222222222222222 MC:Z:M1 RG:Z:S1 NM:i:0 AS:i:65 XS:i:0
Can you further clarify or illustrate?
Essentially I want to use the full-length of the read to accurately map the read to the genome, but only want the first base to contribute to coverage counts. So after mapping reads to the genome, I want to trim each read so only its first base remains.
For instance in the attached picture I've highlight two reads. One maps to the Forward strand and the other maps to the Reverse strand.
The Forward read would be trimmed so that only the 5' T is retained. The Reverse read would be trimmed so that only the 3' A is retained.
If these were the only two reads present, the coverage plot would have values of 1 at base # 590 and 642 and coverage values of 0 everywhere else.
If you're wondering why I want to do this, the technique I use ligates an adapter to the piece of DNA that is closest to a protein of interest. Libraries are sequenced from this adapter, so the first base represents the closest location to the protein of interest. The other bases increase background signal.
Thanks for any help.
Since you are interested in only position 1 where a mapping starts you could simply take that position from your BAM file (column 4 in SAM format) and add those up.
Is this something that could be done with the
samtools
package, or would I need to use text processing such asawk
? Does column 4 take strandedness into account? The description of column 4 says "leftmost mapping position" - does this mean the 5' base of the read?Since the reads are always in 5' --> 3' direction it will always be the 5' end. In case you need to take strand information into account then you can process the
bitwise flags
in column 2 (16 for reverse strand).