Question

Merge and unique multiple complete sam files from identical fastq files

0

Entering edit mode

3 months ago

JustinZhang ▴ 140

1. Background

input: X.R1.fq.gz, X.R2.fq.gz, GRCh38.fa

I'm trying mapping X.R1.fq.gz and X.R2.fq.gz to GRCh38.fa using last 3 times, with 3 seeding schemas (BISF, BISR, NEAR), resulting 3 output.sam files.

output: X.BISF.sam, X.BISR.sam, X.NEAR.sam All the sam files are considered "complete" because each of them contains all the reads in X.R1.fq.gz + X.R2.fq.gz and their mapped info. The major difference among them is that a single read can be different status (different number of records with different FLAGs) due to the seeding schemas.

2. What do I want to do:

Merge all the sam files, and remove the conflicting records for every read. For example, there may be 3 records with same read ID and same primary alignment FLAG, but I need to only keep the best one.

3. My current thoughts (not solution):

concat sam files, sort records by read ID, FLAG and MAPQ (descending order), and only keep the very first record for each unique "read ID - FLAG" combination. something like this:

{
samtools view X.BISF.sam ;
samtools view X.BISR.sam ;
samtools view X.NEAR.sam ;
} |\
sort -k1,1 -k2,2 -k5,5r |\
awk "!seen[1,2]++"

4. Why I post this question

I admit that my understanding about sam format is quite limited and my thought is kinda rough. I'd like to get your help to make the output.sam right. Thank you all.

samtools aligment sambamba WGBS WGS • 527 views

ADD COMMENT • link 3 months ago by JustinZhang ▴ 140

0

Entering edit mode

The is my current workaround:

from argparse import ArgumentParser
from pathlib import Path
import polars as pl


def main():
    parser = ArgumentParser(description=('Merge and unique BISF, BISR and '
                                         'NEAR sam files to output.sam file.'))
    parser.add_argument('--bisf', help='Path to the BISF sam file.')
    parser.add_argument('--bisr', help='Path to the BISR sam file.')
    parser.add_argument('--near', help='Path to the NEAR sam file.')
    parser.add_argument('--sam_tmp', help='Path to the tmp sam file.')
    args = parser.parse_args()

    for _f in (args.bisf, args.bisr, args.near):
        if not Path(_f).exists():
            raise FileNotFoundError(f'File {_f} does not exist.')

    Path(args.sam_tmp).parent.mkdir(exist_ok=True)

    reads_df: pl.LazyFrame = (
        pl.concat(items=[
            pl.scan_csv(args.bisf, separator='\t', has_header=False,
                        schema={'QNAME': pl.String, 'FLAG': pl.UInt16, 'RNAME': pl.String,
                                'POS': pl.Int64, 'MAPQ': pl.Int64, 'CIGAR': pl.String,
                                'RNEXT': pl.String, 'PNEXT': pl.Int64, 'TLEN': pl.Int64,
                                'SEQ': pl.String, 'QUAL': pl.String}),
            pl.scan_csv(args.bisr, separator='\t', has_header=False,
                        schema={'QNAME': pl.String, 'FLAG': pl.UInt16, 'RNAME': pl.String,
                                'POS': pl.Int64, 'MAPQ': pl.Int64, 'CIGAR': pl.String,
                                'RNEXT': pl.String, 'PNEXT': pl.Int64, 'TLEN': pl.Int64,
                                'SEQ': pl.String, 'QUAL': pl.String}),
            pl.scan_csv(args.near, separator='\t', has_header=False,
                        schema={'QNAME': pl.String, 'FLAG': pl.UInt16, 'RNAME': pl.String,
                                'POS': pl.Int64, 'MAPQ': pl.Int64, 'CIGAR': pl.String,
                                'RNEXT': pl.String, 'PNEXT': pl.Int64, 'TLEN': pl.Int64,
                                'SEQ': pl.String, 'QUAL': pl.String})],
                  how='vertical')
          .sort(by=['QNAME', 'FLAG', 'MAPQ'], descending=[False, False, True])
          .unique(subset=['QNAME', 'FLAG'], keep='first'))

    reads_df.sink_csv(path=args.sam_tmp, separator='\t', include_header=False)


if __name__ == '__main__':
    main()

It only works for last output sam files without quality scores (QUAL field ==*). If the QUAL is actual ASCII strings (i.e. !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65), the polars will pop escape character associated error, and we have to fallback to line parsing method.

ADD REPLY • link 3 months ago by JustinZhang ▴ 140

score 2 · Accepted Answer · 2025-04-05

I wrote https://jvarkit.readthedocs.io/en/latest/SamRemoveDuplicatedNames/ this program just removes the duplicated name+flag but doesn't look at the mapq, at least the duplicates reads will be removed.

your problem would be (not tested) :

# convert the sam to bam , index each bam, and then

samtools merge -u -O BAM -o '-'  *.bam  | java -jar jvarkit.jar samrmdupnames

update: to keep the best MAQ, the java code should be changed as:

index a30e02a52..ca1875f34 100644
--- a/src/main/java/com/github/lindenb/jvarkit/tools/samrmdupnames/SamRemoveDuplicatedNames.java
+++ b/src/main/java/com/github/lindenb/jvarkit/tools/samrmdupnames/SamRemoveDuplicatedNames.java
(...)
@@ -190,16 +190,22 @@ public class SamRemoveDuplicatedNames extends Launcher {
                                        }
                                }
                        }
-                       else
+                       
+               else
                        {
                        int x=0;
                        while(x +1 < buffer.size()) {
-                               final SAMRecord rec1 = buffer.get(x);
+                               SAMRecord best_mapq = buffer.get(x);
                                int y = x+1;
                                while(y < buffer.size()) {
                                        final SAMRecord rec2 = buffer.get(y);
-                                       final boolean same_name = sameRead(rec1,rec2);
+                                       final boolean same_name = sameRead(best_mapq,rec2);
                                        if(same_name) {
+                                               //update 20250405, read at[y] as better MAPQ than read[x], just keep that read[y]
+                                               if(best_mapq.getMappingQuality() < rec2.getMappingQuality()) {
+                                                       best_mapq  = rec2;
+                                                       buffer.set(x, best_mapq);
+                                                       }
                                                buffer.remove(y);
                                                n_removed++;
                                                }

this change will be present in the next release.