1. Background
input: X.R1.fq.gz, X.R2.fq.gz, GRCh38.fa
I'm trying mapping X.R1.fq.gz and X.R2.fq.gz to GRCh38.fa using last 3 times, with 3 seeding schemas (BISF, BISR, NEAR), resulting 3 output.sam files.
output: X.BISF.sam, X.BISR.sam, X.NEAR.sam All the sam files are considered "complete" because each of them contains all the reads in X.R1.fq.gz + X.R2.fq.gz and their mapped info. The major difference among them is that a single read can be different status (different number of records with different FLAGs) due to the seeding schemas.
2. What do I want to do:
Merge all the sam files, and remove the conflicting records for every read. For example, there may be 3 records with same read ID and same primary alignment FLAG, but I need to only keep the best one.
3. My current thoughts (not solution):
concat sam files, sort records by read ID, FLAG and MAPQ (descending order), and only keep the very first record for each unique "read ID - FLAG" combination. something like this:
{
samtools view X.BISF.sam ;
samtools view X.BISR.sam ;
samtools view X.NEAR.sam ;
} |\
sort -k1,1 -k2,2 -k5,5r |\
awk "!seen[1,2]++"
4. Why I post this question
I admit that my understanding about sam format is quite limited and my thought is kinda rough. I'd like to get your help to make the output.sam right. Thank you all.
The is my current workaround:
It only works for last output sam files without quality scores (QUAL field ==
*
). If the QUAL is actual ASCII strings (i.e.!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
), the polars will pop escape character associated error, and we have to fallback to line parsing method.