Hello dear colleages!
I have recently faced some problem. I have worked with long WGS reads. Firstly I have filtered the longest subset of reads, and aligned them to the custom sequence with several structural variants of the region of interest. After that I have extracted reads mapped to one of variants and detected some reads be much shorter than previously aligned in extracted subset. It is not clear for me what was wrong? May alignment params, or extraction procedure be wrong?
So here is commands I used:
<$INITIAL_READS bioawk -c fastx \
'length($seq) > 10000 \
{ print "@" $name "\n" $seq "\n+\n" $qual }' \
> $FILTERED_READS.fastq
minimap2 -a --secondary=no -L -2 --sam-hit-only -x map-ont -H \
$STRUCTURAL_VARIANTS $FILTERED_READS > lr_2_vars.sam
samtools faidx $STRUCTURAL_VARIANTS
samtools view -S -b lr_2_vars.sam | \
samtools sort > lr_2_vars.sorted.bam
samtools index lr_2_vars.sorted.bam
samtools view -t ${STRUCTURAL_VARIANTS}.fai \
-b lr_2_vars.sorted.bam VARIANT_OF_INTEREST \
> VARIANT.bam
samtools fastq -F 4 VARIANT.bam > EXTRACTED_READS.fastq
So I had this result:
<$EXTRACTED_READS bioawk -c fastx \
'{ print length($seq)}' | sort -V | head
52
54
55
56
56
56
56
62
73
77
Thank you a lot Istvan Albert ! I had thought there be three options for partially aligned reads: 1) without clipping 2) soft clipping with lowcase masking 3) hard clipping with reads been truncated
I have to read more on specification of SAM format and to look on issues on minimap2's github.