I have an assembly in FASTA format. Aligning the raw reads against it produced a .bam file, which I converted to .sam.
The alignment lines in the .sam file look like this:
A00321:42:HLLVYDSXX:2:2302:6153:3505 99 NODE_1_length_3415511_cov_137.721502 16 60 128M = 607 742 CGATTAGTCCGGCCAAATCGCCGTCGAGCGCAATGAACATAACGGTCTTGCCCTCAGCGCGCAGCGCATCGGCCTTGGCGTCGATTGTGGAGTGCTCGACGCCCATGATGTCCATCATAGCACCATTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RX:Z:TTGAGGGTATAGTAGT QX:Z:FFFFFFFFFFFFFFFF TR:Z:GACACCG TQ:Z:FFFFFFF BC:Z:AGTTGCAG QT:Z:FFFFFFFF XS:i:-10 AS:i:0 XM:Z:0 AM:Z:0 XT:i:1 RG:Z:over_1kb:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
Broken down into the mandatory fields, a record looks like this (this example is a different read from the line above):
QNAME: A00321:42:HLLVYDSXX:1:1644:2248:3881
FLAG: 99
RNAME: NODE_1_length_3415511_cov_137.721502
POS: 1
MAPQ: 60
CIGAR: 1S127M
RNEXT: =
PNEXT: 536
TLEN: 386
SEQ: ATCGGGTCTGACACCGCGATTAGTCCGGCCAAATCGCCGTCGAGCGCAATGAACATAACGGTCTTGCCCTCAGCGCGCAGCGCATCGGCCTTGGCGTCGATTGTGGAGTGCTCGACGCCCATGATGTC
QUAL: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I'm actually interested in the metadata. I want to know how the RX and BC tags are distributed across the scaffolds in the original assembly.
I imagine the .sam file already contains the information about the assembly used to produce it. That is only an assumption, so please correct me if I'm wrong.
What I want to do is, for each read in the .sam file, find its position in the assembled scaffold and record: Read_ID, Scaffold_ID, Read_Position_Inside_Scaffold, RX, BC.
Then I want to use that table to analyse the distribution of RX and BC inside each scaffold.
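To make the target table concrete, here is a minimal Python sketch of the extraction I have in mind. It relies on the SAM specification's definitions of RNAME (reference sequence name, which here would be the scaffold) and POS (1-based leftmost mapped position), and it assumes the real file is tab-separated, unlike the space-separated paste above. The function names `parse_sam_line` and `sam_to_table` are my own and purely illustrative.

```python
import csv


def parse_sam_line(line):
    """Return (read_id, scaffold_id, position, rx, bc), or None to skip the line."""
    if line.startswith("@"):
        return None  # header line, not an alignment record
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return None  # fewer than the 11 mandatory fields: not a valid record
    if int(fields[1]) & 0x4:
        return None  # FLAG bit 0x4 set: the read is unmapped
    # Optional fields have the form TAG:TYPE:VALUE; the value may itself contain ':'.
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    # QNAME, RNAME (assumed to be the scaffold), POS, then the two tags of interest.
    return (fields[0], fields[2], int(fields[3]), tags.get("RX"), tags.get("BC"))


def sam_to_table(sam_lines, out):
    """Write one CSV row per mapped read to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["Read_ID", "Scaffold_ID", "Read_Position_Inside_Scaffold", "RX", "BC"])
    for line in sam_lines:
        row = parse_sam_line(line)
        if row is not None:
            writer.writerow(row)
```

It could be called as `sam_to_table(open("reads.sam"), sys.stdout)`, where `reads.sam` is a placeholder file name. Reads that lack an RX or BC tag get an empty cell in that column.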
Ultimately, I'm trying to evaluate the quality of my assemblies based on the barcode distribution.
I'm comfortable with programming and parsing. I just can't figure out where in the .sam file I can find the scaffold, and the position within that scaffold, for each read.