Hi All,
I am very new to the field of metagenomics and I am trying to quantify my data. Basically I have metagenome sequencing samples (i.e. reads) in fastq format, these are paired end reads, the quality was pretty bad at first so I trimmed some of the reads to maintain a certain phred score, both of the pairs were dropped if one of them was below the quality score cut off. I want to quantify the depth of sequencing and calculate the coverage FPKM and TPM etc and other statistics using this data. I first mapped this metagenome to representative bacterial genomes using bowtie2 and I used the '-a' argument so that the output keeps multi-mapped reads and not just the best one. After this I obtained a SAM file. now I want to calculate the coverage of my representative genomes using these reads from the SAM file, I want to make my own script in order to understand what's going on rather than using any other script. All the blogs so far have just been pointing to software that's out there which I do not want to use since I would not know what they are actually doing.
I have major confusions of how to do this. Ideally I would want to see how many reads map to each of my genomes, the length of the aligned portions of these genes with the genomes and how many of them are unique and how many are multimapped. The SAM file is very confusing to extract this information and the fact that it is a paired end read its making the subject even more confusing.
1) since it is a paired end read, quantifying DNA for this purpose, Do I count each of the pairs separately as reads being mapped to the genome or I count a pair only as one read being mapped to my genome?
2) what if these pair are of different sizes (since I have trimmed my fastq files for quality control) which length of the pair I record as being mapped to the gene?
3) what if the paired reads map at overlapping positions on the genome? what if they do not overlap at all and there is a gap in between (i.e. there is an insert that between the end of one of the paired reads and the start of the other paired reads) and the reads do not overlap at all? do I count this case as two reads mapping to the genome?
I am very confused of how to proceed, can someone provide me some detailed guidelines of how to do this type of quantification?
thanks for your time.
The more I think about it the more I'm tempted to just simply for each gene record how many reads are being mapped back and the length of each read that is being aligned respectively. And to treat each of the paired end reads separately as different (independent reads mapping back to the same gene). Offcourse I would divide the reads mapped to multiple reference seqs before accounting them for genes coverage.