Hi,
I am currently using the GATK PathSeqPipelineSpark to extract microbial sequences from bulk-RNA sequencing data. During this process, a .bam file was generated, which I subsequently filtered using grep -v "*", retaining only the sequences that were successfully matched to microbes, resulting in a bacteria.bam file.
My goal is to construct a phylogenetic tree based on the sequences within this bacteria.bam file. To achieve this, I need to annotate each sequence's header line in the BAM file with the corresponding tax_id. However, I am unable to locate the taxonomic information (tax_id) within the BAM file.
I attempted the following approach:
- I used
grep -v "*"
to retain the successfully aligned microbial sequences. - I extracted the reference sequence IDs ($3) and the sequences ($10) from the BAM file and generated a FASTA file.
- I downloaded the accession2taxid file from NCBI and attempted to match the reference sequence IDs from the BAM file with the accessionID in the accession2taxid file. I then replaced the header lines in the FASTA file with the corresponding tax_id.
- This resulted in a FASTA file with headers representing tax_id and their corresponding sequences.
- To validate this approach, I compared the tax_id in the resulting FASTA file with those in the score.txt file generated by PathSeqPipelineSpark, but I found that the tax_id from score.txt could not be found in the FASTA file.
I am unsure where I went wrong in this process. Could someone please guide me on how to correctly process the bacteria.bam file and score.txt file from PathSeqPipelineSpark to construct a phylogenetic tree? Alternatively, if there is another method to achieve this, I would greatly appreciate your advice.
Thank you for your assistance!
Best regards,
Tongzhen