I think I have combination of two not so usual tasks: 1) mapping of genomic reads (with introns) to transcript reference 2) mapping to reference which is not of the same organism as the mapped reads but rather (very) closely related
More specifically: I have metagenome data (short reads) and I expect that one of the organisms in the sequenced community is eukaryote A in which I'm particularly interested. I have also sequences for few hundreds of transcripts from closely related organism, eukaryote B. Given the relatedness of A an B and the set of transcripts I'm interested in I expect that typical transcript from B will have an ortholog (homolog) in A with more than 70% nucleotide sequence identity.
Would you suggest particular tools or workflow to map the genomic reads from A to transcripts from B and thus reconstruct homologs of B transcripts which are present in A genome? Even being able to assemble from the reads from B some transcripts might be enough for my purpose.
I will try to use bowtie 2 to map the genomic reads to the transcriptome reference so I will basically ignore the genomic reads which map to the splice junctions and I will hope that by playing with bowtie scoring options I will be able to get mapping on nonidentical reference. However, this seems to me really suboptimal and I will be thankful for any suggestions!
ori, thanks for the suggestion! I assembled the reads by SPAdes and the genes of my interest are heavily fragmented as I guess from blasting the assembly - I'm receiving relatively short hits from many assembled contigs for each query (reference transcript).
Mapping with bowtie 2 actually seems to work - the low coverage (generally 0-5) is obvious yet it should provide me enough sequence info for my task. The overall similarity between the reference and the reads is much higher than I was expecting - it might be close to 90% - which might explain why the assembly worked out. However I'm now struggling with extracting the consensus of the aligne reads. I used for that combination of samtools, bcfutils and vcfutils:
However, the consensus I'm getting this way contains a substantial proportion of the reference sequence and I can't find parametres which could modify how the consensus is calculated to have there either Ns for the part of reference where no reads were mapped or the consensus of the mapped reads.
Yet another approach - extracting consensus using IGV viewer (option Copy consensus sequence) gives me exactly the consensus I'm looking for but there is apparently no way how to automatize this for hundreds of sequences.
In case you have any suggestions on extracting and fine-tuning the consensus from bam files I would be thankful for that!