Hi people,
I have paired-end sequence reads from my target species, which I mapped to the reference genome to produce BAM files. From these BAM files I wish to extract protein-coding genes, using either a homology-based or a de novo approach.
Question: Is it appropriate to convert the BAM files (obtained by mapping to the reference) to FASTQ files, merge the paired reads, and then convert the FASTQ files to FASTA format? The idea is to get FASTA files from the reads for feature prediction. If that is the way to do it, how does one get continuous FASTA sequences? Following the steps above, I get something like this:
>HWI-574/1
TTCTTGGTCCATGTACTGCTGAAGCCCTGGCATGTGAAATGAGTGCAAATGTACAGTAGTTTGAA
>55086/1
TAAAAGTTCTTGGTCCATGTACTGTTTCCTTACTGGCATGTGAAATGAGTGCAAATGTACAGTAGTTTGAA
>HWI-D00466:100:CADMYANXX:7:1111:1943:12062/1
AGTGAAGCAGAAGTGGATATTTTTCTGGAATTCCCTTGCTTTCTCTGTGATCCAAGGGAT
>75804/1
CCCTTGGATCACAGAGAAAGATATCCACTTCTGCTTCACTGACTACACTTAAAGCCTTTGACTGTGT
>16:15787:83520/1
GAAAGCAAGGGAATTCCAGAAAAATATCCACTTCTGCTTTTGACTGTGTGGATCACAACAAGC
I expect to get something like this:
>chr1
TTCTTGGTCCATGTACTGCTGAAGCCCTGGCATGTGAAATGAGTGCAAATGTACAGTAGT
TTGAATAAAAGTTCTTGGTCCATGTACTGTTTCCTTACTGGCATGTGAAATGAGTGCAAA
TGTACAGTAGTTTGAAAGTGAAGCAGAAGTGGATATTTTTCTGGAATTCCCTTGCTTTCT
CTGTGATCCAAGGGATCCCTTGGATCACAGAGAAAGATATCCACTTCTGCTTCACTGACT
ACACTTAAAGCCTTTGACTGTGTGAAAGCAAGGGAATTCCAGAAAAATATCCACTTCTGC
TTTTGACTGTGTGGATCACAACAAGC...
I would appreciate your input on how to go about it.
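To illustrate the gap between the two outputs above: converting BAM to FASTQ to FASTA keeps one record per read, and concatenating those records does not give a chromosome. A continuous sequence has to be built from where each read aligned. Below is a minimal Python sketch of that idea on toy data, assuming ungapped alignments given as hypothetical `(start, sequence)` pairs. The function names are made up for illustration. Real BAM records also carry indels, clipping, and base qualities, which this ignores, so treat it as a picture of the concept and not as a replacement for a proper consensus or variant-calling tool.

```python
from collections import Counter, defaultdict


def consensus_from_alignments(alignments, length, min_depth=1):
    """Majority-vote consensus for one chromosome.

    alignments: iterable of (start, seq) with a 0-based start position
                (toy input; assumes ungapped alignments, no indels).
    length:     chromosome length.
    Positions with fewer than min_depth covering bases become 'N'.
    """
    columns = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < length:
                columns[pos][base] += 1

    consensus = []
    for pos in range(length):
        counts = columns.get(pos)
        if not counts or sum(counts.values()) < min_depth:
            consensus.append("N")  # no (or too little) read support here
        else:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


def format_fasta(name, seq, width=60):
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy reads: overlapping, one with a mismatch at position 4, and a gap at 10-11.
toy_reads = [(0, "ACGTAC"), (3, "TACGGA"), (4, "ACGGAT"), (2, "GTTC"), (12, "TT")]
toy_consensus = consensus_from_alignments(toy_reads, length=14)
toy_record = format_fasta("chr1", toy_consensus)
```

Uncovered positions come out as `N`, which is one reason a consensus built this way is only as complete as the read coverage.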
Hi Albert,
Thank you for the suggestion. So basically, with whole-genome data assembled by reference mapping, pulling out protein-coding genes or mRNA is a complicated task? My main goal is to extract protein-coding genes from this data for downstream work before embarking on the de novo aspect. Let me think through the alternatives you have provided.
Yes, that's extremely complicated :) There's no perfect solution for it. But it seems likely that Istvan's recommendation of FastaAlternateReferenceMaker is the best approach. However, if you are dealing with a microbe or something not closely related to the reference, it might be better to just assemble it and annotate the assembly. The decision is affected by whether it's a eukaryote or prokaryote, and the ploidy. For a polyploid eukaryote, I don't even know of a good way to address the problem without thousands of hours of manual work. But it depends on how important the quality is... if you don't care about phasing and so forth, it's much easier. And if you don't have really long reads (PacBio, Nanopore, 10x), long range phasing is not possible anyway.
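For the simple case (no phasing, a close reference), the alternate-reference idea can be pictured as: substitute the called variants into the reference, then cut genes out using the reference annotation's coordinates. Here is a rough Python sketch of that concept only. It is not how FastaAlternateReferenceMaker is implemented, and the function names and input shapes are hypothetical. It is restricted to SNPs (my simplification) so that reference coordinates stay valid; indels would shift downstream positions and need coordinate lifting.

```python
# Complement table for reverse-complementing minus-strand features.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def apply_snps(reference, snps):
    """Return a copy of `reference` with SNPs substituted in.

    snps: dict mapping a 1-based position to a (ref_base, alt_base) pair
          (hypothetical toy format, standing in for a variant call file).
    Raises ValueError if a stated ref base does not match the reference.
    """
    seq = list(reference)
    for pos, (ref_base, alt_base) in snps.items():
        if seq[pos - 1] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt_base
    return "".join(seq)


def extract_feature(seq, start, end, strand="+"):
    """Slice a feature using 1-based inclusive coordinates (GFF-style).

    Minus-strand features are returned reverse-complemented.
    """
    sub = seq[start - 1:end]
    if strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


# Toy example: a 13 bp "chromosome" with one gene at 3..11 on the plus strand.
toy_reference = "AAATGGCCTAAGG"
toy_sample = apply_snps(toy_reference, {7: ("C", "T")})
toy_gene = extract_feature(toy_sample, 3, 11, "+")
```

This collapses everything to a single haploid sequence, so heterozygous sites and phasing are lost, which is the caveat raised above for polyploid or highly divergent samples.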