I am stuck at the stage where after aligning my single-end illumina reads using bwa mem to a reference file which consists of a single chromosome (I am aligning each time to a single chromosome due to time limits on my server), I cannot find complete tutorials that explain how to extract contigs/scaffolds from bam files in order to annotate. I've read similar posts here but they all seem to implement a de novo strategy (like velvetg which produces a nice file of contigs) or use annotation programs meant for mammalian genomes or bacteria - the species I work with are fish.
My code thus far (after running FastQC and trimmomatic to clean my fastq reads):
bwa index ref_chr1.fa
bwa mem -B 2 -M -t 40 ref_chr1.fa sp1_trimmed.fastq > sp1_trimmed.sam
samtools view -Sb -@ 40 sp1_trimmed.sam > sp1_trimmed.bam
samtools sort -@ 40 sp1_trimmed.bam > sp1_trimmed_sorted.bam
samtools index sp1_trimmed_sorted.bam
samtools mpileup -s -a -f ref_chr1.fa sp1_trimmed_sorted.bam -o sp1_trimmed_sorted.mpileup
HERE is where I am stuck. Do I even need the mpileup step? Can I just use my sorted and indexed bam files to somehow create scaffolds (no idea how to do that).
I have my species reference file in .gff format as well, as I just want to obtain all the CDS from my species so I can run various tests on each gene which I need separate so I can create MSAs for each (after extracting the gene for about 30 species).
Please any help is much appreciated!
what does it mean ? what do you want to annotate ?
I guess this is the part I am just not sure what to do - I have my sorted, indexed bam files and from an old tutorial where I used velvet for de novo, we had to end up with contig files that can be annotated. I don't know how to get to that step using my bam files after using bwa mem to assemble my reads. From what I've read the bam files just have assembled reads (contigs?) which I should assemble into scaffolds so I can then annotate and extract all protein-coding genes (I have a reference file with CDS sequences as well as the reference genome in .gff format).
I tried producing mpileup files to obtain consensus sequences but that failed - there were a huge amount of gaps in the data and I think it is because my data is exome sequencing and I was mapping to genomic so a lot of the intergenic regions are empty.