Hello everyone,
Sorry for posting a simple question but although I have worked with some assembly before, I never worked with data at the whole genome scale and I would like to ask for some advice.
I followed this tutorial on biostairs: Reference Assembly - Mapping Reads To A Reference Genome but I ended up just converting my mapped to fastq (so I did not obtain mpileup files as I wasn't doing any variant calling - although this will likely change once I get my first basic pipeline working).
My question now is...what can I do with my mapped reads? Most tutorials like the one I linked sort of just end off after gathering bam files or converting them into fastq files of mapped reads and then it sort of just stops there.
Looking at other biostars forums looks like the next thing I need to do is to annotate my mapped reads - but how do I do that? I have an annotation.gff file but I really don't understand how do I link my bam files to the annotation.gff file. The coordinates in the bam file should be identical to the .gff file right? Am I supposed to annotate my bam files and append more information into it? And if so, is this something I can do with samtools or something similar?
My end goal right now is to visualize my assembly and get information on how many genes mapped to the ref genome, which genes mapped (i.e. annotations!), and just overall quality of the mapping. But I just don't understand how to get the information from the annotation.gff file onto my mapped reads in my bam file (which btw are all still separate sequences in a read1.fastq and read2.fastq file). Any tips/advice is much appreciated. Thank you for reading!
It looks like you just aligned your reads to a reference assembly. You did not actually do an assembly. These are two different workflows.
That said what was the starting point of this entire exercise. What question are you asking of this data?
Doing an actual assembly would mean de novo right? So without a reference genome? I also do have a a transcript.fasta.gz file, along with a annotation.gff.gz, a RepeatMasked.gff.gz file and a Repeat.fasta.gz (genome file). The data was given to me to work on although I will specifically be working on unmapped reads later on (trying to figure out what other species were sequenced). Right now, I just want to know what I can do with my mapped reads and the annotation file. I've never worked with transcript data before, so I am assuming since I have this file I should be able to get some expression analysis done but this would literally be my first time even seeing this type of data.
From my understanding, I can load the annotation.gff file and my mapped reads, along with my genome.fasta file into IGV and see the data but just looking at it in IGV is not meaningful. I need to gather stats - like what genes are mapped, and possibly how well they mapped to the genome? I also am trying to understand what else people can do with the annotation file and mapped reads.
I will probably also be asked later on to work with the transcript.fasta.gz file for expression analysis (but I have never worked with this type of data ever so this will be entirely new).