I'm new to bioinformatics and I've been tasked with performing analysis of RNA-seq data. Essentially I have lots of Illumina paired reads from different samples that I want to map to a reference and carry out differential expression analysis with.
Since the species in question is a non-model species, there is no sequenced, annotated genome available. I also cannot assemble a transcriptome de novo.
Instead, I have a set of several thousand 'contigs' assembled from thousands of 'expressed sequence tags' (ESTs). These contigs have been annotated in so far that each one has been given a (probable) biological function. For example:
Contig ID : Contig0001
Contig Description : rrm-containing protein
Contig Length : 1235
1:ATACAGCTTGGAAAATTAAATACCTTTGCACTCTGTCTGATCACCTTCACAGCCTTGCTT
61:AGATTCCTTTTCTCTTCTCTATTTCTTCTTCTTCTTTTTTTGACATGGAAGAGGAAGAGC
[...]
1201:TTGTAGACATCCATTTTGTATACTCGGAATTTCTA
and
Contig ID : Contig0002
Contig Description : chaperone protein dnaj 16-like isoform x1
Contig Length : 948
1:TTTCTATTTTGCCTTCGATTAATTTTCATCTTTCAATAAGTTTTACTTTAATTTCTTCCG
61:ACTTCATTTTATTCACGCAACATTTCACCATTGAGTTTGCCAACTGAAGAGAACCGTAGC
[...]
901: CAGACAGCTTCGATGACAGTCAATAAGGATCCAGATGCGACATTTTTC
1) What exactly could these contigs represent? I initially thought they were complete transcripts, but I'm now thinking they only form part of transcripts?
2) How would I go about mapping my reads to these contigs? Is that even possible? It looks like the file isn't in fasta format - would a more suitable format look something like this?
>Contig0001|rrm-containing_protein|1235
ATACAGCTTGGAAAATTAAATACCTTTGCACTCTGTCTGATCACCTTCACAGCCTTGCTT
AGATTCCTTTTCTCTTCTCTATTTCTTCTTCTTCTTTTTTTGACATGGAAGAGGAAGAGC
[...]
TTGTAGACATCCATTTTGTATACTCGGAATTTCTA
3) If I did manage to map my reads and perform quantification - what exactly am I quantifying? Would it be transcript-level quantification or gene-level quantification?
Any help would be greatly appreciated! Let me know if you need more info.
I disagree. You don't need transcript length information to perform differential expression analysis with DESeq or edgeR and similar raw-counts based methods. In addition, if you really want to compute FPKM, contigs length can serve as a proxy for transcript length: If the true transcript is longer than the annotated contig, then the reads corresponding to the missing part will not map, thus the transcript true length is irrelevant here.
This may be a silly question - but how would I compute the counts using a tool like HTSeq. I understand how I would go about doing it if a well annotated genome was available, but that isn't my case. HTSeq requires an annotation file (GFF) but I don't have that - is it feasible to build my own annotation file with the data/info that I've been given? Or should I explore using an alignment-free method such as salmon?
Edit: in this post, Devon Ryan suggests filtering out multi mappers and then just counting reads per contig using something like samtools idxstats. This could be inaccurate with a de novo assembly since you'd likely be throwing away a lot of reads, however with my reference I'm getting ~0.03% multi mappers - does it seem reasonable to just chuck these away?
Found the corresponding paper here.
Its possible, you could do something like : seqname='contig name', start =1; end ='Contig Length'; strand='+'.But Devon's suggestion is much easier. Just be careful to differentiate between reads originating from the forward and reverse strand (if you have a strand-specific dataset).
That's the alternative. Both are ok IMO.
Sure.