Hello,
I am new to bioinformatics and have been working with three Illumina RNA-seq datasets: two from a disease condition and one control (this low sample number makes analysis very hard). The end goal is to identify and/or confirm biomarkers for the disease.
I mapped the reads to the GRCh38 human reference genome with hisat2 and assembled the transcripts with stringtie, creating output files that I could feed into Ballgown for analysis. I have also done some analysis with DESeq2. I have several problems:
1) My assembled transcripts are all (I think) labeled with MSTRG.[#], a labelling convention that stringtie assigns to unknown transcripts. However, when I manually take some of these sequences and look at them in a genome viewer, they clearly match a gene. How do I get stringtie to actually map the gene names to the transcripts? Is this a problem with my reference files?
2) I have been unable to extract the fasta files from the gtf files that I have created. I tried gffread, which gave me errors, and I cannot get agat to download. How do I extract the fasta files?
3) How do I analyze only 3 datasets? Are there databases with other RNA-seq data that I could easily download to supplement my analysis? I am really trying to put together a pipeline for when we (hopefully) get 200 samples.
I mainly want to know if there are important steps that I am missing or if there is a better assembly and analysis pipeline that is more up-to-date for human transcriptome assembly. Tips about checking the quality of assembly would be fantastic too. Happy to provide more info; any advice/resources would be great!
For point 1), use an updated GTF annotation when assigning transcripts.
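Alongside the annotation version, it may be worth ruling out a naming mismatch: if the annotation uses sequence names like `1` while the alignments and assembly use `chr1` (or the reverse), reference transcripts cannot be matched to the assembled ones. Below is a minimal Python sketch of that check. The function names and the toy lines are made up for illustration; in practice you would pass open file handles for your own GTF files instead.

```python
def seqnames(gtf_lines):
    """Collect the sequence (chromosome) names from column 1 of a GTF."""
    names = set()
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(reference_lines, assembled_lines):
    """Report which sequence names are shared and which appear on one side only."""
    ref = seqnames(reference_lines)
    asm = seqnames(assembled_lines)
    return {
        "shared": sorted(ref & asm),
        "only_reference": sorted(ref - asm),
        "only_assembled": sorted(asm - ref),
    }


# Toy input (hypothetical lines, not taken from any real file).
reference_example = [
    "#!genome-build GRCh38\n",
    '1\tsource\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
]
assembled_example = [
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "MSTRG.1";\n',
]

report = compare_seqnames(reference_example, assembled_example)
print(report)
```

If `shared` comes back empty or very small for the real files, the annotation and the genome index probably come from sources with different naming schemes, and the fix is to use a matching pair rather than a newer file.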
I thought that the file I used was the most up to date. Here is the header of the file:
Is there a separate program that can assign the transcripts after assembly?
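As far as I know, when stringtie is given a reference annotation with `-G`, the `gene_id` stays `MSTRG.*`, but transcripts that match a reference transcript carry a `ref_gene_name` attribute in the output GTF. gffcompare is a separate tool that can compare an assembly against a reference annotation afterwards. A rough Python sketch for pulling a `MSTRG` to gene name table out of the GTF is below. The function names and the example lines (including `TX_A`, `GENE_A`) are placeholders, and the sketch assumes the usual `key "value";` attribute layout.

```python
import re
from collections import defaultdict

# Matches GTF attributes written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))


def gene_name_table(gtf_lines):
    """Map each gene_id to the reference gene names seen on its transcripts."""
    names = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Only transcript rows are inspected here.
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        name = attrs.get("ref_gene_name") or attrs.get("gene_name")
        if gene_id and name:
            names[gene_id].add(name)
    return {gene_id: sorted(found) for gene_id, found in names.items()}


# Toy input (hypothetical identifiers, for illustration only).
example = [
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "TX_A"; ref_gene_name "GENE_A";\n',
    'chr1\tStringTie\texon\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "TX_A"; exon_number "1";\n',
    'chr2\tStringTie\ttranscript\t50\t400\t1000\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";\n',
]

table = gene_name_table(example)
print(table)
```

Loci that never appear in the table have no transcript matching the reference. If that is true for every locus, the annotation was most likely not used during assembly, or it does not match the genome the reads were aligned to.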