Best pipeline for RNAseq assembly and analysis (or help with stringtie assembly)
2.4 years ago
Katherine • 0

Hello,

I am new to bioinformatics and am working with three Illumina RNA-seq datasets: two from a disease condition and one control (this low sample number makes analysis very hard). The end goal is to identify and/or confirm biomarkers for the disease.

I aligned the reads to the GRCh38 human reference genome with hisat2 and assembled the transcripts with stringtie, creating output files that I could feed into Ballgown for analysis. I have also done some analysis with DESeq2. I have several problems:

  1. My assembled transcripts are all (I think) labeled MSTRG.[#], the ID prefix that stringtie assigns to transcripts it treats as unknown. However, when I take some of these sequences and look at them in a genome viewer, they clearly match a known gene. How do I get stringtie to assign gene names to the transcripts? Is this a problem with my reference files?

  2. I have been unable to extract FASTA sequences from the GTF files that I created. I tried gffread, which gave me errors, and I cannot get agat to download. How do I extract the FASTA files?

  3. How do I analyze only 3 datasets? Are there databases with other RNAseq data that I could easily download to supplement my analysis? What I am really trying to do is put together a pipeline for when we get, hopefully, 200 samples.
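
For point 2, the logic behind pulling transcript sequences out of a GTF can be sketched in plain Python: group the exon records by transcript_id, slice each exon out of the genome, join them in coordinate order, and reverse-complement transcripts on the minus strand. This is only a minimal illustration on a toy in-memory genome and toy GTF lines; the function names (`parse_exons`, `transcript_sequences`) are hypothetical, and it assumes the sequence names in the GTF match those in the genome FASTA.

```python
import re

# Translation table used to complement DNA bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_exons(gtf_lines):
    """Group exon intervals by transcript_id.

    Returns {transcript_id: (seqid, strand, [(start, end), ...])}
    with 1-based inclusive coordinates, as in GTF.
    """
    transcripts = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        entry = transcripts.setdefault(match.group(1), (fields[0], fields[6], []))
        entry[2].append((int(fields[3]), int(fields[4])))
    return transcripts


def transcript_sequences(genome, transcripts):
    """Build spliced transcript sequences from a {seqid: sequence} genome."""
    out = {}
    for tid, (seqid, strand, exons) in transcripts.items():
        if seqid not in genome:
            continue  # sequence name not found in the genome (naming mismatch)
        chrom = genome[seqid]
        # Convert 1-based inclusive GTF coordinates to Python slices.
        seq = "".join(chrom[start - 1:end] for start, end in sorted(exons))
        out[tid] = reverse_complement(seq) if strand == "-" else seq
    return out


# Toy data (hypothetical), standing in for a genome FASTA and a stringtie GTF.
genome = {"chr1": "AAACCCGGGTTTACGT"}
gtf = [
    'chr1\tStringTie\texon\t1\t3\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    'chr1\tStringTie\texon\t7\t9\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    'chr1\tStringTie\texon\t4\t6\t.\t-\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]

sequences = transcript_sequences(genome, parse_exons(gtf))
for tid, seq in sequences.items():
    print(f">{tid}\n{seq}")
```

A transcript whose sequence name is missing from the genome is silently skipped here, which is also one common reason real extraction tools report errors.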

I mainly want to know if there are important steps that I am missing or if there is a better assembly and analysis pipeline that is more up-to-date for human transcriptome assembly. Tips about checking the quality of assembly would be fantastic too. Happy to provide more info; any advice/resources would be great!

human assembly transcriptome • 966 views

For point 1), use an up-to-date GTF annotation when assigning transcripts.
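
When stringtie is given a reference annotation, transcripts that match a reference transcript usually keep an MSTRG gene_id but carry extra attributes such as reference_id, ref_gene_id and ref_gene_name. A minimal sketch, assuming those attributes are present in the assembled GTF, that builds an MSTRG-to-gene-name lookup; the function name `gene_name_map` and the toy lines are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_name_map(gtf_lines):
    """Map each gene_id to the reference gene names seen on its transcripts."""
    names = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        # Fall back to ref_gene_id when no gene name was carried over.
        ref_name = attrs.get("ref_gene_name") or attrs.get("ref_gene_id")
        if gene_id and ref_name:
            names[gene_id].add(ref_name)
    return {gene: sorted(found) for gene, found in names.items()}


# Toy lines (hypothetical): one transcript matching a reference, one novel.
assembled = [
    'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; reference_id "rna-EXAMPLE"; '
    'ref_gene_id "gene-GENE_A"; ref_gene_name "GENE_A";',
    'chr1\tStringTie\ttranscript\t2000\t2500\t.\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]

mapping = gene_name_map(assembled)
print(mapping)
```

If the lookup comes back empty on real output, none of the assembled transcripts were matched to the annotation, which points at the reference files rather than at stringtie's naming.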


I thought that the file I used was the most up to date. Here is the header of the file:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-source NCBI Homo sapiens Annotation Release 110
##sequence-region NC_000001.11 1 248956422
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606

Is there a separate program that can assign the transcripts after assembly?
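
The header above uses RefSeq-style sequence names (NC_000001.11). One thing that may be worth checking, as an assumption about this setup rather than a diagnosis, is whether the genome index used for hisat2 names its chromosomes the same way. If the index uses names like 1 or chr1, no reference transcript can overlap the alignments and every assembled transcript would look novel. A minimal sketch that compares the sequence names in two annotation-style files; the function name `seqids` and the toy lines are hypothetical.

```python
def seqids(lines):
    """Collect the first-column sequence names from GTF/GFF lines."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, headers and comments
        ids.add(line.split("\t", 1)[0])
    return ids


# Toy lines (hypothetical) standing in for the reference GFF and the stringtie GTF.
reference_gff = [
    "##gff-version 3",
    "NC_000001.11\tBestRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene-EXAMPLE",
]
assembled_gtf = [
    'chr1\tStringTie\ttranscript\t120\t880\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
]

shared = seqids(reference_gff) & seqids(assembled_gtf)
if not shared:
    print("No sequence names in common between annotation and assembly.")
else:
    print(f"{len(shared)} sequence names in common.")
```

On real files the same function can be fed open file handles, since it only iterates over lines.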
