8 months ago
markusz
Hello. I'm new to BioIT and I have a problem generating a genome index and counting gene expression. I know that it's because of naming differences between the FASTA and GTF files. How do I correct it? Below are sample lines, first from the FASTA file and then from the GTF file.
Sequence ID: ENSSSCT00000002339.4 cdna primary_assembly:Sscrofa11.1:AEMK02000555.1:34878:35168:1 gene:ENSSSCG00000035087.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene
Sequence: AAACAGCATGTGAATCAGAGCCACGAAGCCCTGAGCGTCCGAGAGGGAGACGGCTTGGTTCTCAACTGCAGTTACACCGATAGCGCTATTTACTTCCTTCAGTGGTTTAGGCAGTATCCTGGGAAAGGGCTTACTTCTCTGCTGTTAATTCAAGCGAACCAGGGAGAACAAATAAGTGGAAGAATTAAAGCCTCATTGGATAAATCGTCAAGAAACAGTGTTTTCTACATTGCAGCATCTCAGCCCAGCGACTCTGCCACCTACTTCTGTGCTGTGAGGCACAGTGCATGA
1 ensembl gene 226161299 226217308 . - . 'gene_id "ENSSSCG00000028996"; gene_version "4"; gene_name "ALDH1A1"; gene_source "ensembl"; gene_biotype "protein_coding";'
The columns in the GTF file are: seqname source feature start end score strand frame attribute
Thanks in advance for tips on how to repair those files!
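To make the naming difference concrete, here is a minimal Python sketch that pulls the sequence name out of a FASTA header and out of a GTF line. The function names (`fasta_header_id`, `parse_gtf_line`) are made up for illustration, and the sketch assumes the GTF line is tab-separated with nine columns, as in the files Ensembl distributes. With the sample lines above, the FASTA name is a transcript ID while the GTF `seqname` is a chromosome name, so the two can never match.

```python
import re

# The nine GTF columns, in order.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def fasta_header_id(header):
    """Return the sequence name: the first word after the optional '>'."""
    return header.lstrip(">").split()[0]


def parse_gtf_line(line):
    """Split one tab-separated GTF line into a dict of its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: key "value"; key "value";
    record["attributes"] = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))
    return record


header = ">ENSSSCT00000002339.4 cdna primary_assembly:Sscrofa11.1:AEMK02000555.1:34878:35168:1"
gtf_line = ('1\tensembl\tgene\t226161299\t226217308\t.\t-\t.\t'
            'gene_id "ENSSSCG00000028996"; gene_version "4"; gene_name "ALDH1A1";')

record = parse_gtf_line(gtf_line)
print(fasta_header_id(header))   # name an aligner would see in the FASTA
print(record["seqname"])         # name the GTF expects to find there
```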
If you use
Genome: https://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
Annotation: https://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz
then everything should match.
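If you want to check that the names match before building an index, a rough sketch like the following could help. It is only an illustration: the helper names (`open_text`, `fasta_names`, `gtf_names`, `compare_names`) are hypothetical, and it assumes a tab-separated GTF with `#` comment lines at the top.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def fasta_names(lines):
    """Collect the first word of every FASTA header line."""
    return {line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()}


def gtf_names(lines):
    """Collect column 1 (seqname) of every non-comment GTF line."""
    return {line.split("\t", 1)[0]
            for line in lines
            if line.strip() and not line.startswith("#")}


def compare_names(fasta_lines, gtf_lines):
    """Report which sequence names are shared and which appear on one side only."""
    fa = fasta_names(fasta_lines)
    gtf = gtf_names(gtf_lines)
    return {"shared": fa & gtf,
            "only_in_gtf": gtf - fa,
            "only_in_fasta": fa - gtf}


# Usage with real files (paths are whatever you downloaded):
# with open_text("Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz") as fa, \
#         open_text("Sus_scrofa.Sscrofa11.1.111.gtf.gz") as gtf:
#     report = compare_names(fa, gtf)
#     print(len(report["shared"]), sorted(report["only_in_gtf"])[:10])
```

An empty `only_in_gtf` set means every sequence the annotation refers to exists in the FASTA. A cDNA FASTA checked against the genome GTF would instead give an empty `shared` set.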
You should not be using a cDNA/transcriptome file with STAR. That is appropriate to use with a program like salmon.
So if I'm trying to work on gene expression, I should use salmon instead of STAR? Or is this full DNA file good for it? I'm sorry for the newbie questions, but I'm not really familiar with biology (I'm an IT guy doing things for my fiancée, who doesn't know anything about IT. So neither of us can help the other... ;/ )
If you start with the transcriptome then you should use salmon. This would be an easier option if you are not a biologist.
Otherwise use the genome and STAR along with the GTF file. You could count at the same time with STAR, or use the aligned file with a program like featureCounts plus the GTF to get the counts.
The expression analysis is done this way: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
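Since you come from IT, the genome route (STAR plus the GTF, counting during alignment) might be easier to follow as a small Python sketch that only builds the two command lines. The function names, paths, read length and thread count are placeholders I made up, not values from this thread. The STAR options themselves are standard ones, but check them against the manual for your STAR version. Note that STAR expects the genome FASTA and GTF to be uncompressed when generating the index.

```python
def star_index_cmd(genome_dir, genome_fasta, gtf, read_length=100, threads=4):
    """Build the STAR command that generates a genome index.

    genome_fasta and gtf must be uncompressed files (gunzip them first).
    --sjdbOverhang is conventionally set to read length minus one.
    """
    return ["STAR",
            "--runMode", "genomeGenerate",
            "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--genomeFastaFiles", genome_fasta,
            "--sjdbGTFfile", gtf,
            "--sjdbOverhang", str(read_length - 1)]


def star_align_cmd(genome_dir, fastq_files, out_prefix, threads=4):
    """Build the STAR command that aligns reads and counts reads per gene."""
    cmd = ["STAR",
           "--runThreadN", str(threads),
           "--genomeDir", genome_dir,
           "--readFilesIn", *fastq_files,
           "--outFileNamePrefix", out_prefix,
           "--outSAMtype", "BAM", "SortedByCoordinate",
           "--quantMode", "GeneCounts"]
    # Gzipped reads need a decompression command.
    if all(f.endswith(".gz") for f in fastq_files):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


# Placeholder paths, for illustration only.
index_cmd = star_index_cmd("star_index",
                           "Sus_scrofa.Sscrofa11.1.dna.toplevel.fa",
                           "Sus_scrofa.Sscrofa11.1.111.gtf")
align_cmd = star_align_cmd("star_index", ["sample1.fastq.gz"], "sample1_")
print(" ".join(index_cmd))
print(" ".join(align_cmd))

# To actually run them (requires STAR on PATH):
# import subprocess
# subprocess.run(index_cmd, check=True)
# subprocess.run(align_cmd, check=True)
```

With `--quantMode GeneCounts`, STAR writes a per-gene count table next to the alignment output, which is the kind of count matrix input the DESeq2 vignette describes.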
So, if I'm not mistaken (again, I'm sorry for being a newbie): should I map mRNA to DNA or to cDNA to count gene expression? To my limited knowledge, mapping mRNA to DNA may lead to falsely increased gene expression levels. So if I want to be accurate, should I use salmon to map mRNA to cDNA? I'm lost. Sorry.
If you are starting with fastq sequence data then you can use either method. I don't think you said what kind of data you have.
it reminds me ...
That's actually really accurate. Unless it works out well. Then I'll be stuck with it forever... I think I should fail at this... For this forum's sake. Every second post will be mine when I'm given a serious task.