I'm trying to use STAR and there is a problem...
8 months ago
markusz ▴ 10

Hello. I'm new to BioIT and I have a problem with generating a genome index or counting gene expression. I know that it's because of naming differences between the FASTA and GTF files. How do I correct it? Below are sample lines, first from the FASTA file and then from the GTF file.

Sequence ID: ENSSSCT00000002339.4 cdna primary_assembly:Sscrofa11.1:AEMK02000555.1:34878:35168:1 gene:ENSSSCG00000035087.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene
Sequence: AAACAGCATGTGAATCAGAGCCACGAAGCCCTGAGCGTCCGAGAGGGAGACGGCTTGGTTCTCAACTGCAGTTACACCGATAGCGCTATTTACTTCCTTCAGTGGTTTAGGCAGTATCCTGGGAAAGGGCTTACTTCTCTGCTGTTAATTCAAGCGAACCAGGGAGAACAAATAAGTGGAAGAATTAAAGCCTCATTGGATAAATCGTCAAGAAACAGTGTTTTCTACATTGCAGCATCTCAGCCCAGCGACTCTGCCACCTACTTCTGTGCTGTGAGGCACAGTGCATGA



1   ensembl     gene    226161299   226217308   .   -   .   'gene_id "ENSSSCG00000028996"; gene_version "4"; gene_name "ALDH1A1"; gene_source "ensembl"; gene_biotype "protein_coding";'

The columns in the GTF file are: seqname, source, feature, start, end, score, strand, frame, attribute.

Thanks in advance for tips on how to repair those files!

STAR • 814 views
ADD COMMENT

If you use

Genome: https://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
Annotation: https://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz

then everything should match.
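One way to confirm that the names match is to compare the FASTA header names with the first column (seqname) of the GTF. This is only a minimal sketch: the function names and file names are made up for illustration, and both files are assumed to be uncompressed already (e.g. with gunzip).

```shell
# Sketch: report GTF sequence names that do not occur in the genome FASTA.
# Function names and file paths are hypothetical; inputs must be uncompressed.

# Print FASTA sequence names: the header text after ">" up to the first whitespace.
fasta_names() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort -u
}

# Print the distinct values of GTF column 1 (seqname), skipping "#" comment lines.
gtf_names() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# Print every GTF seqname that has no identically named FASTA sequence.
missing_from_fasta() {
    names_file=$(mktemp)
    fasta_names "$1" > "$names_file"
    # -x: whole-line match, -F: fixed strings, -v: keep names NOT in the FASTA list
    gtf_names "$2" | grep -vxF -f "$names_file"
    rm -f "$names_file"
}
```

Calling `missing_from_fasta genome.fa annotation.gtf` prints nothing when every seqname in the GTF has a sequence of the same name in the FASTA. With a cDNA FASTA like the one in the question, it would print names such as `1`, because the FASTA headers there are transcript IDs rather than chromosome names.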

You should not be using a cDNA/transcriptome FASTA with STAR. A transcriptome file is appropriate for a program like salmon.
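For the genome route, index generation with STAR could look roughly like the sketch below. The function name, paths, thread count and the `--sjdbOverhang` value are placeholders, not something taken from the question; the overhang is usually set to read length minus 1. STAR expects plain-text (not gzipped) FASTA here, so the downloads need to be decompressed first.

```shell
# Sketch: build a STAR genome index from a genome FASTA and the matching GTF.
# build_star_index is a hypothetical helper; all values are placeholders.
build_star_index() {
    genome_fa=$1    # uncompressed genome FASTA, e.g. the dna.toplevel file
    gtf=$2          # uncompressed GTF from the same Ensembl release
    index_dir=$3    # output directory for the index

    mkdir -p "$index_dir"
    STAR --runMode genomeGenerate \
         --runThreadN 8 \
         --genomeDir "$index_dir" \
         --genomeFastaFiles "$genome_fa" \
         --sjdbGTFfile "$gtf" \
         --sjdbOverhang 100
}
```

Usage would be something like `build_star_index Sus_scrofa.Sscrofa11.1.dna.toplevel.fa Sus_scrofa.Sscrofa11.1.111.gtf star_index`, after running gunzip on both files.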

ADD REPLY

So if I'm trying to work on gene expression, should I use salmon instead of STAR? Or is this full DNA file good for it? I'm sorry for the newbie questions, but I'm not really familiar with biology (I'm an IT guy doing things for my fiancée, who doesn't know anything about IT, so neither of us can help the other... ;/ )

ADD REPLY

If you start with the transcriptome then you should use salmon. This would be an easier option if you are not a biologist.
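A rough sketch of the salmon route, assuming paired-end FASTQ files (the thread does not say what kind of data is available; for single-end reads `-r` replaces `-1`/`-2`). The function name, file names and thread count are placeholders.

```shell
# Sketch: index a transcriptome (cDNA FASTA) and quantify paired-end reads with salmon.
# salmon_quantify is a hypothetical helper; paired-end input is an assumption.
salmon_quantify() {
    cdna_fa=$1      # transcriptome FASTA (the cDNA file)
    index_dir=$2    # where the salmon index is written
    reads_1=$3      # FASTQ with first mates
    reads_2=$4      # FASTQ with second mates
    out_dir=$5      # output directory for this sample

    # Build the index once per transcriptome.
    salmon index -t "$cdna_fa" -i "$index_dir"
    # -l A asks salmon to detect the library type automatically.
    salmon quant -i "$index_dir" -l A -1 "$reads_1" -2 "$reads_2" -p 8 -o "$out_dir"
}
```

The per-transcript estimates end up in `quant.sf` inside the output directory. In practice the index step only needs to run once, so it may make sense to split it out when processing several samples.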

Otherwise use the genome and STAR along with the GTF file. STAR can count reads per gene while it aligns. Alternatively, pass the aligned file and the GTF to a program like featureCounts to get the counts.
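Both counting options could be sketched as follows, again assuming gzipped paired-end FASTQ files and an existing STAR index. Function names, paths and thread counts are placeholders.

```shell
# Sketch: align with STAR and count reads per gene, either in STAR itself or with featureCounts.
# Both helper functions are hypothetical; paired-end gzipped FASTQ input is an assumption.

# Option 1: STAR aligns and writes per-gene counts in the same run.
star_align_and_count() {
    index_dir=$1; reads_1=$2; reads_2=$3; prefix=$4
    STAR --runThreadN 8 \
         --genomeDir "$index_dir" \
         --readFilesIn "$reads_1" "$reads_2" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode GeneCounts \
         --outFileNamePrefix "$prefix"
}

# Option 2: count from an existing BAM file with featureCounts and the same GTF.
# -p marks the data as paired-end; recent Subread releases also expect
# --countReadPairs to count fragments instead of single reads (check your version).
count_with_featurecounts() {
    gtf=$1; bam=$2; out=$3
    featureCounts -T 8 -p -a "$gtf" -o "$out" "$bam"
}
```

With a prefix such as `sample1_`, STAR writes `sample1_Aligned.sortedByCoord.out.bam` and `sample1_ReadsPerGene.out.tab`. In the latter, column 1 is the gene ID and columns 2 to 4 hold counts for unstranded and the two stranded protocols, so the right column depends on how the library was prepared.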

The expression analysis itself is then done as described in the DESeq2 vignette: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

ADD REPLY

So, if I'm not mistaken (again, sorry for being a newbie): should I map mRNA to DNA or to cDNA to quantify gene expression? To my limited knowledge, mapping mRNA to DNA may lead to falsely inflated gene expression levels. So if I want to be accurate, should I use salmon to map mRNA to cDNA? I'm lost. Sorry.

ADD REPLY

If you are starting with fastq sequence data then you can use either method. I don't think you said what kind of data you have.

ADD REPLY

it reminds me ...

ADD REPLY

That's actually really accurate. Unless it works out well. Then I'll be stuck with it forever... I think I should fail at this... for this forum's sake. Every second post will be mine when I'm given a serious task.

ADD REPLY
8 months ago
sed 's/^1\t/ENSSSCT00000002339.4\t/' in.gtf > out.gtf
ADD COMMENT
