HI.
Species: Fungal samples
Background
I have generated a Nanopore + Illumina based genome assembly and annotated it via Funannotate using RNASeq data. Now for some of the rna samples i have to do DEG analysis. I want to identify the genes Up/Down regulated under infection conditions.
What I ahve done so far
Converted my
annotation.gff3
toannotation.gtf
usingagat_convert_sp_gff2gtf.pl
via commandagat_convert_sp_gff2gtf.pl --gff $Genome_gff3 -o ${Genome_gff3}.gtf
QC-cleaned mRNA reads mapped to Ref.fa using Hisat2 via commands
hisat2_extract_splice_sites.py ${Genome_gff3}.gtf > ${Genome_gff3}.splicesites.tsv hisat2_extract_exons.py ${Genome_gff3}.gtf > ${Genome_gff3}.exons.tsv # Build-Index hisat2-build -p $threads --ss ${Genome_gff3}.splicesites.tsv --exon ${Genome_gff3}.exons.tsv $Genome_fasta $Genome_fasta # Alignment hisat2 --threads $threads --time --new-summary -x $Genome_fasta -1 $input_reads_F -2 $input_reads_R -S $Alignment_dir/${input_reads_name}-vs-${ref_name}.sam # Sort SAM and converted to BAM samtools sort -@ 16 -o $Alignment_dir/${input_reads_name}-vs-${ref_name}.sorted.bam $Alignment_dir/${input_reads_name}-vs-${ref_name}.sam
Now i have .BAM files for both Control
and Wild-type
and my annotation.gtf file has following features in column-3
CDS
exon
five_prime_UTR
gene
mRNA
three_prime_UTR
tRNA
and head of file looks like this
##gtf-version X
# GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.
NP02_scf_1 funannotate gene 5051 8878 . - . gene_id "NP02_000001"; ID "NP02_000001";
NP02_scf_1 funannotate mRNA 5051 8878 . - . gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; Dbxref "PFAM:PF00270" "InterPro:IPR011545" "PFAM:PF00271" "InterPro:IPR001650" "InterPro:IPR027417" "InterPro:IPR014001"; ID "NP02_000001-T1"; Ontology_term "GO:0005524" "GO:0003676" "GO:0009378" "GO:0043138" "GO:0000724" "GO:0006281" "GO:0032508" "GO:0006310" "GO:0006268" "GO:0005694" "GO:0005737"; Parent "NP02_000001"; note "EggNog:ENOG503Q15N" "COG:A"; product "hypothetical protein";
NP02_scf_1 funannotate exon 5051 8878 . - . gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; ID "NP02_000001-T1.exon1"; Parent "NP02_000001-T1";
NP02_scf_1 funannotate CDS 5051 8878 . - 0 gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; ID "NP02_000001-T1.cds1"; Parent "NP02_000001-T1";
FIRST
i want to run featureCounts
to generate count matrix. but i am getting confused on the -t
flag of featureCounts. should i use -t gene
or -t exon
or -t mRNA
i am unable to understand the difference here.
Second
when i run featureCounts via this command
featureCounts -a ${Genome_gff3}.gtf -G $Genome_fasta -o $CountMatrix_dir/${input_reads_name}-vs-${ref_name}_featureCounts.tsv \
$Alignment_dir/${input_reads_name}-vs-${ref_name}.sorted.bam \
-F GTF -t mRNA -g gene_id -s 0 -p --countReadPairs -B -C --verbose \
-T $threads 2>&1 | tee -a $CountMatrix_dir/${input_reads_name}-vs-${ref_name}_featureCounts.tsv.summary.txt
i get the following error
ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; Dbxref "PFAM:PF00270" "InterPro:IPR011545" "PFAM:PF00271" "InterPro:IPR001650" "InterPro:IPR027417" "InterPro:IPR014001"; ID "NP02_000001-T1"; Ontology_term "GO:0005524" "GO:0003676" "GO:0009378" "GO:0043138" "GO:0000724" "GO:0006281" "GO:0032508" "GO:0006310" "GO:0006268" "GO:0005694" "GO:0005737"; Parent "NP02_000001"; note "EggNog:ENOG503Q15N" "COG:A"; product "hypothetical protein";'.
it is not able to grab the gene_id
.
i guess it might be because of the extra stuff in mRNA lines after the gene_id "NP02_000001"; transcript_id "NP02_000001-T1";
part. which i am not able to remove.
any possible way to remove it via any tools or anything else ?
My major concern is the -t flag in featureCounts. kindly help me understand what should be used here.
Thank YOu
Can you clarify if the data you are using is short or long read (actual RNAseq). If it is long read then you should be using a workflow such as https://github.com/epi2me-labs/wf-transcriptomes
The chromosome names are
NP02_scf_1
(what is in your GTF file). Do they match what is in the reference sequence file?For most applications people want to consider the reads at the feature level (e.g. exon) but then summarize the counts at the
gene
level (meta feature), so you have one number per gene. Are you interested in doing differential transcript level analysis or just gene level.Hi. GenoMax thank you for responding.
Thank you.
See answers here for question #3: https://support.bioconductor.org/p/78975/ and also https://support.bioconductor.org/p/9147887/#9147987
You may want to use
salmon
as well, in case you want to do transcriptome based analysis.It was relly helpfull. Thank You.