Question

RNASeq featureCounts Help Needed

0

Entering edit mode

24 days ago

SomeOne ▴ 170

HI.

Species: Fungal samples

Background

I have generated a Nanopore + Illumina based genome assembly and annotated it via Funannotate using RNASeq data. Now for some of the rna samples i have to do DEG analysis. I want to identify the genes Up/Down regulated under infection conditions.

What I ahve done so far

Converted my annotation.gff3 to annotation.gtf using agat_convert_sp_gff2gtf.pl via command
```
agat_convert_sp_gff2gtf.pl --gff $Genome_gff3 -o ${Genome_gff3}.gtf
```

QC-cleaned mRNA reads mapped to Ref.fa using Hisat2 via commands

hisat2_extract_splice_sites.py ${Genome_gff3}.gtf > ${Genome_gff3}.splicesites.tsv
hisat2_extract_exons.py ${Genome_gff3}.gtf > ${Genome_gff3}.exons.tsv
# Build-Index
hisat2-build -p $threads --ss ${Genome_gff3}.splicesites.tsv --exon ${Genome_gff3}.exons.tsv $Genome_fasta $Genome_fasta
# Alignment
hisat2 --threads $threads --time --new-summary -x $Genome_fasta -1 $input_reads_F -2 $input_reads_R -S $Alignment_dir/${input_reads_name}-vs-${ref_name}.sam
# Sort SAM and converted to BAM
samtools sort -@ 16 -o $Alignment_dir/${input_reads_name}-vs-${ref_name}.sorted.bam $Alignment_dir/${input_reads_name}-vs-${ref_name}.sam

Now i have .BAM files for both Control and Wild-type and my annotation.gtf file has following features in column-3

CDS
exon
five_prime_UTR
gene
mRNA
three_prime_UTR
tRNA

and head of file looks like this

##gtf-version X
# GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.
NP02_scf_1      funannotate     gene    5051    8878    .       -       .       gene_id "NP02_000001"; ID "NP02_000001";
NP02_scf_1      funannotate     mRNA    5051    8878    .       -       .       gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; Dbxref "PFAM:PF00270" "InterPro:IPR011545" "PFAM:PF00271" "InterPro:IPR001650" "InterPro:IPR027417" "InterPro:IPR014001"; ID "NP02_000001-T1"; Ontology_term "GO:0005524" "GO:0003676" "GO:0009378" "GO:0043138" "GO:0000724" "GO:0006281" "GO:0032508" "GO:0006310" "GO:0006268" "GO:0005694" "GO:0005737"; Parent "NP02_000001"; note "EggNog:ENOG503Q15N" "COG:A"; product "hypothetical protein";
NP02_scf_1      funannotate     exon    5051    8878    .       -       .       gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; ID "NP02_000001-T1.exon1"; Parent "NP02_000001-T1";
NP02_scf_1      funannotate     CDS     5051    8878    .       -       0       gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; ID "NP02_000001-T1.cds1"; Parent "NP02_000001-T1";

FIRST

i want to run featureCounts to generate count matrix. but i am getting confused on the -t flag of featureCounts. should i use -t gene or -t exon or -t mRNA

i am unable to understand the difference here.

Second

when i run featureCounts via this command

featureCounts -a ${Genome_gff3}.gtf -G $Genome_fasta -o $CountMatrix_dir/${input_reads_name}-vs-${ref_name}_featureCounts.tsv \
    $Alignment_dir/${input_reads_name}-vs-${ref_name}.sorted.bam \
    -F GTF -t mRNA -g gene_id -s 0 -p --countReadPairs -B -C --verbose \
    -T $threads 2>&1 | tee -a $CountMatrix_dir/${input_reads_name}-vs-${ref_name}_featureCounts.tsv.summary.txt

i get the following error

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id' 
An example of attributes included in your GTF annotation is 'gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; Dbxref "PFAM:PF00270" "InterPro:IPR011545" "PFAM:PF00271" "InterPro:IPR001650" "InterPro:IPR027417" "InterPro:IPR014001"; ID "NP02_000001-T1"; Ontology_term "GO:0005524" "GO:0003676" "GO:0009378" "GO:0043138" "GO:0000724" "GO:0006281" "GO:0032508" "GO:0006310" "GO:0006268" "GO:0005694" "GO:0005737"; Parent "NP02_000001"; note "EggNog:ENOG503Q15N" "COG:A"; product "hypothetical protein";'.

it is not able to grab the gene_id. i guess it might be because of the extra stuff in mRNA lines after the gene_id "NP02_000001"; transcript_id "NP02_000001-T1"; part. which i am not able to remove.

any possible way to remove it via any tools or anything else ?

My major concern is the -t flag in featureCounts. kindly help me understand what should be used here.

Thank YOu

fungi featurecounts RNASeq • 467 views

ADD COMMENT • link 24 days ago by SomeOne ▴ 170

1

Entering edit mode

Now for some of the rna samples i have to do DEG analysis.

Can you clarify if the data you are using is short or long read (actual RNAseq). If it is long read then you should be using a workflow such as https://github.com/epi2me-labs/wf-transcriptomes

it is not able to grab the gene_id.

The chromosome names are NP02_scf_1 (what is in your GTF file). Do they match what is in the reference sequence file?

but i am getting confused on the -t flag of featureCounts.

For most applications people want to consider the reads at the feature level (e.g. exon) but then summarize the counts at the gene level (meta feature), so you have one number per gene. Are you interested in doing differential transcript level analysis or just gene level.

ADD REPLY • link 24 days ago by GenoMax 151k

0

Entering edit mode

Hi. GenoMax thank you for responding.

mRNA data is short read illumina.
NP02_scf_1 is scaffold name in my genome.fasta and genome.gtf and .gff files.
I want to fo gene level DEG analysis. If would be helpful to know the options to be used for transcript level analysis.

Thank you.

ADD REPLY • link 24 days ago by SomeOne ▴ 170

1

Entering edit mode

See answers here for question #3: https://support.bioconductor.org/p/78975/ and also https://support.bioconductor.org/p/9147887/#9147987

You may want to use salmon as well, in case you want to do transcriptome based analysis.