Using featureCounts on HISTA2 aligned data for the tomato genome
1
0
Entering edit mode
3 months ago

Hello there.

I am new with Bioinformatics and I am currently working on quantifying my genes in the RNA data that I received. My workflow is as follows: -Illumina sequenced data was received -FastQC was done to determine quality -Trimmomatic was used to trim my data -HISAT2 was used to allign my data to a genome, paired ended data Now I am at the featurecounts step, where I am currently getting 0% alignment.

Here is some information on what I have done so far:

$ featureCounts -p -O -T 4 --countReadPairs -a /home/antzeltheron/RNAseq_pipeline/HISAT2/ITAG4.0_gene_models.gtf -o /home/antzeltheron/RNAseq_pipeline/quants/ResC58_15_Rep1_R1_featurecounts.txt /home/antzeltheron/RNAseq_pipeline/HISAT2/ResC58_15_Rep1.sam

This gives me the following results:

enter image description here

As you can see, I have no successful alignments

Here are the HISAT2 results of 2 samples I ran:

enter image description here

For the HISAT run, the genome data I used was ITAG4.0_cDNA.fasta, which I used to build my index (hisat2_build command) which then created 8 files of .ht2 format.

Here is the SAM file after the HISAT2 run (only a few first lines):

enter image description here

And the GTF file's first 20 lines (ITAG4.0_gene-models.gtf): enter image description here

HISAT2 genome featurecounts alignment • 564 views
ADD COMMENT
0
Entering edit mode

Maybe you went over the character limit and lost the HISAT2 results. Could you please provide them?

ADD REPLY
0
Entering edit mode

Hi, so sorry about that. The post was submitted before I was done

ADD REPLY
0
Entering edit mode
3 months ago
GenoMax 147k

You are aligning to tomato genes/transcripts (looking at the length of the sequences in your SAM header) where as your GTF file is for the genome. In order to use featureCounts your reference genome names need to match. You will need to align to the genome in order to use the GTF. If you want to align to/quantify against transcriptome then use a program like salmon.

ADD COMMENT
0
Entering edit mode

So if I understand you correctly, I didn't align to the genome when I first ran my HISAT2, but rather I aligned them to genes? Then, if I want to align to the genome, I will have to rerun my Hisat2 with the genome?

ADD REPLY
1
Entering edit mode

That looks like that is what happened. When you align you should use a genome reference that will match the reference names in the GTF file you are planning to use i.e. obtain your genome and annotation from the same source.

ADD REPLY
0
Entering edit mode

I used the following link to get my gtf file

https://solgenomics.net/ftp/tomato_genome/

However, I can't seem to find a tomato genome with it's corresponding gtf file. For HISAT2, I had to get the genome indices, which I then used to build my own index for HISAT2.

There is also this link https://data.jgi.doe.gov/refine-download/phytozome?organism=Slycopersicum&expanded=796&_gl=1*4pu9z2*_ga*MTExNDc1NzUxOC4xNzIxMTIyMDU4*_ga_YBLMHYR3C2*MTcyMzEwNDA3OS42LjEuMTcyMzEwNDExMS4wLjAuMA but I don't know which to use for HISAT2 and which to use for featurecounts

ADD REPLY
0
Entering edit mode

However, I can't seem to find a tomato genome with it's corresponding gtf file.

Should be here: https://solgenomics.net/ftp/tomato_genome/assembly/current_build/. This is the genome https://solgenomics.net/ftp/tomato_genome/assembly/current_build/S_lycopersicum_chromosomes.4.00.fa

Check the reference names to be sure. Should be SL4*.

ADD REPLY
0
Entering edit mode

Thank you. If you don't mind me asking, can you please check the second link I sent in my previous reply, they have a more recent release of the tomato genome, I just need to know which files to use for which program please.

ADD REPLY
0
Entering edit mode

Slycopersicum_796_ITAG5.0.fa.gz should be the genome (I don't know what is the difference between ITAG and SL5, but there is corresponding SL5 version).

Slycopersicum_796_ITAG5.0.gene_exons.gff3.gz should be the annotation since it appears to be larger than Slycopersicum_796_ITAG5.0.gene.gff3.gz. More details may be in Slycopersicum_796_ITAG5.0.annotation_info.txt (which requires an account to view, which I assume you have).

ADD REPLY
0
Entering edit mode

Slycopersicum_796_ITAG5.0.cds.fa.gz and Slycopersicum_796_ITAG5.0.cds_primaryTranscriptOnly.fa.gz Nucleotide FASTA format file of all gene coding sequences, with or without alternative splice variants

Slycopersicum_796_SL5.0.fa.gz Nucleotide FASTA format of the current genomic assembly

Slycopersicum_796_ITAG5.0.gene_exons.gff3.gz GFF3 format representation of all mRNA sequences as above, but with exon subfeatures. Genomic coordinates are relative to the reference sequence in column 1

Slycopersicum_796_ITAG5.0.gene.gff3.gz GFF3 format representation of all mRNA sequences (UTR, CDS). Genomic coordinates are relative to the reference sequence in column 1

This is the information on the files.

So seeing that SL is the genome, I will use it for HISAT2, and then the corresponding ITAG5.0 gene_exons.gff3 for featurecounts?

ADD REPLY

Login before adding your answer.

Traffic: 2552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6