Hello,
I am trying to align and mapping the RNA-seq FASTQ file (1 file) using STAR method for the first time in my life.
The genome file and gtf file for reference are about Apodemus agrarius.
The commands are as below in Linux.
# Apodemus agrarius genome download
genome_url = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_964023405.1/download?include_annotation_type=GENOME_FASTA&filename=GCA_964023405.1.zip"
response = requests.get(genome_url)
with open("genome.zip", "wb") as f:
f.write(response.content)
# Unzipping the genome
!unzip -o genome.zip -d genome
# file move and name change
!mv genome/ncbi_dataset/data/GCA_964023405.1/*_genomic.fna genome/Apodemus_agrarius.fna
# Apodemus agrarius reference gtf file download
annotation_url = "https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/geneset/2024_05/Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz"
annotation_path = os.path.join(work_dir, "Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz")
!wget -O {annotation_path} {annotation_url}
!gunzip {annotation_path}
fastq_dir = os.path.join(output_dir, 'fastq')
fastq_files = {
'SRR9990590_1.fastq.gz': 'sample_R1.fastq.gz',
'SRR9990590_2.fastq.gz': 'sample_R2.fastq.gz'
}
gtf_file = annotation_path.replace('.gz', '') # Unzipped GTF file path
!STAR --runMode genomeGenerate \
--genomeDir genome \
--genomeFastaFiles genome/Apodemus_agrarius.fna \
--sjdbGTFfile {gtf_file} \
--runThreadN 2 \
--genomeSAindexNbases 12 \
--genomeChrBinNbits 18 \
--sjdbOverhang 100
# RNA-seq mapping (Using GTF)
!STAR --genomeDir genome \
--readFilesIn fastq/sample_R1.fastq fastq/sample_R2.fastq \
--runThreadN 2 \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile {gtf_file} \
--outFileNamePrefix output_ \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--genomeSAsparseD 2 \
--limitBAMsortRAM 60000000000
However, the results are as below
Only 37 sequences were shown in the result tab file ("output_ReadsPerGene.out.tab")
Can you guess why that happened ? The very weird thing is when I see the gtf file in my R. The file describes as 37 sequences, although the when the gtf file is opened in excel, it shows 700000 rows of information.
<- Is this related to the problem ?
- I can't get it why 37 only...
The reults files are in this link. https://drive.google.com/drive/folders/1idiF6hoOX8SDQfhlGaIVH1RWtoRUAP7L?usp=drive_link[file link]3
Any suggestions for the code revision for solving the problem is very welcomed !!
Any suggestion will be much appreciated.
Never mix and match sequence/annotation sources.
Get genome from Ensembl as well and use that: https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/genome/Apodemus_agrarius-GCA_964023405.1-unmasked.fa.gz
Thank you so much! After using the genome_unmasked file from Ensembl that you mentioned, I was able to get results for around 25,000 genes. However, I noticed that there are still quite a few unmapped reads, even though the log file indicates that the number of unmapped reads isn't that high (I'm not sure why there's a discrepancy).
Here are the log file results:ReadsPegGene.out.tab
Here are the log file results:
Do you think I need some pre-process steps (trimming ?) before this two STAR steps ?
QC control steps were used in other alignment method I found in the internet (such as fastqc -f fastq and Trimmomatic-0.39)
Here are the STAR commands I'm using:
Any advice you can offer would be greatly appreciated!
There are two different mappings here. First is mapping of reads to the genome and second (first table) is mapping of the mapped reads to gene models. It looks like many of your reads are not mapping in genes. You can confirm that by examining the alignments in a genome viewer like IGV. Unless your data has an obvious issue like DNA contamination there is not much you can do about those. Perhaps those are real transcripts which happen to be not known.