Hello,
I am trying to align and mapping the RNA-seq FASTQ file (1 file) using STAR method for the first time in my life.
The genome file and gtf file for reference are about Apodemus agrarius.
The commands are as below in Linux.
# Apodemus agrarius genome download
genome_url = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_964023405.1/download?include_annotation_type=GENOME_FASTA&filename=GCA_964023405.1.zip"
response = requests.get(genome_url)
with open("genome.zip", "wb") as f:
f.write(response.content)
# Unzipping the genome
!unzip -o genome.zip -d genome
# file move and name change
!mv genome/ncbi_dataset/data/GCA_964023405.1/*_genomic.fna genome/Apodemus_agrarius.fna
# Apodemus agrarius reference gtf file download
annotation_url = "https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/geneset/2024_05/Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz"
annotation_path = os.path.join(work_dir, "Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz")
!wget -O {annotation_path} {annotation_url}
!gunzip {annotation_path}
fastq_dir = os.path.join(output_dir, 'fastq')
fastq_files = {
'SRR9990590_1.fastq.gz': 'sample_R1.fastq.gz',
'SRR9990590_2.fastq.gz': 'sample_R2.fastq.gz'
}
gtf_file = annotation_path.replace('.gz', '') # Unzipped GTF file path
!STAR --runMode genomeGenerate \
--genomeDir genome \
--genomeFastaFiles genome/Apodemus_agrarius.fna \
--sjdbGTFfile {gtf_file} \
--runThreadN 2 \
--genomeSAindexNbases 12 \
--genomeChrBinNbits 18 \
--sjdbOverhang 100
# RNA-seq mapping (Using GTF)
!STAR --genomeDir genome \
--readFilesIn fastq/sample_R1.fastq fastq/sample_R2.fastq \
--runThreadN 2 \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile {gtf_file} \
--outFileNamePrefix output_ \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--genomeSAsparseD 2 \
--limitBAMsortRAM 60000000000
However, the results are as below
Only 37 sequences were shown in the result tab file ("output_ReadsPerGene.out.tab")
Can you guess why that happened ? The very weird thing is when I see the gtf file in my R. The file describes as 37 sequences, although the when the gtf file is opened in excel, it shows 700000 rows of information.
<- Is this related to the problem ?
- I can't get it why 37 only...
The reults files are in this link. https://drive.google.com/drive/folders/1idiF6hoOX8SDQfhlGaIVH1RWtoRUAP7L?usp=drive_link[file link]3
Any suggestions for the code revision for solving the problem is very welcomed !!
Any suggestion will be much appreciated.
Never mix and match sequence/annotation sources.
Get genome from Ensembl as well and use that: https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/genome/Apodemus_agrarius-GCA_964023405.1-unmasked.fa.gz