STAR alignment for RNA seq - trouble occurred
1
0
Entering edit mode
5 hours ago
ohtang7 ▴ 40

Hello,

I am trying to align and mapping the RNA-seq FASTQ file (1 file) using STAR method for the first time in my life.

The genome file and gtf file for reference are about Apodemus agrarius.

The commands are as below in Linux.

# Apodemus agrarius genome download
genome_url = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_964023405.1/download?include_annotation_type=GENOME_FASTA&filename=GCA_964023405.1.zip"
response = requests.get(genome_url)
with open("genome.zip", "wb") as f:
    f.write(response.content)

# Unzipping the genome
!unzip -o genome.zip -d genome

# file move and name change
!mv genome/ncbi_dataset/data/GCA_964023405.1/*_genomic.fna genome/Apodemus_agrarius.fna

# Apodemus agrarius reference gtf file download
annotation_url = "https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/geneset/2024_05/Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz"

annotation_path = os.path.join(work_dir, "Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz")

!wget -O {annotation_path} {annotation_url}
!gunzip {annotation_path}

fastq_dir = os.path.join(output_dir, 'fastq')
fastq_files = {
    'SRR9990590_1.fastq.gz': 'sample_R1.fastq.gz',
    'SRR9990590_2.fastq.gz': 'sample_R2.fastq.gz'
}

gtf_file = annotation_path.replace('.gz', '')  # Unzipped GTF file path

    !STAR --runMode genomeGenerate \
      --genomeDir genome \
      --genomeFastaFiles genome/Apodemus_agrarius.fna \
      --sjdbGTFfile {gtf_file} \
      --runThreadN 2 \
      --genomeSAindexNbases 12 \
      --genomeChrBinNbits 18 \
      --sjdbOverhang 100
    # RNA-seq mapping (Using GTF)
    !STAR --genomeDir genome \
      --readFilesIn fastq/sample_R1.fastq fastq/sample_R2.fastq \
      --runThreadN 2 \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts \
      --sjdbGTFfile {gtf_file} \
      --outFileNamePrefix output_ \
      --outFilterMultimapNmax 20 \
      --alignSJoverhangMin 8 \
      --alignSJDBoverhangMin 1 \
      --outFilterMismatchNmax 999 \
      --genomeSAsparseD 2 \
      --limitBAMsortRAM 60000000000

However, the results are as below

enter image description here

enter image description here

Only 37 sequences were shown in the result tab file ("output_ReadsPerGene.out.tab")

Can you guess why that happened ? The very weird thing is when I see the gtf file in my R. The file describes as 37 sequences, although the when the gtf file is opened in excel, it shows 700000 rows of information.

<- Is this related to the problem ?

  • I can't get it why 37 only...

The reults files are in this link. https://drive.google.com/drive/folders/1idiF6hoOX8SDQfhlGaIVH1RWtoRUAP7L?usp=drive_link[file link]3

Any suggestions for the code revision for solving the problem is very welcomed !!

Any suggestion will be much appreciated. enter image description here

Alignment STAR RNA-seq gtf • 93 views
ADD COMMENT
2
Entering edit mode
ADD REPLY
1
Entering edit mode
4 hours ago
barslmn ★ 2.3k

Most likely, sequence names from the NCBI FASTA does not match the sequence names from the Ensembl GTF.

From STAR docs:

2.2.2 Which annotations to use? The use of the most comprehensive annotations for a given species is strongly recommended. Very importantly, chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC FASTA files. However, since UCSC uses chr1, chr2, ... naming convention, and ENSEMBL uses 1, 2, ... naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together, unless chromosomes are renamed to match between the FASTA anf GTF files.

ADD COMMENT

Login before adding your answer.

Traffic: 2635 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6