Question

STAR alignment for RNA seq - trouble occurred

0

Entering edit mode

4 days ago

ohtang7 ▴ 40

Hello,

I am trying to align and mapping the RNA-seq FASTQ file (1 file) using STAR method for the first time in my life.

The genome file and gtf file for reference are about Apodemus agrarius.

The commands are as below in Linux.

# Apodemus agrarius genome download
genome_url = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_964023405.1/download?include_annotation_type=GENOME_FASTA&filename=GCA_964023405.1.zip"
response = requests.get(genome_url)
with open("genome.zip", "wb") as f:
    f.write(response.content)

# Unzipping the genome
!unzip -o genome.zip -d genome

# file move and name change
!mv genome/ncbi_dataset/data/GCA_964023405.1/*_genomic.fna genome/Apodemus_agrarius.fna

# Apodemus agrarius reference gtf file download
annotation_url = "https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/geneset/2024_05/Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz"

annotation_path = os.path.join(work_dir, "Apodemus_agrarius-GCA_964023405.1-2024_05-genes.gtf.gz")

!wget -O {annotation_path} {annotation_url}
!gunzip {annotation_path}

fastq_dir = os.path.join(output_dir, 'fastq')
fastq_files = {
    'SRR9990590_1.fastq.gz': 'sample_R1.fastq.gz',
    'SRR9990590_2.fastq.gz': 'sample_R2.fastq.gz'
}

gtf_file = annotation_path.replace('.gz', '')  # Unzipped GTF file path

    !STAR --runMode genomeGenerate \
      --genomeDir genome \
      --genomeFastaFiles genome/Apodemus_agrarius.fna \
      --sjdbGTFfile {gtf_file} \
      --runThreadN 2 \
      --genomeSAindexNbases 12 \
      --genomeChrBinNbits 18 \
      --sjdbOverhang 100
    # RNA-seq mapping (Using GTF)
    !STAR --genomeDir genome \
      --readFilesIn fastq/sample_R1.fastq fastq/sample_R2.fastq \
      --runThreadN 2 \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts \
      --sjdbGTFfile {gtf_file} \
      --outFileNamePrefix output_ \
      --outFilterMultimapNmax 20 \
      --alignSJoverhangMin 8 \
      --alignSJDBoverhangMin 1 \
      --outFilterMismatchNmax 999 \
      --genomeSAsparseD 2 \
      --limitBAMsortRAM 60000000000

However, the results are as below

enter image description here

Only 37 sequences were shown in the result tab file ("output_ReadsPerGene.out.tab")

Can you guess why that happened ? The very weird thing is when I see the gtf file in my R. The file describes as 37 sequences, although the when the gtf file is opened in excel, it shows 700000 rows of information.

<- Is this related to the problem ?

I can't get it why 37 only...

The reults files are in this link. https://drive.google.com/drive/folders/1idiF6hoOX8SDQfhlGaIVH1RWtoRUAP7L?usp=drive_link[file link]3

Any suggestions for the code revision for solving the problem is very welcomed !!

Any suggestion will be much appreciated. enter image description here

Alignment STAR RNA-seq gtf • 374 views

ADD COMMENT • link updated 2 days ago by GenoMax 147k • written 4 days ago by ohtang7 ▴ 40

4

Entering edit mode

Never mix and match sequence/annotation sources.

Get genome from Ensembl as well and use that: https://ftp.ensembl.org/pub/rapid-release/species/Apodemus_agrarius/GCA_964023405.1/ensembl/genome/Apodemus_agrarius-GCA_964023405.1-unmasked.fa.gz

ADD REPLY • link 4 days ago by GenoMax 147k

0

Entering edit mode

Thank you so much! After using the genome_unmasked file from Ensembl that you mentioned, I was able to get results for around 25,000 genes. However, I noticed that there are still quite a few unmapped reads, even though the log file indicates that the number of unmapped reads isn't that high (I'm not sure why there's a discrepancy).

Here are the log file results:ReadsPegGene.out.tab enter image description here

Here are the log file results: Unmapped rates_log_out_file.png

Do you think I need some pre-process steps (trimming ?) before this two STAR steps ?

QC control steps were used in other alignment method I found in the internet (such as fastqc -f fastq and Trimmomatic-0.39)
- Or just handling the parameter of STAR method would be possible option practically ?

STAR --runMode genomeGenerate \
      --genomeDir genome \
      --genomeFastaFiles genome/Apodemus_agrarius.fa \
      --sjdbGTFfile {gtf_file} \
      --runThreadN 2 \
      --genomeSAindexNbases 13 \
      --genomeChrBinNbits 18 \
      --sjdbOverhang 100

# RNA-seq reads mapping
!STAR --genomeDir genome \
      --readFilesIn fastq/sample_R1.fastq fastq/sample_R2.fastq \
      --runThreadN 2 \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts \
      --sjdbGTFfile {gtf_file} \
      --outFileNamePrefix output_ \
      --outFilterMultimapNmax 10 \
      --alignSJoverhangMin 8 \
      --alignSJDBoverhangMin 1 \
      --outFilterMismatchNmax 2 \
      --outFilterScoreMinOverLread 0.9 \
      --outFilterMatchNminOverLread 0.9 \
      --alignIntronMax 1000000 \
      --outFilterMismatchNoverLmax 0.01 \
      --outFilterMismatchNoverReadLmax 0.05 \
      --genomeSAsparseD 2 \
      --limitBAMsortRAM 60000000000

Here are the STAR commands I'm using:

Any advice you can offer would be greatly appreciated!

ADD REPLY • link updated 2 days ago by GenoMax 147k • written 3 days ago by ohtang7 ▴ 40

1

Entering edit mode

There are two different mappings here. First is mapping of reads to the genome and second (first table) is mapping of the mapped reads to gene models. It looks like many of your reads are not mapping in genes. You can confirm that by examining the alignments in a genome viewer like IGV. Unless your data has an obvious issue like DNA contamination there is not much you can do about those. Perhaps those are real transcripts which happen to be not known.

ADD REPLY • link 2 days ago by GenoMax 147k

score 2 · Answer 1 · 2024-11-25

Most likely, sequence names from the NCBI FASTA does not match the sequence names from the Ensembl GTF.

From STAR docs:

2.2.2 Which annotations to use? The use of the most comprehensive annotations for a given species is strongly recommended. Very importantly, chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC FASTA files. However, since UCSC uses chr1, chr2, ... naming convention, and ENSEMBL uses 1, 2, ... naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together, unless chromosomes are renamed to match between the FASTA anf GTF files.