Question

Is there any set threshold for hisat2 alignment and featurecount assignment to select samples further?

0

Entering edit mode

6 months ago

sukeshinik5 ▴ 50

Is there any set threshold for hisat2 alignment and featurecount assignment to select samples further? Please find the script for Hisat2,

samples_dir="/xx321/samples/data"
aligned_dir="/xx321/samples/Aligned_trimmed_data"
hisat2_index="/xx321/samples/HISAT2/grch38/genome"
log_file="/xx321/samples/Trimm_alignment_log.log"
splice_sites_file= "/xx321/samples/Splice_Sites/gencode.v38.splicesites.txt"

# Loop through multiple samples
for R1_samples in ${samples_dir}/*_R1.fastq.gz; do
    sample_name=$(basename "$R1_samples" _R1.fastq.gz)
    R2_samples="${R1_samples/_R1/_R2}"

# HISAT2 alignment command with trimming options and capturing alignment statistics
    hisat2 -p 15 \
        -x "$hisat2_index" \
        --trim5 5 --trim3 5 \
        --ss "$splice_sites_file" \
        -1 "$R1_samples" \
        -2 "$R2_samples" \
        -S /dev/stdout 2> "${aligned_dir}/${sample_name}_hisat2.log" | \
    samtools view -bS - > "${aligned_dir}/${sample_name}_trimmed.aligned.bam"

Following is the output percent for fall in between 85% to 97%. However, when below featureCount command is used the assignment fall in between 65% to 89%. only one sample shows 55% assignment. how to improve the assignment.

annotation_file="/xx321/samples/Annotation_files/gencode.v38.annotation.gtf"
output_file="/xx321/samples/sample.trimmed.count.txt"
input_directory="/xx321/samples/Aligned_trimmed_data"
Splice_site_file="/xx321/samples/Splice_Sites/splicesites.bed"
log_file="/xx321/samples/featureCounts.log"

# Run featureCounts for all BAM files and save output to log file

featureCounts -p -T 14 -a "$annotation_file" -o "$output_file" --largestOverlap -t exon -g gene_id -M -O -J -s 0 "$input_directory"/*_trimmed.sorted.bam > "$log_file" 2>&1

Sorry for such naive questions I am new to the RNA sequencing analysis. it would be great if you could go through the script and let me know if anything needs improvement. Also, please find the script used to create a splice_site_file which was converted to .bed for including in featureCount as -J (hopefully I am doing right).

awk '$3 == "exon"' "${gtf_file}" | awk '{if ($7 == "+") print $1, $5, $4+1; else if ($7 == "-") print $1, $4, $5+1}' | sort -k1,1 -k2,2n | uniq > "${output_file}"

Is there a way to improve assignment in featurecount? also, on what basis we should have samples included in the analysis?

Thank you,

hisat2 alignment featurecount • 609 views

ADD COMMENT • link 6 months ago by sukeshinik5 ▴ 50

1

Entering edit mode

I would not go by these lowlevel stats but see how data look in a PCA. If outliers match with samples that have low assignment percentages then maybe consider removing them. If they cluster reasonably, maybe keep. There is no "fixed" threshold.

ADD REPLY • link 6 months ago by ATpoint 86k

0

Entering edit mode

Thank you for such prompt response. Is there a way to improve the assignment in featureCount?

ADD REPLY • link 6 months ago by sukeshinik5 ▴ 50

0

Entering edit mode

I've never bothered with hisat2 but use STAR HISAT2 V.S. STAR

ADD REPLY • link 6 months ago by ATpoint 86k

0

Entering edit mode

Sure, i will try using that Thankyou .

ADD REPLY • link 6 months ago by sukeshinik5 ▴ 50