Is there any set threshold for hisat2 alignment and featurecount assignment to select samples further? Please find the script for Hisat2,
samples_dir="/xx321/samples/data"
aligned_dir="/xx321/samples/Aligned_trimmed_data"
hisat2_index="/xx321/samples/HISAT2/grch38/genome"
log_file="/xx321/samples/Trimm_alignment_log.log"
splice_sites_file= "/xx321/samples/Splice_Sites/gencode.v38.splicesites.txt"
# Loop through multiple samples
for R1_samples in ${samples_dir}/*_R1.fastq.gz; do
sample_name=$(basename "$R1_samples" _R1.fastq.gz)
R2_samples="${R1_samples/_R1/_R2}"
# HISAT2 alignment command with trimming options and capturing alignment statistics
hisat2 -p 15 \
-x "$hisat2_index" \
--trim5 5 --trim3 5 \
--ss "$splice_sites_file" \
-1 "$R1_samples" \
-2 "$R2_samples" \
-S /dev/stdout 2> "${aligned_dir}/${sample_name}_hisat2.log" | \
samtools view -bS - > "${aligned_dir}/${sample_name}_trimmed.aligned.bam"
Following is the output percent for fall in between 85% to 97%. However, when below featureCount command is used the assignment fall in between 65% to 89%. only one sample shows 55% assignment. how to improve the assignment.
annotation_file="/xx321/samples/Annotation_files/gencode.v38.annotation.gtf"
output_file="/xx321/samples/sample.trimmed.count.txt"
input_directory="/xx321/samples/Aligned_trimmed_data"
Splice_site_file="/xx321/samples/Splice_Sites/splicesites.bed"
log_file="/xx321/samples/featureCounts.log"
# Run featureCounts for all BAM files and save output to log file
featureCounts -p -T 14 -a "$annotation_file" -o "$output_file" --largestOverlap -t exon -g gene_id -M -O -J -s 0 "$input_directory"/*_trimmed.sorted.bam > "$log_file" 2>&1
Sorry for such naive questions I am new to the RNA sequencing analysis. it would be great if you could go through the script and let me know if anything needs improvement. Also, please find the script used to create a splice_site_file which was converted to .bed for including in featureCount as -J (hopefully I am doing right).
awk '$3 == "exon"' "${gtf_file}" | awk '{if ($7 == "+") print $1, $5, $4+1; else if ($7 == "-") print $1, $4, $5+1}' | sort -k1,1 -k2,2n | uniq > "${output_file}"
Is there a way to improve assignment in featurecount? also, on what basis we should have samples included in the analysis?
Thank you,
I would not go by these lowlevel stats but see how data look in a PCA. If outliers match with samples that have low assignment percentages then maybe consider removing them. If they cluster reasonably, maybe keep. There is no "fixed" threshold.
Thank you for such prompt response. Is there a way to improve the assignment in featureCount?
I've never bothered with hisat2 but use STAR HISAT2 V.S. STAR
Sure, i will try using that Thankyou .