Counts all read zero after alignment?
0
0
Entering edit mode
3.3 years ago
tul66893 • 0

Hi everyone,

I'm looking for some help because I just ran my alignment code last night for 12 RNAseq samples. After taking a look at the counts that were generated this morning, all of them are reading 0 for every gene. Oddly, here are the results in the terminal...

95887304 reads; of these:
  95887304 (100.00%) were unpaired; of these:
    2313614 (2.41%) aligned 0 times
    76638154 (79.93%) aligned exactly 1 time
    16935536 (17.66%) aligned >1 times
97.59% overall alignment rate
[bam_sort_core] merging from 32 files and 16 in-memory blocks...

Seems like the code is working for alignment but giving odd counts. Anyone have any ideas where to look?

alignment rna-seq • 2.2k views
ADD COMMENT
1
Entering edit mode

Make sure the GTF and fasta reference file have the same chromosome names, not like 1 in one file and chr1 in the other.

ADD REPLY
0
Entering edit mode

Here is an example of the first ten lines from of the FASTA files.

    @VH00577:1:AAAGGJGHV:1:1101:47524:1057 1:N:0:TCCACGTT
NCGAAGAATCAGAATAGGTGTTGATAGAGAATTGGGTCTCCACCTCCAGCGGGGTCGAAG
+
#---CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH00577:1:AAAGGJGHV:1:1101:51539:1057 1:N:0:TCCACGTT
NCCGATTTTGCGAATCTTCTCAGCTTCTGCTTGTGCCAAGAGGACTTGTTTCACCTTTTC
+
#-CC;CC;-C;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH00577:1:AAAGGJGHV:1:1101:65475:1057 1:N:0:TCCACGTT
GGGTGCTATATAAGACAATTTTGTCATTTTCAAATATTGTTCTTCGTAACATTTATAAGA

I don't know where the chromosome names exactly are. Here is the first couple of lines form the GTF file. I don't see the "chr1" designation. Is it possible I am using an incorrect GTF file? It's titled Rattus_norvegicusRnor_6.0.104.gtf

#!genome-build Rnor_6.0
#!genome-version Rnor_6.0
#!genome-date 2014-07
#!genome-build-accession GCA_000001895.4
#!genebuild-last-updated 2017-01
1   havana  gene    261992430   261993135   .   -   .   gene_id "ENSRNOG00000054910"; gene_version "1"; gene_name "AC096317.3"; gene_source "havana"; gene_biotype "sense_intronic";
1   havana  transcript  261992430   261993135   .   -   .   gene_id "ENSRNOG00000054910"; gene_version "1"; transcript_id "ENSRNOT00000079974"; transcript_version "1"; gene_name "AC096317.3"; gene_source "havana"; gene_biotype "sense_intronic"; transcript_name "AC096317.3-201"; transcript_source "havana"; transcript_biotype "sense_intronic";
1   havana  exon    261992430   261993135   .   -   .   gene_id "ENSRNOG00000054910"; gene_version "1"; transcript_id "ENSRNOT00000079974"; transcript_version "1"; exon_number "1"; gene_name "AC096317.3"; gene_source "havana"; gene_biotype "sense_intronic"; transcript_name "AC096317.3-201"; transcript_source "havana"; transcript_biotype "sense_intronic"; exon_id "ENSRNOE00000533050"; exon_version "1";
1   ensembl_havana  gene    261989178   262015153   .   -   .   gene_id "ENSRNOG00000045838"; gene_version "2"; gene_name "Hps1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
1   ensembl_havana  transcript  261989178   262014066   .   -   .   gene_id "ENSRNOG00000045838"; gene_version "2"; transcript_id "ENSRNOT00000087083"; transcript_version "1"; gene_name "Hps1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Hps1-208"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding";
ADD REPLY
0
Entering edit mode

fasta reference, not fastq, so the file you built the index from.

ADD REPLY
0
Entering edit mode

I'm not actually sure where I would find the fasta file. I'm completely new to coding in general so I apologize. I do have a folder I made called rn6_index which includes all the genome.1.ht2, genome.2.ht2, etc. files. There's an additional file in there called make_rn6.sh. Would it be located in any of these?

ADD REPLY
0
Entering edit mode

No worries :) Ah I see, you downloaded an index. Then please run for any of your bam files

samtools view -H your.bam

This will show you the chromosome names so you can check whether it matches the GTF.

ADD REPLY
0
Entering edit mode

Thanks for understanding! So here is a couple lines of that result.

 @HD    VN:1.0  SO:queryname
@SQ SN:chr1 LN:282763074
@SQ SN:chr10    LN:112626471
@SQ SN:chr10_KL568008v1_random  LN:2704
@SQ SN:chr10_KL568009v1_random  LN:27282
@SQ SN:chr10_KL568010v1_random  LN:28524
@SQ SN:chr10_AABR07050887v1_random  LN:1962
@SQ SN:chr10_KL568011v1_random  LN:24114
@SQ SN:chr10_AABR07050894v1_random  LN:590
@SQ SN:chr10_KL568012v1_random  LN:22748
@SQ SN:chr10_KL568013v1_random  LN:5986
@SQ SN:chr10_AABR07050850v1_random  LN:548
@SQ SN:chr10_KL568014v1_random  LN:8466
@SQ SN:chr10_KL568015v1_random  LN:15525
@SQ SN:chr10_KL568016v1_random  LN:26063

Looks like its using the chr notation... should I use a different GTF file then?

ADD REPLY
0
Entering edit mode

Indeed there is the mismatch. Can you link the source of the index (so the download link or website it is from)?

ADD REPLY
0
Entering edit mode

Here is the link to where I obtained the rat genome files.

https://useast.ensembl.org/Rattus_norvegicus/Info/Index

ADD REPLY
0
Entering edit mode

Actually now that I can see the problem, it seems my sam/bam files are in UCSC format (chr1) while the GTF I downloaded is in the Ensemble format (1). I've been searching for rat GTF files in UCSC and cannot seem to find one. This leads me to believe I should maybe convert my sam/bam files to Ensemble? Not sure how to go about that or if that is even the logical answer.

ADD REPLY
0
Entering edit mode

You can't convert like that. You have to start from scratch. Make a genome index with your new matching genome + gtf with STAR, and align with that.

ADD REPLY
0
Entering edit mode

Can you provide your code?

ADD REPLY
0
Entering edit mode

Sure! Here is what I can show. Just for more reference, I have 12 samples, with 6 replicates each all located within a folder titled "Day1". This language is just because the last code I ran included day1 and day2 samples.

 conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
#conda install hisat2
#conda install htseq

# Build the mm10 HISAT2 index if necessary
multi_dir=/hidden
index=/hidden/rn6_index/genome
gtf=/home/hidden/rn6_gtf/Rattus_norvegicus.Rnor_6.0.104.gtf


# Set up counts and bigWig directories
mkdir $multi_dir/counts $multi_dir/bw $multi_dir/FASTQ/sorted

# Run alignment. Verify strand orientation in HISAT2 and HTSeq
NUM_CORES=16

# Define your group names and replicates
sample_names="cntrl-008 cntrl-009 cntrl-010 cntrl-011 cntrl-012 cntrl-013 expt-002 expt-003 expt-004 expt-005 expt-006 expt-007"


for i in $sample_name
do
    day1_samples=`ls /hidden/FASTQ/Day1/$i*.fastq.gz | grep .gz$ | tr "\n" "," | sed 's/.$//'`
    #day2_samples=`ls /home/hidden/FASTQ/Day2/$i*.fastq.gz | grep .gz$ | tr "\n" "," | sed 's/.$//'`
    echo "Samples to process for this iteration of the for loop: "
    echo $samples
    echo "..."


        # Perform alignment (paired end)
        #hisat2 -p $NUM_CORES --rna-strandness R -x $index -U $day1_samples,$day2_samples -S ${i}.sam
        hisat2 -p $NUM_CORES --rna-strandness R -x $index -U $day1_samples -S ${i}.sam

        # Convert BAMs to SAM and then sort by queryname for htseq-count
        samtools view -@ $NUM_CORES -b -o ${i}.bam ${i}.sam
        samtools sort -n -o ${i}_sorted.bam -@ $NUM_CORES ${i}.bam

        # Perform transcript counting using GTF defined earlier
        htseq-count --stranded=yes -f bam ${i}_sorted.bam $gtf > ${i}_count.txt

        rm ${i}.sam
        rm ${i}.bam
done
ADD REPLY
0
Entering edit mode

index=/hidden/rn6_index/genome

The fasta file is almost certainly sitting with the index files, that's what you need to check. If you really can't find it, you need to start over with genome and gtf files you know match.

ADD REPLY
0
Entering edit mode

Hi tul66893, why did you delete this comment?

ADD REPLY

Login before adding your answer.

Traffic: 986 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6