Hello,
I was trying to use rseqc to check strandedness of my fastq files. It requires two inputs: an aligned SAM file and a BED file. To create the SAM file, I aligned my reads using bowtie2.
bowtie2 -q --phred33 --threads 2 -x index -1 SRR5655560_1.atria.fastq -2 SRR5655560_2.atria.fastq -S SRR5655560.sam
This is how the first few lines of the SAM file looks like :
@HD VN:1.0 SO:unsorted
@SQ SN:KRH76248 LN:1234
@SQ SN:KRH76252 LN:915
@SQ SN:KRH76250 LN:1578
@SQ SN:KRH76249 LN:1578
@SQ SN:KRH76247 LN:762
@SQ SN:KRH76251 LN:915
@SQ SN:KRH77790 LN:756
@SQ SN:KRH77789 LN:1433
@SQ SN:KRH76519 LN:2413
To create the BED file, I first downloaded the Glycine_max.Glycine_max_v2.1.55.gtf from Ensembl Plants.This is how the first few lines of the gtf looks like:
#!genome-build Glycine_max_v2.1
#!genome-version Glycine_max_v2.1
#!genome-date 2018-07
#!genome-build-accession GCA_000004515.4
#!genebuild-last-updated 2018-08
18 ena gene 7595306 7600287 . - . gene_id "GLYMA_18G079000"; gene_source "ena"; gene_biotype "protein_coding";
18 ena transcript 7595306 7600287 . - . gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
18 ena exon 7600155 7600287 . - . gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "KRG98527-1"; tag "Ensembl_canonical";
18 ena CDS 7600155 7600287 . - 0 gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "KRG98527"; tag "Ensembl_canonical";
18 ena start_codon 7600285 7600287 . - 0 gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
and converted it to BED using gtf2bed.py from https://github.com/signalbash/how_are_we_stranded_here/blob/master/how_are_we_stranded_here/gtf2bed.py
gtf2bed --gtf Glycine_max.Glycine_max_v2.1.55.gtf --bed output.bed
This is how the BED file looks like:
18 7595305 7600287 KRG98527 0 - 7595305 7600287 0 6 846,344,141,156,219,133, 0,2159,2873,3601,4344,4849,
18 7699667 7702882 KRG98546 0 + 7699667 7702882 0 2 1061,799, 0,2416,
18 51277502 51280090 KRH00628 0 - 51277502 51280090 0 5 337,115,121,66,135, 0,670,1746,2034,2453,
18 51277806 51280022 KRH00629 0 - 51277806 51280022 0 5 33,115,127,66,67, 0,366,1442,1730,2149,
18 47409830 47414924 KRH00168 0 - 47409830 47414924 0 10 693,57,78,88,84,63,74,71,67,605, 0,1102,1328,1490,1712,1878,2092,3112,3388,4489,
18 47409830 47414924 KRH00169 0 - 47409830 47414924 0 11 693,57,78,88,84,56,74,118,71,67,605, 0,1102,1328,1490,1712,1878,2092,2287,3112,3388,4489,
18 7625583 7632431 KRG98532 0 + 7625583 7632431 0 6 1280,95,123,807,217,882, 0,3342,3538,4049,5666,5966,
18 7625542 7632431 KRG98535 0 + 7625542 7632431 0 6 1321,95,123,807,217,882, 0,3383,3579,4090,5707,6007,
18 7625542 7632431 KRG98533 0 + 7625542 7632431 0 6 1321,95,123,807,217,882, 0,3383,3579,4090,5707,6007,
18 7625542 7632431 KRG98531 0 + 7625542 7632431 0 6 1321,95,123,807,217,882, 0,3383,3579,4090,5707,6007,
I've browsed several similar posts and noticed that my SAM and BED files do not look as "neat" in comparison. I'd like to know where I'm going wrong.
Best wishes!
edit: formatting
Answers on this post suggest there the chromosome name in your SAM file and those in your GTF do not match. For example the GTF says '18' and the SAM my say 'chr18'. Might be worth checking :)
Thanks for replying! the following SAM file output:
I have two questions: 1) where is the chromosome number here? the third columns? 2) should I sort my sam file before giving it to infer_experiemnt.py
KRH77790
(names after
SN:
) is the chromosome name (andLN
is the length). You can't use a reference and a GTF that came from two different sources.Thank you for the clarification. But what do you mean by reference and GTF from two different sources. I'm using the reference genome and gtf file for Glycine_max.Glycine_max_v2.1.
Did you check all
SQ
lines to see if you have an entry for I guess chromosome 18?Chromosome name looks like
18
in your BED file but is there a@SQ SN:18 LN:NNNNN
line in your alignment file?BTW: Have you tried running
infer_experiment.py
with files you have. Perhaps there is nothing wrong with them. Do not go on how they "look".GenoMax Hello, I rechecked the sources for ref genome and gtf. You were correct! You have no idea how thankful I am for people like you for taking the time to help a bioinfo noob. Again, thank you so much!
What turned out to be the issue ultimately?
My dumbass was using a transcripts cdna.fasta file instead of dna.top_level.fasta to build the bowtie2 index.
Hi pubsurfted,
I ran into the same issue using a reference transcriptome with HISAT2 from the Ensembl database. For others with similar experience, I will share details. For infer_experiment.py from the rseqc package, I passed .sam files generated from aligning (via HISAT2) my PE bulk RNA-sequencing reads to Ensembl's reference transcriptome (Homo_sapiens.GRCh38.cdna.all.fa) and a .bed file converted from Ensembl's gene annotation file (Homo_sapiens.GRCh38.112.gtf). When passing the .sam file and .bed file to infer_experiment.py, I receive this error:
As HISAT2 is specifically designed for aligning to a genome, aligning to the transcriptome will lead to problematic .sam files (i.e., may not be recognized as a known data type by programs expecting .sam files). As GenoMax described, the second column has chromosome names. In the context of Ensembl and HISAT2, a .sam file made from aligning to a reference transcriptome and sorted by read names (via Samtools) has a header like this:
On the contrary, a sorted .sam file made from aligning to a reference genome appears like this:
Alignment to the Ensembl genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa) via HISAT2 resolved the error, generating appropriate .sam files that were recognized by rseqc programs.
Hope this clarifies this important detail for others.
Maze