Entering edit mode
4.4 years ago
newbie
▴
130
I'm using rseqc on my samples bam file to check the strand specific. For this I used gencode gtf file. Initially I converted gencode gtf to bed file like:
This is how the gtf looks:
##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90)
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2017-08-01
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
I created bed file with:
awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' gencode.v27.annotation.gtf | gtf2bed - > gencode.v27.annotation.bed
And then I used that bed file for infer experiment.py
python infer_experiment.py -r gencode.v27.annotation.bed -I Sample.sorted.bam
Error:
Reading reference gene model gencode.v27.annotation.bed ... Done
Loading SAM/BAM file ... Finished
Total 0 usable reads were sampled
Unknown Data type
What does the BED file look like?
My bam file looks like this:
Without
rseqc
simply looking at tags and genes can we tell whether data is strand specific or not. If so how?can you please help me with this...for my previous comment.
No we can't tell by just looking at alignments. What result did you get from running the script?
This is PairEnd Data Fraction of reads failed to determine: 0.0000
As "1+-,1-+,2++,2--" gets more signal which means fr-firststrand. So, in HISAT2 alignment step the –rna –strandness parameter will be RF.
A colleague of mine without
rseqc
by looking at tags and genes told me that the data is strand specific. But I'm curious to know how he said that.This means you have a standard (dUTP-based) strand-specific library. They may have done it looking at the
XS:A
tags. Every aligner does not do that so safer to check with the script.okay. you mean in the above bam it is seen that
XS:A
is given-
so this tells the data is strand specific? Am I right? and how based on genes we can say that?