Hi,
I have been following this post to build my own ribosomal interval list for hg38.p14:
Reference genome: GRCh38.p5, Ensembl release 84
#
#
# 1. Prepare chromosome sizes file from fasta sequence if needed.
#
# ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
# cut -f1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > sizes.genome
#
# 2. Make an interval_list file suitable for CollectRnaSeqMetrics.jar.
#
# Ensembl genes:
#
# ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
#
#
#
# Picard Tools CollectRnaSeqMetrics.jar:
#
# https://broadinstitute.github.io/picard/command-line-overview.html#CollectRnaSeqMetrics
chrom_sizes=sizes.genome
# rRNA interval_list file -------------------------------------------------
# Genes from Ensembl.
genes=/home/dell/Documents/Arindam/Work/ReferenceGenome/Human_84/Annotation/gtf/Homo_sapiens.GRCh38.84.gtf
# Output file suitable for Picard CollectRnaSeqMetrics.jar.
rRNA=GRCh38.p5.rRNA.interval_list
# Sequence names and lengths. (Must be tab-delimited.)
perl -lane 'print "\@SQ\tSN:$F[0]\tLN:$F[1]\tAS:GRCh38"' $chrom_sizes | \
grep -v _ >> "$rRNA"
# Intervals for rRNA transcripts.
grep 'gene_biotype "rRNA"' $genes | \
awk '$3 == "gene"' | \
cut -f1,4,5,7,9 | \
perl -lane '
/gene_id "([^"]+)"/ or die "no gene_id on $.";
print join "\t", (@F[0,1,2,3], $1)
' | \
sort -k1V -k2n -k3n >> "$rRNA"
But when I ran the command, I received an error message saying that 'GRCh38.p14.rRNA.interval_list does not contain intervals'. What should I do?
I am really stuck right now and have no idea what to do next. Any help and guidance is highly appreciated.
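A 'does not contain intervals' message suggests the file has header lines but no interval lines below them. One way to narrow that down is to count the interval lines in the output and the rRNA gene records in the GTF. The following is only a sketch: the helper names count_intervals and count_rrna_genes are made up for illustration, and the demo files contain invented coordinates rather than real annotation.

```shell
# Hypothetical helper: count interval (non-header) lines in an interval_list file.
count_intervals() {
    awk '!/^@/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: count gene records tagged gene_biotype "rRNA" in a GTF file.
count_rrna_genes() {
    awk -F '\t' '$3 == "gene" && $9 ~ /gene_biotype "rRNA"/ { n++ } END { print n + 0 }' "$1"
}

# Demo on tiny synthetic files (invented coordinates, not real annotation).
demo_list=$(mktemp)
demo_gtf=$(mktemp)
printf '@SQ\tSN:1\tLN:1000\tAS:GRCh38\n' > "$demo_list"
printf '1\t100\t200\t+\tGENE_A\n' >> "$demo_list"
printf '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "rRNA";\n' > "$demo_gtf"

n_intervals=$(count_intervals "$demo_list")
n_rrna=$(count_rrna_genes "$demo_gtf")
echo "interval lines: $n_intervals"
echo "rRNA genes in GTF: $n_rrna"
rm -f "$demo_list" "$demo_gtf"
```

If the second count is zero on a real GTF, the grep for gene_biotype "rRNA" matched nothing. GENCODE GTFs, for example, use gene_type rather than gene_biotype, so the pattern may need adjusting for annotations that are not from Ensembl.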
Thank you for your help. I created the rRNA interval list using yours; however, when I ran CollectRnaSeqMetrics using this command
I received this error message
What should I do to fix this?
Many thanks
Ah, I think awk printed the output space-separated instead of tab-separated. I updated the awk command in my answer, adding
BEGIN{OFS="\t"}
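To illustrate why this matters: awk joins the arguments of print with OFS, which defaults to a single space. A small sketch with an invented record (not the actual command from this thread) that compares the output with and without the BEGIN block:

```shell
# One space-separated record with invented values.
record='1 100 200 + GENE_A'

# Default OFS is a single space, so the printed fields stay space-separated.
spaced=$(printf '%s\n' "$record" | awk '{ print $1, $2, $3, $4, $5 }')

# Setting OFS in a BEGIN block makes print join the fields with tabs.
tabbed=$(printf '%s\n' "$record" | awk 'BEGIN { OFS = "\t" } { print $1, $2, $3, $4, $5 }')

# Count tab-delimited columns in each result.
n_spaced=$(printf '%s\n' "$spaced" | awk -F '\t' '{ print NF }')
n_tabbed=$(printf '%s\n' "$tabbed" | awk -F '\t' '{ print NF }')
echo "columns without OFS: $n_spaced"
echo "columns with OFS:    $n_tabbed"
```

Without OFS the whole line counts as one tab-delimited column; with OFS set it splits into five, which is what a tab-delimited interval list needs.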
Thanks a lot. It works perfectly :)
Just a final comment: it looks like the only rRNAs in the GTF are 5S, and depending on how you do the library prep, these might be lost during size selection. You should make sure that the larger rRNAs are included in this type of analysis; otherwise it won't be accurate.
How can I make sure the larger rRNAs are included?
Find an annotation that contains them, or create it yourself.
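Before looking for another annotation, it can help to see which rRNA genes the current GTF actually lists. A rough sketch, assuming an Ensembl-style GTF with gene_biotype and gene_name attributes; the function name list_rrna_gene_names and the demo records (names and coordinates) are hypothetical:

```shell
# Hypothetical helper: tabulate gene_name values of rRNA-biotype genes in a GTF.
# Prints one line per name in the form "<gene_name> <count>".
list_rrna_gene_names() {
    awk -F '\t' '$3 == "gene" && $9 ~ /gene_biotype "rRNA"/' "$1" |
        sed -n 's/.*gene_name "\([^"]*\)".*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Demo on a synthetic GTF (invented names and coordinates).
demo_gtf=$(mktemp)
printf '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "EXAMPLE_5S"; gene_biotype "rRNA";\n' > "$demo_gtf"
printf '1\tsrc\tgene\t300\t400\t.\t+\t.\tgene_id "G2"; gene_name "EXAMPLE_5S"; gene_biotype "rRNA";\n' >> "$demo_gtf"
printf '2\tsrc\tgene\t500\t600\t.\t-\t.\tgene_id "G3"; gene_name "EXAMPLE_PC"; gene_biotype "protein_coding";\n' >> "$demo_gtf"

names=$(list_rrna_gene_names "$demo_gtf")
echo "$names"
rm -f "$demo_gtf"
```

If only 5S-type names show up on the real file, the larger rRNAs would have to come from another annotation source or be added by hand.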