Question

Picard CreateSequenceDictionary - file doesn't look right?

0

Entering edit mode

4.1 years ago

bertb ▴ 20

Hello,

I am trying to eventually run CollectRnaSeqMetrics using PICARD, with the following command:

find UWN_t2.bam -exec echo java -jar ~/workspace/rnaseq/student_tools/picard.jar CollectRnaSeqMetrics I={} O=picard/{}.RNA_Metrics REF_FLAT=$RNA_HOME/refs/sacCer3.ncbiRefSeq.ref_flat.txt STRAND=SECOND_READ_TRANSCRIPTION_STRAND RIBOSOMAL_INTERVALS=$RNA_HOME/refs/ref_ribosome.interval_list \; | sh

but I one of the lines of the output gave me the following warning:

WARNING CollectRnaSeqMetrics    The RIBOSOMAL_INTERVALS file, /home/user/workspace/rnaseq/refs/ref_ribosome.interval_list does not contain intervals

I looked back through some of the files, and noticed that my .dict file looks just like my ref_ribosome.interval_list file:

cat sacCer3.dict
@HD VN:1.5
@SQ SN:chrI LN:230218   M5:6681ac2f62509cfc220d78751b8dc524 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrII    LN:813184   M5:97a317c689cbdd7e92a5c159acd290d2 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrIII   LN:316620   M5:54f4a74aa6392d9e19b82c38aa8ab345 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrIV    LN:1531933  M5:74180788027e20df3de53dcb2367d9e3 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrIX    LN:439888   M5:4eae53ae7b2029b7e1075461c3eb9aac UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrV LN:576874   M5:d2787193198c8d260f58f2097f9e1e39 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrVI    LN:270161   M5:b7ebc601f9a7df2e1ec5863deeae88a3 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrVII   LN:1090940  M5:a308c7ebf0b67c4926bc190dc4ba8ed8 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrVIII  LN:562643   M5:f66a4f8eef89fc3c3a393fe0210169f1 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrX LN:745751   M5:6757b8c7d9cca2c56401e2484cf5e2fb UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrXI    LN:666816   M5:e72df2471be793f8aa06850348a896fa UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrXII   LN:1078177  M5:77945d734ab92ad527d8920c9d78ac1c UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrXIII  LN:924431   M5:073f9ff1c599c1a6867de2c7e4355394 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrXIV   LN:784333   M5:188bca5110182a786cd42686ec6882c6 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrXV    LN:1091291  M5:7e02090a38f05459102d1a9a83703534 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrXVI   LN:948066   M5:232475e9a61a5e07f9cb2df4a2dad757 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
@SQ SN:chrM LN:85779    M5:71c39cf065b8d574f636b654c274cf1b UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa

I'm guessing this is not what the .dict file is supposed to look like. The yeast fasta file sacCer3.fa seems fine, and so I'm wondering why when I create the .dict file with:

java -jar ~/workspace/rnaseq/student_tools/picard.jar CreateSequenceDictionary R=sacCer3.fa O=sacCer3.dict

the output is

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    CreateSequenceDictionary -R sacCer3.fa -O sacCer3.dict
**********


    11:22:22.142 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user/workspace/rnaseq/student_tools/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Wed Oct 14 11:22:22 EDT 2020] CreateSequenceDictionary OUTPUT=sacCer3.dict REFERENCE=sacCer3.fa    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Wed Oct 14 11:22:22 EDT 2020] Executing as user@user-VirtualBox on Linux 5.4.0-48-generic amd64; OpenJDK 64-Bit Server VM 11.0.8+10-post-Ubuntu-0ubuntu120.04; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.15-SNAPSHOT
    [Wed Oct 14 11:22:22 EDT 2020] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.02 minutes.
    Runtime.totalMemory()=373030912

... and a strange file is created.

Any help greatly appreciated! Thanks

Note: Running on a VM

RNA-Seq sequencing • 1.6k views

ADD COMMENT • link updated 4.1 years ago by Biostar 20 • written 4.1 years ago by bertb ▴ 20

0

Entering edit mode

I'm guessing this is not what the .dict file is supposed to look like

Why do you think that? The dict file looks fine to me.

ADD REPLY • link 4.1 years ago by Ram 44k

0

Entering edit mode

Ok, I didn't realize that was what it was supposed to look like. But is the .interval_list file supposed to look the same?

When I create that file:

java -jar ~/workspace/rnaseq/student_tools/picard.jar BedToIntervalList I=ref_ribosome.bed O=ref_ribosome.interval_list SD=sacCer3.dict
INFO    2020-10-14 11:22:58 BedToInte

rvalList    

    ********** NOTE: Picard's command line syntax is changing.
    **********
    ********** For more information, please see:
    ********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
    **********
    ********** The command line looks like this in the new syntax:
    **********
    **********    BedToIntervalList -I ref_ribosome.bed -O ref_ribosome.interval_list -SD sacCer3.dict
    **********


    11:22:59.523 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user/workspace/rnaseq/student_tools/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Wed Oct 14 11:22:59 EDT 2020] BedToIntervalList INPUT=ref_ribosome.bed SEQUENCE_DICTIONARY=sacCer3.dict OUTPUT=ref_ribosome.interval_list    SORT=true UNIQUE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Wed Oct 14 11:22:59 EDT 2020] Executing as user@user-VirtualBox on Linux 5.4.0-48-generic amd64; OpenJDK 64-Bit Server VM 11.0.8+10-post-Ubuntu-0ubuntu120.04; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.15-SNAPSHOT
    [Wed Oct 14 11:22:59 EDT 2020] picard.util.BedToIntervalList done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=373030912

I get a file that looks like this:

  cat ref_ribosome.interval_list
    @HD VN:1.5  SO:coordinate
    @SQ SN:chrI LN:230218   M5:6681ac2f62509cfc220d78751b8dc524 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrII    LN:813184   M5:97a317c689cbdd7e92a5c159acd290d2 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrIII   LN:316620   M5:54f4a74aa6392d9e19b82c38aa8ab345 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrIV    LN:1531933  M5:74180788027e20df3de53dcb2367d9e3 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrIX    LN:439888   M5:4eae53ae7b2029b7e1075461c3eb9aac UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrV LN:576874   M5:d2787193198c8d260f58f2097f9e1e39 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrVI    LN:270161   M5:b7ebc601f9a7df2e1ec5863deeae88a3 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrVII   LN:1090940  M5:a308c7ebf0b67c4926bc190dc4ba8ed8 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrVIII  LN:562643   M5:f66a4f8eef89fc3c3a393fe0210169f1 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrX LN:745751   M5:6757b8c7d9cca2c56401e2484cf5e2fb UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrXI    LN:666816   M5:e72df2471be793f8aa06850348a896fa UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrXII   LN:1078177  M5:77945d734ab92ad527d8920c9d78ac1c UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrXIII  LN:924431   M5:073f9ff1c599c1a6867de2c7e4355394 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrXIV   LN:784333   M5:188bca5110182a786cd42686ec6882c6 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrXV    LN:1091291  M5:7e02090a38f05459102d1a9a83703534 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrXVI   LN:948066   M5:232475e9a61a5e07f9cb2df4a2dad757 UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa
    @SQ SN:chrM LN:85779    M5:71c39cf065b8d574f636b654c274cf1b UR:file:/home/user/workspace/rnaseq/refs/sacCer3.fa

Is this right? Just wondering where I've gone wrong here

ADD REPLY • link 4.1 years ago by bertb ▴ 20

0

Entering edit mode

The interval list is supposed to contain all intervals from the source BED file. Do the chromosome names match between the BED and DICT files?

ADD REPLY • link 4.1 years ago by Ram 44k

0

Entering edit mode

I think it's a problem with my ref_ribosome.gtf file, which I create before converting it to a .bed file. The original yeast reference gtf file (sacCer3.ncbiRefSeq.gtf) looks fine, but when I try to extract the location of the ribosomal sequences:

grep --color=none -i rrna sacCer3.ncbiRefSeq.gtf > ref_ribosome.gtf

...the file that gets created looks empty. Is there a problem with this command?

Thanks,

ADD REPLY • link 4.1 years ago by bertb ▴ 20

1

Entering edit mode

The command is fine, but if it results in an empty output, all that tells me is that there is no rrna string in the gtf file. Are you sure rRNA entries definitely contain the string rrna? I'm assuming you're looking for rRNA entries - let me know if that is not the case.

ADD REPLY • link 4.1 years ago by Ram 44k

0

Entering edit mode

You were right - looking in the .gtf file, there were no rrna strings. I had to specifically pull out the gene names:

grep "RDN37\|ETS2\|RDN2\|ITS2\|RDN5\|ITS1\|RDN18\|ETS1\|RDN3" sacCer3.ncbiRefSeq.gtf > ref_ribosome_1.gtf

Thanks for your help with this. By the way, if this is not too much of a departure from this thread, what is the most common way to view the CollectRnaSeqMetrics output, for instance, in an Excel file? Thanks

ADD REPLY • link 4.1 years ago by bertb ▴ 20