Hello, I am stuck on a problem with Picard's 'ValidateSamFile'. I have looked for this problem on different forums, but I didn't find a solution.
I used HISAT2 to align the paired-end fastq files (obtained after trimming with Trimmomatic) against the Ensembl reference index:
hisat2 -p 8 --dta --summary-file summary -x 'path/to/Ensembl_ref/indexfile' -1 '/path/to/sample/S1_1p.fastq.gz' -2 '/path/to/sample/S1_2p.fastq.gz' -U '/path/to/sample/S1_1u.fastq.gz' -U '/path/to/sample/S2_2u.fastq.gz' -S S1.sam
After that, I tried to validate the SAM file with Picard's 'ValidateSamFile':
java -jar picard.jar ValidateSamFile I=S.sam IGNORE_WARNINGS=true MODE=VERBOSE
which gave me this error:
> [Mon May 13 12:56:42 IST 2019] Executing as genomics@genomics-Precision-3630-Tower on Linux 4.13.0-1028-oem amd64; OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.0-SNAPSHOT
> WARNING 2019-05-13 12:56:42 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
> INFO 2019-05-13 12:57:16 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:34s. Time for last 10,000,000: 34s. Last read position: 8:115,115,834
> INFO 2019-05-13 12:57:57 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:15s. Time for last 10,000,000: 40s. Last read position: 7:43,608,287
> ERROR: Read name S1.916145.1, Mate not found for paired read
> ERROR: Read name S1.916145.2, Mate not found for paired read
> ERROR: Read name S1.9977032.1, Mate not found for paired read
> ERROR: Read name S1.9977032.2, Mate not found for paired read
> ERROR: Read name S1.4916847.1, Mate not found for paired read
Following the Picard guidelines, I ran FixMateInformation to fix the above error, using the following command:
java -jar picard.jar FixMateInformation I=S1.sam O=new_fixed_S1.sam
The error seemed to be fixed:
> [Mon May 13 12:52:03 IST 2019] Executing as genomics@genomics-Precision-3630-Tower on Linux 4.13.0-1028-oem amd64; OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.0-SNAPSHOT
> INFO 2019-05-13 12:52:03 FixMateInformation Sorting input into queryname order.
> INFO 2019-05-13 12:53:34 SortingCollection Creating merging iterator from 43 files
> INFO 2019-05-13 12:53:34 FixMateInformation Sorting by queryname complete.
> INFO 2019-05-13 12:53:34 FixMateInformation Output will be sorted by unsorted
> INFO 2019-05-13 12:53:34 FixMateInformation Traversing query name sorted records and fixing up mate pair information.
> INFO 2019-05-13 12:53:36 FixMateInformation Processed 1,000,000 records. Elapsed time: 00:00:02s. Time for last 1,000,000: 2s. Last read position: */*
> INFO 2019-05-13 12:53:39 FixMateInformation Processed 2,000,000 records. Elapsed time: 00:00:05s. Time for last 1,000,000: 2s. Last read position: 16:173,485
> INFO 2019-05-13 12:53:41 FixMateInformation Processed 3,000,000 records. Elapsed time: 00:00:07s. Time for last 1,000,000: 2s. Last read position: MT:2,103
> INFO 2019-05-13 12:53:44 FixMateInformation Processed 4,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 2s. Last read position: 2:101,004,226
Next, I revalidated the processed SAM file with ValidateSamFile:
java -jar picard.jar ValidateSamFile I=new_fixed_S1.sam IGNORE_WARNINGS=true MODE=SUMMARY IGNORE=MISSING_TAG_NM
which resulted in:
> ## HISTOGRAM java.lang.String
>Error Type Count
>ERROR:MATE_NOT_FOUND 18412234
That means the error is not being fixed. I repeated the whole process, assuming the error would go away after several attempts, but I am simply going around in a loop with no progress.
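One thing that could be checked at this point: in SAM, the two mates of a pair are expected to share the same read name (QNAME), and the names in the errors above end in `.1` and `.2`. Whether that is really the cause here is not confirmed in this thread, so the following is only a minimal Python sketch under that assumption. The function name `mate_report` and the suffix pattern are hypothetical, chosen for illustration. It counts read names flagged as paired that have both a first and a second mate, once with the names as they are and once with a trailing `.1`/`.2` or `/1`/`/2` stripped.

```python
import re
from collections import defaultdict

# SAM FLAG bits (from the SAM specification)
PAIRED, FIRST, SECOND = 0x1, 0x40, 0x80
SECONDARY, SUPPLEMENTARY = 0x100, 0x800

# Assumption: mates are distinguished by a trailing ".1"/".2" or "/1"/"/2"
SUFFIX = re.compile(r"[./][12]$")


def mate_report(sam_lines, strip_suffix=False):
    """Count paired-flagged read names that have both mates present."""
    seen = defaultdict(int)  # read name -> FIRST/SECOND bits seen so far
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        qname, flag = line.split("\t", 2)[:2]
        flag = int(flag)
        # Only primary records of reads flagged as paired are relevant
        if not (flag & PAIRED) or (flag & (SECONDARY | SUPPLEMENTARY)):
            continue
        if strip_suffix:
            qname = SUFFIX.sub("", qname)
        seen[qname] |= flag & (FIRST | SECOND)
    complete = sum(1 for bits in seen.values() if bits == (FIRST | SECOND))
    return {
        "names": len(seen),
        "complete_pairs": complete,
        "incomplete": len(seen) - complete,
    }
```

Passing an open file handle for the SAM file, for example `mate_report(open("S1.sam"), strip_suffix=True)`, returns the counts. If almost every name is incomplete as-is but complete with `strip_suffix=True`, that would point to the name suffixes rather than to truly missing mates. The sketch keeps every read name in memory, so trying it on a subset of the file first may be sensible.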
Finally, I ignored the error and ran Picard's 'MarkDuplicates' using the following command:
java -jar picard.jar MarkDuplicates I=new_fixed_S1.sam O=new_S1.sam M=marked_dup_metrics.txt REMOVE_DUPLICATES=true READ_NAME_REGEX=null
It produced:
> [Mon May 13 13:13:13 IST 2019] Executing as genomics@genomics-Precision-3630-Tower on Linux 4.13.0-1028-oem amd64; OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.0-SNAPSHOT
> INFO 2019-05-13 13:13:13 MarkDuplicates Start of doWork freeMemory: 996413816; totalMemory: 1011351552; maxMemory: 14974713856
> INFO 2019-05-13 13:13:13 MarkDuplicates Reading input file and constructing read end information.
> INFO 2019-05-13 13:13:13 MarkDuplicates Will retain up to 54256209 data points before spilling to disk.
and the program has now been running for 2 hours. I don't know what the problem is: whether the SAM file generated by HISAT2 is faulty, my commands are wrong, or I am missing an error-fixing tool.
I followed the post on Biostars, but I didn't get any clue from there either. Any help in this regard is deeply appreciated. Thank you.
Can you show the trimmomatic command?
@ATpoint, any guess on how to solve this issue?
What are all these fastq files? You typically have one pair in a paired-end experiment.
Two fastq files are paired (-1/-2), and the remaining two are unpaired (-U). This is one of my posts.
I hope this thread (https://www.biostars.org/p/18137/) can help you and anyone else having the same issue. It reports a similar problem, and it worked for me.