2.2 years ago
Phoebe Magdy
I have some sorted BAM files and want to mark duplicate reads using MarkDuplicates from Picard tools.
All files are in a directory named AlignmentOfTrimmed_Sam_Files.
The full path to these files is defined below, and it is my current working directory.
I have run this command several times, with minor changes each time. Each run takes about an hour, but I have never been able to find the output files.
Any suggestions?
Thanks in advance.
import os

### Path of the directory where sorted bam files are located:
samfiles_dir = '/media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/'

### Loop over sorted bam files and mark duplicates using picard tools
### (the "!" line is Jupyter/IPython shell syntax; picard_path is assumed to be defined earlier)
for file in os.listdir(samfiles_dir):
    if file.endswith('sorted.bam'):
        inputfile = os.path.join(samfiles_dir, file)
        # e.g. S000021_S5424Nr_7_sorted.bam -> S000021_S5424Nr_7
        fileBasename = '_'.join(os.path.basename(file).rsplit('_', 4)[0:3])
        !java -Xmx20g -jar {picard_path}/picard.jar MarkDuplicates --INPUT {inputfile} \
        --OUTPUT {fileBasename}.markdup.bam \
        --METRICS_FILE {fileBasename}.metrics.txt
Here is part of the output:
MarkDuplicates starts at 2022-09-18 16:07:52.296874
16:07:53.413 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/phmagdy/miniconda3/envs/Jhm/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Sep 18 16:07:53 EET 2022] MarkDuplicates --INPUT /media/phmagdy/TOSHIBA_EXT/PhD_Data_Analysis/group3/AlignmentOfTrimmed_Sam_Files/S000021_S5424Nr_7_sorted.bam --OUTPUT S000021_S5424Nr_7.markdup.bam --METRICS_FILE S000021_S5424Nr_7.metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sun Sep 18 16:07:53 EET 2022] Executing as phmagdy@ubuntu on Linux 5.15.0-46-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO 2022-09-18 16:07:53 MarkDuplicates Start of doWork freeMemory: 208248760; totalMemory: 221249536; maxMemory: 19088801792
INFO 2022-09-18 16:07:53 MarkDuplicates Reading input file and constructing read end information.
INFO 2022-09-18 16:07:53 MarkDuplicates Will retain up to 69162325 data points before spilling to disk.
INFO 2022-09-18 16:08:00 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:06s. Time for last 1,000,000: 6s. Last read position: chr1:16,264,133
INFO 2022-09-18 16:08:00 MarkDuplicates Tracking 3899 as yet unmatched pairs. 422 records in RAM.
INFO 2022-09-18 16:08:05 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:11s. Time for last
You probably did not write the Python code yourself, otherwise you would be familiar with this. Code above is using
Does your account have permission to write to the same directory the input files are in? If not, you should change that option to a directory where you can write files.
In addition, the error is probably at the end of the log file (rather than at the start, which you posted above). Check the last 25 lines and show us the error, if there is one.
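For what it is worth, here is a minimal sketch of the two checks suggested above: write the outputs to an explicit absolute directory that you know is writable, then look at the last 25 lines of the log. It uses `subprocess` instead of the notebook `!` syntax. The function names (`build_markdup_command`, `tail`, `run_and_check`) are made up for illustration, and the `_sorted.bam` suffix is an assumption based on the file name shown in the log. Picard normally logs to stderr, which is why that stream is printed.

```python
import os
import subprocess


def build_markdup_command(picard_jar, input_bam, out_dir):
    """Return (command, output_bam, metrics_file) using absolute output paths."""
    out_dir = os.path.abspath(out_dir)
    name = os.path.basename(input_bam)
    # Assumption: inputs are named <sample>_sorted.bam, as in the log above.
    suffix = "_sorted.bam"
    base = name[: -len(suffix)] if name.endswith(suffix) else os.path.splitext(name)[0]
    output_bam = os.path.join(out_dir, base + ".markdup.bam")
    metrics_file = os.path.join(out_dir, base + ".metrics.txt")
    command = [
        "java", "-Xmx20g", "-jar", picard_jar, "MarkDuplicates",
        "--INPUT", input_bam,
        "--OUTPUT", output_bam,
        "--METRICS_FILE", metrics_file,
    ]
    return command, output_bam, metrics_file


def tail(text, n=25):
    """Return the last n lines of a block of text."""
    return "\n".join(text.splitlines()[-n:])


def run_and_check(command, expected_files):
    """Run the command, print the end of its log, and list missing outputs."""
    out_dir = os.path.dirname(expected_files[0])
    if not os.access(out_dir, os.W_OK):
        raise PermissionError(f"Cannot write to {out_dir}")
    result = subprocess.run(command, capture_output=True, text=True)
    print("exit code:", result.returncode)
    print(tail(result.stderr))
    missing = [path for path in expected_files if not os.path.exists(path)]
    return result.returncode, missing
```

You would call `build_markdup_command(...)` once per BAM file and pass the command plus `[output_bam, metrics_file]` to `run_and_check`. A non-zero exit code or a non-empty `missing` list tells you straight away that something went wrong, without searching the disk for the files afterwards.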
Actually, I tried creating another folder named MarkDup inside the above directory to direct the output files to. Here is what the code looked like:
I was still not able to find the output files.
N.B. There was no error at the end of the execution after almost one hour. Here are the last few lines:
That is odd. If you can create that directory, then you can write files to it. Did the metrics.txt file also not get created? Can you add --VERBOSITY DEBUG and run again to see if we get more detail in the log?
The created folder where the output was meant to go is completely empty after the execution. Neither the .markdup.bam files nor the metrics.txt files were created.
I also tried adding -VERBOSITY DEBUG and I think nothing different from before happened. Here is part of the output:
Are your files name or query sorted?
Yes, each BAM file has its sorted version and the index beside it. Here is a screenshot of what the files look like:
They are coordinate sorted
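If it helps to confirm the sort order rather than infer it from the file names, the `SO:` field on the `@HD` header line records it. Below is a small sketch that reads that field from header text, such as the output of `samtools view -H file.bam`. The function name `sort_order_from_header` and the example header are illustrative only and are not taken from the files in this thread.

```python
def sort_order_from_header(header_text):
    """Return the SO value from the @HD line of a SAM/BAM header, or None."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Illustrative header text, not from the poster's files.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422"
print(sort_order_from_header(example_header))
```

A result of `coordinate` means the file is coordinate sorted, and `queryname` means it is sorted by read name. `None` means the header has no `@HD` line or no `SO:` field.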
Hi Phoebe, today I ran into almost the same issue: it runs without any error but produces no output. May I ask whether you have resolved this issue yet? Thanks!