Hi!
I am trying to run a transposable element (TE) detection software called McClintock on paired-end Illumina reads. For this step, I wrote a Python script that loops over the files in my directory and selects the matching pair of reads. However, I am dealing with a large amount of data, so I compressed all of it. Once the right pair is selected, I want the script to unzip both files and continue with the McClintock bash command. Once that has finished, I want to compress the files again and repeat the process for the remaining paired-end reads. This is my Python script so far (not complete); I have to admit that I don't know how to get the unzipped file into a variable. I don't want to use gunzip -c (--stdout, --to-stdout) because, as I said, I don't have enough storage to keep the original files unchanged and create new ones.
#!/usr/bin/env python
import os
import subprocess

if __name__ == '__main__':
    path = "/hosts/linuxhome/chaperone/silviav/reads/Gallone/Trimmed_files"
    dir_files = os.listdir(path)
    pair_reads = {}
    for file in sorted(dir_files):
        if file.endswith("_paired_R1.fastq.gz"):
            file1 = file
        if file.endswith("_paired_R2.fastq.gz"):
            file2 = file
            pair_reads[file1] = file2
    for key, value in pair_reads.items():
        cmd_key = "gunzip {}".format(key)
        unzipped_key = subprocess.check_output(cmd_key, shell=True)
        cmd_value = "gunzip {}".format(value)
        unzipped_value = subprocess.check_output(cmd_value, shell=True)
        code = "bash ~/mcclintock/mcclintock.sh -r ~/mcclintock/test/sacCer2.fasta -c ~/mcclintock/test/sac_cer_TE_seqs.fasta -g ~/mcclintock/test/reference_TE_locations.gff -t ~/mcclintock/test/sac_cer_te_families.tsv -1 {} -2 {} -p 36".format(unzipped_key, unzipped_value)
        cmd = subprocess.check_output(code, shell=True)
        print("EXIT STATUS AND TYPE", cmd)
Thank you in advance.
Are you sure looking for TEs in your reads is the best option? Can the software not take assemblies?
I also second the other comments that there is not really any reason to use bash AND python here; one or the other should be able to handle all the steps (if you count shelling out in python). If you really wanted to do this, you could create a python script which accepts STDIN as the data stream and then decompress the data somewhat on the fly in bash...
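For the in-place route described in the question, here is a minimal sketch rather than a tested pipeline. Two things seem to trip up the original script: gunzip prints nothing to stdout, so check_output returns an empty string instead of a file name, and os.listdir returns bare names without the directory. The unzipped file is simply the same path without the .gz ending. The helper names below (find_pairs, decompress_in_place, compress_in_place, process_pair, mcclintock_command) are made up for illustration, and the McClintock arguments are copied from the command line in the question, not checked against the tool.

```python
import gzip
import os
import shutil
import subprocess

R1_SUFFIX = "_paired_R1.fastq.gz"
R2_SUFFIX = "_paired_R2.fastq.gz"


def find_pairs(directory):
    """Return (R1, R2) full paths for samples where both compressed files exist."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(R1_SUFFIX):
            mate = name[: -len(R1_SUFFIX)] + R2_SUFFIX
            r1 = os.path.join(directory, name)
            r2 = os.path.join(directory, mate)
            if os.path.exists(r2):
                pairs.append((r1, r2))
    return pairs


def decompress_in_place(gz_path):
    """Like `gunzip file.gz`: write file, delete file.gz, return the new path."""
    plain_path = gz_path[:-3]  # strip the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
    return plain_path


def compress_in_place(plain_path):
    """Like `gzip file`: write file.gz, delete file, return the new path."""
    gz_path = plain_path + ".gz"
    with open(plain_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(plain_path)
    return gz_path


def process_pair(r1_gz, r2_gz, build_command):
    """Decompress one pair, run the command on it, then recompress the pair."""
    unzipped = []
    try:
        for gz_path in (r1_gz, r2_gz):
            unzipped.append(decompress_in_place(gz_path))
        # check=True raises CalledProcessError if the command exits non-zero
        subprocess.run(build_command(*unzipped), check=True)
    finally:
        # Recompress even if the command failed, so disk usage stays low
        for plain_path in unzipped:
            if os.path.exists(plain_path):
                compress_in_place(plain_path)


def mcclintock_command(r1, r2):
    """Argument list mirroring the command in the question (paths assumed)."""
    home = os.path.expanduser("~/mcclintock")
    return ["bash", f"{home}/mcclintock.sh",
            "-r", f"{home}/test/sacCer2.fasta",
            "-c", f"{home}/test/sac_cer_TE_seqs.fasta",
            "-g", f"{home}/test/reference_TE_locations.gff",
            "-t", f"{home}/test/sac_cer_te_families.tsv",
            "-1", r1, "-2", r2, "-p", "36"]
```

To use it, loop over find_pairs(path) and call process_pair(r1, r2, mcclintock_command) for each pair, so only one pair is uncompressed at any time. Python's gzip module is used here to keep the sketch self-contained; calling gunzip and gzip through subprocess.run(["gunzip", path], check=True) should behave the same way and may be faster on large files.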
Perfect, I will have a look at STDIN (sys.stdin), as I am not familiar with it. Additionally, the software looks for TEs in FASTQ paired-end sequencing reads, not in assemblies. Thank you for your help.
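In case it helps with sys.stdin: it is just a file-like object holding whatever was piped into the script. The sketch below counts FASTQ records from a stream, assuming four lines per record; count_fastq_records and the script name count_reads.py are made up for illustration. Note that piping the output of gunzip -c into another program does not write an uncompressed copy to disk, but this only helps for steps that can read from a pipe. Judging by the command in the question, McClintock takes file paths through -1 and -2, so it may not be able to read its input this way.

```python
import sys


def count_fastq_records(stream):
    """Count FASTQ records in a text stream, assuming four lines per record."""
    n_lines = sum(1 for _ in stream)
    return n_lines // 4


def main():
    # sys.stdin yields the lines piped in from the shell, one at a time,
    # so the whole file never has to be held in memory or written to disk
    print(count_fastq_records(sys.stdin))
```

With main() called at the bottom of a hypothetical count_reads.py, it could be run as: gunzip -c sample_paired_R1.fastq.gz | python count_reads.py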
You have asked many questions on this forum without accepting any answer (e.g. C: Use of export command; C: Software testing process failure: How can I match the already installed programs; etc.). Please accept the correct answers (green mark on the left) to validate and close the questions.
I cannot find the green mark on the left. What does it look like?
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Perfect! Thank you for clarifying.
Instead of wrapping bash in a python script, how about using bash only?