sortmerna not counting the EOF REV file, not working on paired end reads
1
0
Entering edit mode
9 months ago
jway • 0

I am working with paired-end fastq.gz files from RNA-seq, each compressed file is around 1 GB. I used Cutadapt 4.6 to trim the Illumina universal adapters from my sample using the command cutadapt -q 20 -a AGATCGGAAGAG -A AGATCGGAAGAG -o R196-11.out.1.fastq.gz -p R196-11.out.2.fastq.gz R1 96-11_S39_L002_R1_001.fastq.gz R196-11_S39_L002_R2_001.fastq.gz and got the below summary statistics:

Total read pairs processed:         17,584,226
  Read 1 with adapter:              13,027,493 (74.1%)
  Read 2 with adapter:              12,007,537 (68.3%)
Pairs written (passing filters):    17,584,226 (100.0%)

Total basepairs processed: 5,310,436,252 bp
  Read 1: 2,655,218,126 bp
  Read 2: 2,655,218,126 bp
Quality-trimmed:              89,487,524 bp (1.7%)
  Read 1:    64,207,700 bp
  Read 2:    25,279,824 bp
Total written (filtered):  4,162,879,762 bp (78.4%)
  Read 1: 2,002,617,872 bp
  Read 2: 2,160,261,890 bp

I then tried to use sortmerna on the output files to filter out the rRNA, and got the below trace. The program didn't finish running. I previously ran sortmerna successfully on some smaller test samples (raw FASTQ files that did not have their adapters trimmed), and was able to get a total read count for both EOF FWD and EOF REV. What could be causing sortmerna to not get a read count for EOF REV, and for it to display "EOF FWD reached. Total reads: 1" during the alignment step?

(sortmerna_env) root@WayLT2210:~# sortmerna --ref rRNA_databases_v4/smr_v4.3_default_db.fasta --ref rRNA_databases_v4/smr_v4.3_fast_db.fasta --ref rRNA_databases_v4/smr_v4.3_sensitive_db.fasta --ref rRNA_databases_v4/smr_v4.3_sensitive_db_rfam_seeds.fasta --reads R196-11.out.1.fastq.gz --reads R196-11.out.2.fastq.gz --paired_out --fastx --threads 8

[process:1393] === Options processing starts ... ===

Found value: sortmerna

Found flag: --ref

Found value: rRNA_databases_v4/smr_v4.3_default_db.fasta of previous flag: --ref

Found flag: --ref

Found value: rRNA_databases_v4/smr_v4.3_fast_db.fasta of previous flag: --ref

Found flag: --ref

Found value: rRNA_databases_v4/smr_v4.3_sensitive_db.fasta of previous flag: --ref

Found flag: --ref

Found value: rRNA_databases_v4/smr_v4.3_sensitive_db_rfam_seeds.fasta of previous flag: --ref

Found flag: --reads

Found value: R196-11.out.1.fastq.gz of previous flag: --reads

Found flag: --reads

Found value: R196-11.out.2.fastq.gz of previous flag: --reads

Found flag: --paired_out

Previous flag: --paired_out is Boolean. Setting to True

Found flag: --fastx

Previous flag: --fastx is Boolean. Setting to True

Found flag: --threads

Found value: 8 of previous flag: --threads

[process:1483] Processing option: fastx with value:

[process:1483] Processing option: paired_out with value:

[process:1483] Processing option: reads with value: R196-11.out.1.fastq.gz

[opt_reads:98] Processing reads file [1] out of total [2] files

[process:1483] Processing option: reads with value: R196-11.out.2.fastq.gz

[opt_reads:98] Processing reads file [2] out of total [2] files

[process:1483] Processing option: ref with value: rRNA_databases_v4/smr_v4.3_default_db.fasta

[opt_ref:158] Processing reference [1] out of total [4] references

[opt_ref:206] File "/root/rRNA_databases_v4/smr_v4.3_default_db.fasta" exists and is readable

[process:1483] Processing option: ref with value: rRNA_databases_v4/smr_v4.3_fast_db.fasta

[opt_ref:158] Processing reference [2] out of total [4] references

[opt_ref:206] File "/root/rRNA_databases_v4/smr_v4.3_fast_db.fasta" exists and is readable

[process:1483] Processing option: ref with value: rRNA_databases_v4/smr_v4.3_sensitive_db.fasta

[opt_ref:158] Processing reference [3] out of total [4] references

[opt_ref:206] File "/root/rRNA_databases_v4/smr_v4.3_sensitive_db.fasta" exists and is readable

[process:1483] Processing option: ref with value: rRNA_databases_v4/smr_v4.3_sensitive_db_rfam_seeds.fasta

[opt_ref:158] Processing reference [4] out of total [4] references

[opt_ref:206] File "/root/rRNA_databases_v4/smr_v4.3_sensitive_db_rfam_seeds.fasta" exists and is readable

[process:1483] Processing option: threads with value: 8

[process:1503] === Options processing done ===

[process:1504] Alignment type: [best:1 num_alignments:1 min_lis:2 seeds:2]

[validate_kvdbdir:1242] 'workdir' option was not provided. Using USERDIR to set the working directory: ""

[validate_kvdbdir:1248] Key-value DB location "/root/sortmerna/run/kvdb"

[validate_kvdbdir:1284] Creating KVDB directory: "/root/sortmerna/run/kvdb"

[validate_idxdir:1214] Using index directory: "/root/sortmerna/run/idx"

[validate_idxdir:1230] IDX directory: "/root/sortmerna/run/idx" exists and is not empty

[validate_readb_dir:1306] Using split reads directory : "/root/sortmerna/run/readb"

[validate_readb_dir:1322] split reads directory : "/root/sortmerna/run/readb" exists and is not empty

[main:62] Running command:

sortmerna --ref rRNA_databases_v4/smr_v4.3_default_db.fasta --ref rRNA_databases_v4/smr_v4.3_fast_db.fasta --ref rRNA_databases_v4/smr_v4.3_sensitive_db.fasta --ref rRNA_databases_v4/smr_v4.3_sensitive_db_rfam_seeds.fasta --reads R196-11.out.1.fastq.gz --reads R196-11.out.2.fastq.gz --paired_out --fastx --threads 8

[Index:102] Found 16 non-empty index files. Skipping indexing.

[init:108] Readfeed init started

[define_format:881] file: "R196-11.out.1.fastq.gz" is FASTQ gzipped

[define_format:881] file: "R196-11.out.2.fastq.gz" is FASTQ gzipped

[count_reads:915] started count  ...

[next:322] EOF FWD reached. Total reads: 17033568

[count_reads:945] done count. Elapsed time: 88.6886 sec. Total reads: 34067136

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_0.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_0.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_1.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_1.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_2.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_2.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_3.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_3.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_4.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_4.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_5.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_5.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_6.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_6.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/fwd_7.fq.gz

[init_split_files:967] added file: /root/sortmerna/run/readb/rev_7.fq.gz

[is_split_ready:726] found existing readfeed descriptor /root/sortmerna/run/readb/readfeed

[split:605] start splitting. Using number of splits equals number of processing threads: 8

[clean:1098] found descriptor /root/sortmerna/run/readb/readfeed

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_0.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_0.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_1.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_1.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_2.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_2.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_3.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_3.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_4.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_4.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_5.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_5.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_6.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_6.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/fwd_7.fq.gz

[clean:1142] removing split file: /root/sortmerna/run/readb/rev_7.fq.gz

[next:322] EOF FWD reached. Total reads: 17033568

[split:717] Done splitting. Reads count: 34067136 Runtime sec: 1043.51



[init:135] Readfeed init done in sec [1132.2]

[store_to_db:292] Stored Reads statistics to DB:

all_reads_count= 34067136 all_reads_len= 4098461419 min_read_len= 1 max_read_len= 151 total_aligned= 0 total_aligned_id= 0 total_aligned_cov= 0 total_aligned_id_cov= 0 total_denovo= 0 num_short= 0 reads_matched_per_db= TODO is_stats_calc= 0 is_total_reads_mapped_cov= 0

[align:143] ==== Starting alignment ====

[align:146] Number of cores: 12

[align:163] Using number of Processor threads: 8

[Refstats:60] Index Statistics calculation starts ... done in: 1.14442 sec

[align:185] Loading index: 0 part: 1/1 Memory KB: 19 ...

[align:190] done in [6.53606] sec Memory KB: 3197

[align:193] Loading references ...

[align:197] done in [0.525161] sec. Memory KB: 3351

[align2:70] Processor 1 thread 140642311198464 started

[align2:70] Processor 2 thread 140642319591168 started

[align2:70] Processor 0 thread 140642302805760 started

[align2:70] Processor 4 thread 140642327983872 started

[align2:70] Processor 6 thread 140641682933504 started

[align2:70] Processor 7 thread 140641674540800 started

[align2:70] Processor 5 thread 140641691326208 started

[align2:70] Processor 3 thread 140642764175104 started

[next:455] EOF FWD reached. Total reads: 1

[next:455] EOF FWD reached. Total reads: 1

[next:455] EOF FWD reached. Total reads: 1

[next:455] EOF FWD reached. Total reads: 1

[next:455] EOF REV reached. Total reads: 1

[align2:133] Processor 4 thread 140642327983872 done. Processed 0 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 0 Runtime sec: 23.7849

[align2:133] Processor 2 thread 140642319591168 done. Processed 0 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 0 Runtime sec: 23.869

[align2:133] Processor 3 thread 140642764175104 done. Processed 0 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 0 Runtime sec: 24.0053

[align2:133] Processor 7 thread 140641674540800 done. Processed 0 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 0 Runtime sec: 24.0421

[align2:133] Processor 1 thread 140642311198464 done. Processed 1 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 1 Runtime sec: 24.2443
sortmerna cutadapt • 808 views
ADD COMMENT
0
Entering edit mode

Can you try to validate that you have non-corrupt fastq files by using one of the tools mentioned here: Checking fastq is valid

ADD REPLY
0
Entering edit mode

I downloaded fq lint and ran fq lint R196-11.out.1.fastq.gz R196-11.out.2.fastq.gz, and got the following:

2024-03-01T23:15:11.373488Z  INFO fq::commands::lint: fq-lint start

2024-03-01T23:15:11.373543Z  INFO fq::commands::lint: validating paired end reads

2024-03-01T23:15:11.373566Z  INFO fq::validators: disabled validators: []

2024-03-01T23:15:11.373569Z  INFO fq::validators: enabled single read validators: ["[S003] NameValidator", "[S004] CompleteValidator", "[S002] AlphabetValidator", "[S001] PlusLineValidator", "[S005] ConsistentSeqQualValidator", "[S006] QualityStringValidator"]

2024-03-01T23:15:11.373577Z  INFO fq::validators: enabled paired read validators: ["[P001] NamesValidator"]

2024-03-01T23:15:11.373584Z  INFO fq::commands::lint: enabled special validators: ["[S007] DuplicateNameValidator"]

2024-03-01T23:15:11.373585Z  INFO fq::commands::lint: starting validation (pass 1)
R196-11.out.1.fastq.gz:22:1: [S004] CompleteValidator: empty sequence
ADD REPLY
0
Entering edit mode
8 months ago
GenoMax 147k

Looks like one of the reads was completely trimmed but the 0 length fastq record was left in place. I am not a cutadapt user but you should look for and add an option to make sure reads remain a certain length (say 25 bp, smaller reads are not useful for alignments) and discard those that fail the criteria.

With bbduk.sh you would normally do this with minlength=25 option. If one read fails this criterial then it and its mate will be removed from the files.

ADD COMMENT
0
Entering edit mode

Thanks for the help, I ran cutadapt -q 15 -a AGATCGGAAGAG -A AGATCGGAAGAG -m 35 -o R196-11.out.1.fastq.gz -p R196-11.out.2.fastq .gz R196-11_S39_L002_R1_001.fastq.gz R196-11_S39_L002_R2_001.fastq.gz to remove reads less than 35 bp long and I think it worked, I got the following after running fq lint again:

 (base) root@WayLT2210:~# fq lint R196-11.out.1.fastq.gz R196-11.out.2.fastq.gz

2024-03-04T18:32:01.759423Z  INFO fq::commands::lint: fq-lint start

2024-03-04T18:32:01.759676Z  INFO fq::commands::lint: validating paired end reads

2024-03-04T18:32:01.759725Z  INFO fq::validators: disabled validators: []

2024-03-04T18:32:01.759731Z  INFO fq::validators: enabled single read validators: ["[S003] NameValidator", "[S004] CompleteValidator", "[S002] AlphabetValidator", "[S001] PlusLineValidator", "[S005] ConsistentSeqQualValidator", "[S006] QualityStringValidator"]

2024-03-04T18:32:01.759738Z  INFO fq::validators: enabled paired read validators: ["[P001] NamesValidator"]

2024-03-04T18:32:01.759751Z  INFO fq::commands::lint: enabled special validators: ["[S007] DuplicateNameValidator"]

2024-03-04T18:32:01.759767Z  INFO fq::commands::lint: starting validation (pass 1)

2024-03-04T18:32:34.244239Z  INFO fq::commands::lint: read 16394279 * 2 records

2024-03-04T18:32:34.244285Z  INFO fq::commands::lint: starting validation (pass 2)

2024-03-04T18:32:44.033220Z  INFO fq::commands::lint: read 16394279 records

2024-03-04T18:32:44.035599Z  INFO fq::commands::lint: fq-lint end
ADD REPLY

Login before adding your answer.

Traffic: 1978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6