Hi, I am new to analyzing RNA seq data and have just started running cutadapt to trim my adapter sequences from my paired end data. If I run one line of code at a time, I can see a very nice summary statistics as is given in this page of the cutadapt documentation (https://cutadapt.readthedocs.io/en/stable/guide.html#cutadapt-s-output). It looks something like this:
This is cutadapt 2.7 with Python 3.5.2
Command line parameters: --cores=14 -q 10,10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --info-file=summary1.txt -o EBSP3_L3_1_trimmed.fq.gz -p EBSP3_L3_2_trimmed.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_1.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_2.fq.gz
Processing reads on 14 cores in paired-end mode ...
[ 8=------] 00:34:05 31,431,443 reads @ 65.1 µs/read; 0.92 M reads/minute
Finished in 2045.05 s (65 us/read; 0.92 M reads/minute).
=== Summary ===
Total read pairs processed: 31,431,443
Read 1 with adapter: 866,710 (2.8%)
Read 2 with adapter: 839,966 (2.7%)
Pairs written (passing filters): 31,431,443 (100.0%)
Total basepairs processed: 9,429,432,900 bp
Read 1: 4,714,716,450 bp
Read 2: 4,714,716,450 bp
Quality-trimmed: 3,496,279 bp (0.0%)
Read 1: 1,212,666 bp
Read 2: 2,283,613 bp
Total written (filtered): 9,420,283,738 bp (99.9%)
Read 1: 4,710,662,353 bp
Read 2: 4,709,621,385 bp
=== First read: Adapter 1 ===
Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 866710 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3
Bases preceding removed adapters:
A: 29.7%
C: 30.1%
G: 27.9%
T: 12.3%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 697484 491116.3 0 697484
4 129061 122779.1 0 129061
5 33700 30694.8 0 33700
6 2812 7673.7 0 2812
7 982 1918.4 0 982
8 144 479.6 0 144
But I wanted to run the code in a loop as part of a script, and I wanted to save these summary statistics information for each trimming into a text file where I can view it later.
I see that the documentation mentions this:
Format of the info file When the --info-file command-line parameter is given, detailed information about the found adapters is written to the given file. The output is a tab-separated text file. Each line corresponds to one read of the input file (unless –times is used, see below). A row is written for all reads, even those that are discarded from the final output FASTA/FASTQ due to filtering options (such as --minimum-length).
But when I tried --info-file=summary.txt, it generated a 12 GB file that has multiple lines of the following, which is very different from the kind of output I used to get shown above:
E00552:427:H7CVVCCX2:3:1101:5071:1379 1:N:0:NAGAGAGG+NACTCCTT -1 ATCCTGTGAGTGTGTGTATACAGATATTATAGAAATGCTTTTAGGCATCTTTGAAACCAAGCCTATGTGTGAATAGTTTGTGAAAGAGATGGCAAACTCGGATGGGGGAATACCCAAGGGTCATGGGTTTTATGTGTCGCTTTGGTGTA AAFFJJJJJJJJFJFJJJJFJJJJJJJJJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJFJ<JJJJJJJ<A-<<AJJJFFAJFJJJJJFFA-A-JF7A<<J-7<AFJ7-A<F<J7AJF7AJ7---<FA-A--<)7--<<-<AA
E00552:427:H7CVVCCX2:3:1101:5294:1379 1:N:0:NAGAGAGG+NACTCCTT -1 AAGGGGATAGGTGAGTGGCACGCCAAAGCCATGTGGGGGTCAGGGTACCAGAACTAGGCCCTGGTGCAGGAATACATCAGGCAGGAAGGGGGTGACAAAGGGGTCTGTGGGCTGCATCCATCTGGTTCTTGACTGTCGCTTATACACAT AAFFJJFJJJJAJJJFJJJ<AJJJJJJJJJJFJFAJJ<F<JJJJJFJJJFFJJJJFAJJFJJJJFJF-AJJJJJJJJ-JF<A<AFF<FAJ<-7<-A<-F<AA-7AAJ-7FAFJJJFAJ-7-<-F---7<FFJJ<AAA)A<FFAJ<AFJF
E00552:427:H7CVVCCX2:3:1101:5538:1379 1:N:0:NAGAGAGG+NACTCCTT -1 TCTTTCTCTAAAGAGCTGCTTCTAATCTCCTGCTGGAGGCCTCCTTGAGCTCTCAGACATGGCTTCTGCCAACCCAACCTGCTGCTGAGTAAGCCAAAGACTCTCCTTCCTGTAGAAACCAGCAAGGAGCGCCCTGCCCCTGTCCTTTT AAAFJJFJJJJJJJJJJJJ<FJJJJJJAJJJJJJFJJ-AJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJJJJJJJFA-FJ-FJAFJJF7FFFJJFFAA-7F-A7AJF7F<7A<FFFFJJFJFJ<A7)-7F<<)7-7A)-77A7AFA7<-
Can someone please help me understand how to get the summary statistics as I see them in my first output, into a summary text file so I can view it later for all my samples?
Thank you so much.
Hi thanks for your answer! I will give it a try...so you mean I should add this 2> at the very end of my cutadapt code right? And do I need to give any file extension? assuming my summary file name is summary1. So I just write
2> summary1
is that it? or should I give some extension to the filename?Thanks again!
Yes, add it as you say. File names are optional, it will simply be a plain text document, I typically use
txt
but be aware that the name is unique in every iteration otherwise it will be overwritten. To append to the same file use2>>
.I have tried this method and it writes a blank file. -Found answer. I had specified the -o argument, so the summary was not going to standard error, it was going to standard output. Using 1> filename instead of 2>filename therefore worked.
Please show code, anecdotal descriptions are difficult to debug.
Okay thank you so much I will give it a try!
Hi I'm having more or less the same issue. I'm trying to get all the details from the summary once cutadapt had removed all the pair-end primers. My script looks like this:
cutadapt<-"/share/apps/anaconda/envs/R360/bin/cutadapt"
system2(cutadapt, args = "--version")
path.cut.Man <- file.path(pathMan, "cutadapt") if(!dir.exists(path.cut.Man)) dir.create(path.cut.Man) fnFs.cut.Man <- file.path(path.cut.Man, basename(fnFsman)) fnRs.cut.Man <- file.path(path.cut.Man, basename(fnRsman))
FWD.RC <- dada2:::rc(FWD1) REV.RC <- dada2:::rc(REV)
Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD1, "-a", REV.RC)
Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
Run Cutadapt
for(i in seq_along(fnFsman)) { system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
"-o 1> report.txt", fnFs.cut.Man[i], "-p 1> report.txt", fnRs.cut.Man[i],
fnFs.filtN.Man[i], fnRs.filtN.Man[i])) }
[ 8<------] 00:00:13 174,110 reads @ 75.6 µs/read; 0.79 M reads/minute [ 8=---------] 00:00:04 63,102 reads @ 76.2 µs/read; 0.79 M reads/minute [ 8<-----] 00:00:15 205,940 reads @ 75.8 µs/read; 0.79 M reads/minute [ 8<--------] 00:00:06 85,910 reads @ 74.6 µs/read; 0.80 M reads/minute [ 8=---------] 00:00:05 66,637 reads @ 76.4 µs/read; 0.79 M reads/minute [ 8<--------] 00:00:06 83,522 reads @ 76.0 µs/read; 0.79 M reads/minute [ 8=--------] 00:00:07 100,576 reads @ 73.1 µs/read; 0.82 M reads/minute [ 8<--------] 00:00:06 84,447 reads @ 73.5 µs/read; 0.82 M reads/minute [ 8<--------] 00:00:06 81,623 reads @ 73.8 µs/read; 0.81 M reads/minute [ 8<---------] 00:00:04 58,366 reads @ 77.6 µs/read; 0.77 M reads/minute [ 8=---------] 00:00:05 79,614 reads @ 73.5 µs/read; 0.82 M reads/minute [ 8<-------] 00:00:10 139,749 reads @ 76.5 µs/read; 0.78 M reads/minute [8=----------] 00:00:01 25,220 reads @ 77.9 µs/read; 0.77 M reads/minute [ 8<--------] 00:00:06 87,436 reads @ 73.7 µs/read; 0.81 M reads/minute [ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute
It works but the arguments for the txt output files or "report.txt" only take in to account the last summary data ( the one for :[ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute) and I would like to have them all.
I would be very grateful if someone could make some observations/ corrections for this issue.