cutadapt summary statistics after every run
2
0
Entering edit mode
4.9 years ago
bnayer26 • 0

Hi, I am new to analyzing RNA seq data and have just started running cutadapt to trim my adapter sequences from my paired end data. If I run one line of code at a time, I can see a very nice summary statistics as is given in this page of the cutadapt documentation (https://cutadapt.readthedocs.io/en/stable/guide.html#cutadapt-s-output). It looks something like this:

    This is cutadapt 2.7 with Python 3.5.2
Command line parameters: --cores=14 -q 10,10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --info-file=summary1.txt -o EBSP3_L3_1_trimmed.fq.gz -p EBSP3_L3_2_trimmed.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_1.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_2.fq.gz
Processing reads on 14 cores in paired-end mode ...
[    8=------] 00:34:05    31,431,443 reads  @     65.1 µs/read;   0.92 M reads/minute
Finished in 2045.05 s (65 us/read; 0.92 M reads/minute).

=== Summary ===

Total read pairs processed:         31,431,443
  Read 1 with adapter:                 866,710 (2.8%)
  Read 2 with adapter:                 839,966 (2.7%)
Pairs written (passing filters):    31,431,443 (100.0%)

Total basepairs processed: 9,429,432,900 bp
  Read 1: 4,714,716,450 bp
  Read 2: 4,714,716,450 bp
Quality-trimmed:               3,496,279 bp (0.0%)
  Read 1:     1,212,666 bp
  Read 2:     2,283,613 bp
Total written (filtered):  9,420,283,738 bp (99.9%)
  Read 1: 4,710,662,353 bp
  Read 2: 4,709,621,385 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 866710 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

Bases preceding removed adapters:
  A: 29.7%
  C: 30.1%
  G: 27.9%
  T: 12.3%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   697484  491116.3    0   697484
4   129061  122779.1    0   129061
5   33700   30694.8 0   33700
6   2812    7673.7  0   2812
7   982 1918.4  0   982
8   144 479.6   0   144

But I wanted to run the code in a loop as part of a script, and I wanted to save these summary statistics information for each trimming into a text file where I can view it later.

I see that the documentation mentions this:

Format of the info file When the --info-file command-line parameter is given, detailed information about the found adapters is written to the given file. The output is a tab-separated text file. Each line corresponds to one read of the input file (unless –times is used, see below). A row is written for all reads, even those that are discarded from the final output FASTA/FASTQ due to filtering options (such as --minimum-length).

But when I tried --info-file=summary.txt, it generated a 12 GB file that has multiple lines of the following, which is very different from the kind of output I used to get shown above:

E00552:427:H7CVVCCX2:3:1101:5071:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      ATCCTGTGAGTGTGTGTATACAGATATTATAGAAATGCTTTTAGGCATCTTTGAAACCAAGCCTATGTGTGAATAGTTTGTGAAAGAGATGGCAAACTCGGATGGGGGAATACCCAAGGGTCATGGGTTTTATGTGTCGCTTTGGTGTA   AAFFJJJJJJJJFJFJJJJFJJJJJJJJJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJFJ<JJJJJJJ<A-<<AJJJFFAJFJJJJJFFA-A-JF7A<<J-7<AFJ7-A<F<J7AJF7AJ7---<FA-A--<)7--<<-<AA
E00552:427:H7CVVCCX2:3:1101:5294:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      AAGGGGATAGGTGAGTGGCACGCCAAAGCCATGTGGGGGTCAGGGTACCAGAACTAGGCCCTGGTGCAGGAATACATCAGGCAGGAAGGGGGTGACAAAGGGGTCTGTGGGCTGCATCCATCTGGTTCTTGACTGTCGCTTATACACAT   AAFFJJFJJJJAJJJFJJJ<AJJJJJJJJJJFJFAJJ<F<JJJJJFJJJFFJJJJFAJJFJJJJFJF-AJJJJJJJJ-JF<A<AFF<FAJ<-7<-A<-F<AA-7AAJ-7FAFJJJFAJ-7-<-F---7<FFJJ<AAA)A<FFAJ<AFJF
E00552:427:H7CVVCCX2:3:1101:5538:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      TCTTTCTCTAAAGAGCTGCTTCTAATCTCCTGCTGGAGGCCTCCTTGAGCTCTCAGACATGGCTTCTGCCAACCCAACCTGCTGCTGAGTAAGCCAAAGACTCTCCTTCCTGTAGAAACCAGCAAGGAGCGCCCTGCCCCTGTCCTTTT   AAAFJJFJJJJJJJJJJJJ<FJJJJJJAJJJJJJFJJ-AJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJJJJJJJFA-FJ-FJAFJJF7FFFJJFFAA-7F-A7AJF7F<7A<FFFFJJFJFJ<A7)-7F<<)7-7A)-77A7AFA7<-

Can someone please help me understand how to get the summary statistics as I see them in my first output, into a summary text file so I can view it later for all my samples?

Thank you so much.

RNA-Seq adapter trimming cutadapt • 6.3k views
ADD COMMENT
1
Entering edit mode
4.9 years ago
ATpoint 85k

This is printed to stderr so you can capture it basically like:

for i in files*
  do
  cutadapt (...) 2> summary_for_${i}
  done

The 2> does the trick. Once you have this you could summarize all reports into a nice html report using multiqc.

ADD COMMENT
0
Entering edit mode

Hi thanks for your answer! I will give it a try...so you mean I should add this 2> at the very end of my cutadapt code right? And do I need to give any file extension? assuming my summary file name is summary1. So I just write 2> summary1 is that it? or should I give some extension to the filename?

Thanks again!

ADD REPLY
0
Entering edit mode

Yes, add it as you say. File names are optional, it will simply be a plain text document, I typically use txt but be aware that the name is unique in every iteration otherwise it will be overwritten. To append to the same file use 2>>.

ADD REPLY
2
Entering edit mode

I have tried this method and it writes a blank file. -Found answer. I had specified the -o argument, so the summary was not going to standard error, it was going to standard output. Using 1> filename instead of 2>filename therefore worked.

ADD REPLY
0
Entering edit mode

Please show code, anecdotal descriptions are difficult to debug.

ADD REPLY
0
Entering edit mode

Okay thank you so much I will give it a try!

ADD REPLY
0
Entering edit mode

Hi I'm having more or less the same issue. I'm trying to get all the details from the summary once cutadapt had removed all the pair-end primers. My script looks like this:

cutadapt<-"/share/apps/anaconda/envs/R360/bin/cutadapt"

system2(cutadapt, args = "--version")

path.cut.Man <- file.path(pathMan, "cutadapt") if(!dir.exists(path.cut.Man)) dir.create(path.cut.Man) fnFs.cut.Man <- file.path(path.cut.Man, basename(fnFsman)) fnRs.cut.Man <- file.path(path.cut.Man, basename(fnRsman))

FWD.RC <- dada2:::rc(FWD1) REV.RC <- dada2:::rc(REV)

Trim FWD and the reverse-complement of REV off of R1 (forward reads)

R1.flags <- paste("-g", FWD1, "-a", REV.RC)

Trim REV and the reverse-complement of FWD off of R2 (reverse reads)

R2.flags <- paste("-G", REV, "-A", FWD.RC)

Run Cutadapt

for(i in seq_along(fnFsman)) { system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
"-o 1> report.txt", fnFs.cut.Man[i], "-p 1> report.txt", fnRs.cut.Man[i],
fnFs.filtN.Man[i], fnRs.filtN.Man[i])) }

[ 8<------] 00:00:13 174,110 reads @ 75.6 µs/read; 0.79 M reads/minute [ 8=---------] 00:00:04 63,102 reads @ 76.2 µs/read; 0.79 M reads/minute [ 8<-----] 00:00:15 205,940 reads @ 75.8 µs/read; 0.79 M reads/minute [ 8<--------] 00:00:06 85,910 reads @ 74.6 µs/read; 0.80 M reads/minute [ 8=---------] 00:00:05 66,637 reads @ 76.4 µs/read; 0.79 M reads/minute [ 8<--------] 00:00:06 83,522 reads @ 76.0 µs/read; 0.79 M reads/minute [ 8=--------] 00:00:07 100,576 reads @ 73.1 µs/read; 0.82 M reads/minute [ 8<--------] 00:00:06 84,447 reads @ 73.5 µs/read; 0.82 M reads/minute [ 8<--------] 00:00:06 81,623 reads @ 73.8 µs/read; 0.81 M reads/minute [ 8<---------] 00:00:04 58,366 reads @ 77.6 µs/read; 0.77 M reads/minute [ 8=---------] 00:00:05 79,614 reads @ 73.5 µs/read; 0.82 M reads/minute [ 8<-------] 00:00:10 139,749 reads @ 76.5 µs/read; 0.78 M reads/minute [8=----------] 00:00:01 25,220 reads @ 77.9 µs/read; 0.77 M reads/minute [ 8<--------] 00:00:06 87,436 reads @ 73.7 µs/read; 0.81 M reads/minute [ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute

It works but the arguments for the txt output files or "report.txt" only take in to account the last summary data ( the one for :[ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute) and I would like to have them all.

I would be very grateful if someone could make some observations/ corrections for this issue.

ADD REPLY
0
Entering edit mode
2.5 years ago

summary statistics can be saved by

cutadapt command > summary_statistics.txt

stderr can be saved by

cutadapt command 2> stderr.txt
ADD COMMENT

Login before adding your answer.

Traffic: 2040 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6