Question

CUTADAPT in windows

0

Entering edit mode

9.6 years ago

zizigolu ★ 4.3k

Hello all

I have two fastq files, first set are the footprints, while the second sets are data coming from a common RNA-seq experiment performed in parallel, I want to adapter sequences, using the software CUTADAPT

The adapter sequences are:

GTTCAGAGTTCTACAGTCCGACGATC # 5 prime adapter sequence

ATCTCGTATGCCGTCTTCTGCTTG # 3 prime adapter sequence

May you please tell whether and how I can do this project with windows seven 64 bit?? because I am not familiar enough with linux

Help me please

sequencing • 7.7k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

2

Entering edit mode

Read this documentation properly. https://cutadapt.readthedocs.org/en/stable/

ADD REPLY • link updated 21 months ago by Ram 44k • written 9.6 years ago by Deepak Tanwar ★ 4.2k

Ram · Accepted Answer · 2015-05-05

6

Entering edit mode

9.6 years ago

Brian Bushnell 20k

You can do that in Windows with BBDuk, if you install Java. Assuming you extract to C:\bbmap\, and you have paired reads named read1.fq and read2.fq:

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in=read#.fq out=trimmed#.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe literal=GTTCAGAGTTCTACAGTCCGACGATC,ATCTCGTATGCCGTCTTCTGCTTG

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you

Brian, I downloaded BBMap_34.92.tar.gz and java 8 and installed java,,

where I should type

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in=read#.fq out=trimmed#.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe literal=GTTCAGAGTTCTACAGTCCGACGATC,ATCTCGTATGCCGTCTTCTGCTTG

In command prompt?

You mean first I should extract BBMap_34.92.tar.gz then what is the later step?

Please help me more if possible

When I typed your command in cmd, it told can't find or load main class of jgi BBDukf

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

3

Entering edit mode

Hi Sarah,

First you need to extract BBMap. You can do that with 7-zip (a great program); just right-click and choose 7-zip then "extract here". You need to do this once for the .gz file and then once for the tar file.That will create a directory called bbmap, which you can put in C:.

Then (for simplicity) move your sequence files to a new directory called C:\sequence\.

Then open a command prompt (typically by typing cmd at the bottom of the start menu). Type cd C:\sequence.

Then you type the java command, which should work.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

thanks a lot Brian

ADD REPLY • link 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Sorry Brian

May you tell me please, how I select the minimal length of the remaining fragments and to trim also by the quality of the reads?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

1

Entering edit mode

If you want to quality trim to (for example) Q15, add these flags:

qtrim=r trimq=15

To throw away reads shorter than 20bp after trimming:

minlength=20

These and all other flags are described in C:\bbmap\bbduk.sh, which can be opened with Wordpad (not notepad).

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

3

Entering edit mode

thank you Brian for your patience and kindness

...I am just a beginner and I have to work on NGS because of my thesis

ADD REPLY • link 9.6 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Bachelor or Masters?

ADD REPLY • link 9.6 years ago by Deepak Tanwar ★ 4.2k

1

Entering edit mode

I am shy to tell but I am a phd student but i have not already work in bioinformatics field at all and recently i had to ...

ADD REPLY • link 9.6 years ago by zizigolu ★ 4.3k

2

Entering edit mode

Nothing to shy about :-)

ADD REPLY • link 9.6 years ago by Deepak Tanwar ★ 4.2k

0

Entering edit mode

Brian,

In cmd, I entered:

cd C:\sequencec

(which sequence file was contain of gruseq.fa and raw.qual)

then I typed:

java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\bbmap\current\jgi.BBDukF in=raw.qual out=clean.fq ref=gruseq.fa ktrim=r k=28 mink=12 hdist=1

but it says that:

BBDUK VERSION 34.92
maskmiddle was disabled because useshortkmers=true

My Java is: jre-8u45-windows-x64_2.exe

Sorry Brian, I tried with another option you suggested but the same error

May you please tell me the reason if possible..

Thank you

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

2

Entering edit mode

That's not an error, just a notification. When doing adapter-trimming, maskmiddle is disabled. It's intended for filtering operations, so that's fine.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you

I mean why I can not get some result like which I copied from your thread

Total output: 966786 reads 113303242 bases
Perfectly Correct (% of output): 901541 reads (93.251%) 103689866 bases (91.515%)
Incorrect (% of output): 65245 reads (6.749%) 9613376 bases (8.485%)
Adapters Remaining (% of adapters): 65243 reads (13.973%) 1229480 bases (1.085%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 27 bases (0.000%)

or anything else to present to my supervisor?

for example he asked me: "A quick check you might do would be to control the initial number of entries (reads) included in the "raw data" fastq file and compare it with the number of entries (reads) which are in the final fastq file. The number of entries in the "final" file should be less.

but I don't know where I should search for read counts

in cmd just there is something about java and run time spent

Excuse me...

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Those specific results are from analyzing the output of trimming operations on synthetic data, in which the answer is known. For real data, you don't know the correct answer, so the output will look more like this:

C:\temp\ecoli>java -ea -Xmx1g jgi.BBDukF in=reads_B2_100x100bp_0S_0I_0D_0U_0N.fq ktrim=r k=23 hdist=1 ref=D:\temp\BBTools_public\bbmap\resources\truseq.fa.gz
Executing jgi.BBDukF [in=reads_B2_100x100bp_0S_0I_0D_0U_0N.fq, ktrim=r, k=23, hdist=1, ref=D:\temp\BBTools_public\bbmap\resources\truseq.fa.gz]

BBDuk version 34.93
No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Initial:
Memory: free=494m, used=19m

Added 42113 kmers; time:        0.155 seconds.
Memory: free=469m, used=44m

Input is being processed as unpaired
Processing time:                0.067 seconds.

Input:                          100 reads               10000 bases.
KTrimmed:                       0 reads (0.00%)         0 bases (0.00%)
Result:                         100 reads (100.00%)     10000 bases (100.00%)

Time:                           0.243 seconds.
Reads Processed:         100    0.41k reads/sec
Bases Processed:       10000    0.04m bases/sec

That tells you the number of reads and bases in (input) and the number of reads and bases out (result).

But, your command is wrong:

java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\bbmap\current\ jgi.BBDukF in=raw.qual out=clean.fq ref=gruseq.fa ktrim=r k=28 mink=12 hdist=1

should be something like

java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\bbmap\current\ jgi.BBDukF in=raw.fastq out=clean.fq ktrim=r k=28 mink=12 hdist=1 literal=GTTCAGAGTTCTACAGTCCGACGATC,ATCTCGTATGCCGTCTTCTGCTTG

The input has to be a fastq or fasta file, not a qual file. And ref=gruseq.fa was just used as an example for the synthetic test; in real life you should specify the actual adapter sequences (in your case, with literal=).

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Sorry Brian for disturbance,

I should read your reply many times exactly perhaps I can do whichever my supervisor asked me

Thank you

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Good morning Brian,

My supervisor just has given me the accession number GSE35641 and asked me trim and keep reads having more than 15 bp..

I don't have adapter sequence, how I know what is adapter sequence in my accession number?

Thank you

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

2

Entering edit mode

If your reads are paired, you can find the adapter sequence with BBMerge:

java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\bbmap\current\ in=read#.fq outa=adapter.fa

Otherwise, you can try the different adapter sequences in bbmap\resources; it's most likely to be truseq.fa.gz but it could be nextera.fa.gz.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks a lot

I will try your tips now

ADD REPLY • link 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Brian,

I need your help as already

from where I can download BBMerge? does BBMerge embed in BBMap and no need to be downloaded?

using this command in=reads.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=r trimq=10

I have two clean.fq files now, if I want to align the trimmed reads against a reference genome, which of those files (clean.fq 1 and 2) I should use?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

1

Entering edit mode

BBMerge, BBDuk, and BBMap are all together in the BBMap download.

When aligning paired reads, you should use both files at the same time. Please note that if you run BBDuk with one input and 2 outputs, it will assume the input is interleaved split the input into two files. You can, alternately, just specify one output file to keep it interleaved.

To map these to a reference, you can do so like this with BBMap:

java -Xmx1g -ea -cp C:\Users\yang\Documents\Downloads\bbmap\current\ align2.BBMap in1=clean1.fq in2=clean2.fq out=mapped.sam ref=reference.fasta

That will map the files to the specified reference. However, the amount of memory required depends on the size of the reference file. For the human genome, for example, you would need to change -Xmx1g to -Xmx22g. BBMap generally requires approximately 7 bytes per base pair of reference sequence.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

thanks Brian,

ADD REPLY • link 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Brian,

My supervisor told

For the rRNA filtering part (i.e. mapping the fastq reads on the rRNA sequences and keeping only the reads which do not map): I should type

bowtie2 -x [name of the bowtie2-build indicized file containing the rRNA sequence] --un [name of the fastq file which will contain the UNMAPPED reads] -U [name of the fastq file containing the reads] -S [name of the .sam file that will contain the MAPPED and UNMAPPED reads]

I could not understand his mean about --un option because I don't know which I should type instead of [name of the fastq file which will contain the UNMAPPED reads]

May you please help me although this irrelevant to sequence trimming.

Excuse me

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Did the command I suggested not work?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Sorry Brian,

I have just one read file then I thought it is not paired read, I trimmed my reads based on your previous tips, in the other hand my supervisor newly asked me to align my clean.fq file against reference genome, but I don't know which I should type in [name of the fastq file which will contain the UNMAPPED reads]. in bowtie2 manual I read that

--un <path> write unpaired reads that fail to align to file at <path>. These reads correspond to the SAM records with the FLAGS 0x4 bit set and neither the 0x40 nor 0x80bits set. If --un-gz is specified, output will be gzip compressed. If --un-bz2 or --un-lz4 is specified, output will be bzip2 or lz4 compressed.

but I didn't get which I should type after --un

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Brian,

I have a sam file containing unmapped reads, I am going to convert my sam file to fasta or fastq to be a input in bowtie2 to align on genome.

May you please tell how I can convert sam to fasta or fastq by reformat in BBDuk?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Pretty straightforward:

reformat.sh in=unmapped.sam out=unmapped.fastq

In Windows that would be:

java -ea -Xmx1g -cp path\to\BBMap\current jgi.ReformatReads in=unmapped.sam out=unmapped.fastq

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

2

Entering edit mode

That is it Brian,

You clever boy deserve to be appreciated, your code worked well.

Thank you

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Brian,

Can I use BBDuk to see genome coverage? I tried bedtools to produce bed graph and bed histogram but it doesn't work on windows, also I tried samtools but I just found some weird command. I have a bam file that I want to see genome coverage of.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

You can do that with Pileup, not with BBDuk.

pileup.sh in=mapped.bam covstats=stats.txt hist=histogram.txt basecov=basecov.txt bincov=bincov.txt binsize=1000

basecov will give you the exact coverage at every base location, so that file will be huge if you have a large genome; so you may want to skip that one. bincov gives the coverage binned by every 1000bp (by default) so is a lot smaller and easier to plot in, say, Excel.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you very much Brian,

I will try and will ask question as already.

Thanks again

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Brian,

I typed (I'm in windows)

java -ea -Xmx1g -cp path\to\BBMap\current jgi.pileup in=eg1.bam covstats=stats.txt hist=histogram.txt basecov=basecov.txt bincov=bincov.txt binsize=1000

but it says that fail to find main class pileup

I typed the code properly? Even jgi.pileup?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Ah, sorry, in Windows it would be jgi.CoveragePileup.

And -Xmx1g may or may not be enough depending on the reference. That's fine for a small genome but for a human-sized you will need at least -Xmx8g.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you Brian,

I typed:

java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\BBMap\current jgi.CoveragePileup in=sim_reads_aligned.bam covstats=stats.txt hist=histogram.txt basecov=basecov.txt bincov=bincov.txt binsize=1000

I am working with ecoli

It says that:

set USE_COURAGE_ARRAYS to true
set USE_BITSETS to false
exception in thread "main" java.lang .ASSERTAIONERROR: -128 = ?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Can you please copy and paste the full error, include all the information above it?

Oh - I should mention, that command will not work unless samtools is installed and in your path. If samtools is not installed, Pileup can only read the file if you first convert it to sam format.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Sorry Brian for delay,

I installed samtools already. My bam file is in samtools folder

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
C:\Users\yang>cd C:\Users\yang\Documents\Downloads\samtools
C:\Users\yang\Documents\Downloads\samtools>java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\BBMap\current jgi.CoveragePileup in=eg1.bam covstats=stats.txt hist=histogram.txt basecov=basecov.txt bincov=bincov.txt binsize=1000
Executing jgi.CoveragePileup [in=eg1.bam, covstats=stats.txt, hist=histogram.txt, basecov=basecov.txt, bincov=bincov.txt, binsize=1000]

Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Exception in thread "main" java.lang.AssertionError: Missing field 1: ▼♦     ÿ♠BC☻ ? }?»
        at stream.SamLine.<init>(SamLine.java:472)
        at jgi.CoveragePileup.processSamLine(CoveragePileup.java:675)
        at jgi.CoveragePileup.process(CoveragePileup.java:351)
        at jgi.CoveragePileup.main(CoveragePileup.java:49)

C:\Users\yang\Documents\Downloads\samtools>

and for my another bam file:

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
C:\Users\yang>cd C:\Users\yang\Documents\Downloads\samtools
C:\Users\yang\Documents\Downloads\samtools>java -ea -Xmx1g -cp C:\Users\yang\Documents\Downloads\BBMap\current jgi.CoveragePileup in=sim_reads_aligned.bam covstats=stats.txt hist=histogram.txt basecov=basecov.txt bincov=bincov.txt binsize=1000
Executing jgi.CoveragePileup [in=sim_reads_aligned.bam, covstats=stats.txt, hist=histogram.txt, basecov=basecov.txt, bincov=bincov.txt, binsize=1000]
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Exception in thread "main" java.lang.AssertionError: -128 = ?array=▼♦     ÿ♠ BC☻ ? srôe?b``p?pá♀ó?2?3à♀ö·*?+?/*IMá♫ä♀ö?J?¬144031024?)JM«ñs?70°025Ö3¬áôñ?2±4?°42àpçôt±J?//ÉL5â♀?ƒ3??←é☺?çb♦?&Äx?dX◄??  B?Bú?   ♦     ÿ♠ BC☻ g%?}[?-?uÖ?ó?☻$▲Ü?Fô♥²èN+n♀??Kß?xè?è♫æw@?♫??Ü?9g&?C$?↔#¶ƒ??▬4►@ ?☻?♦?P$Ä♂?Ä♥◄y@AÜ?¶       °►, start=247, stop=249
        at align2.Tools.parseInt(Tools.java:1508)
        at stream.SamLine.<init>(SamLine.java:473)
        at jgi.CoveragePileup.processSamLine(CoveragePileup.java:675)
        at jgi.CoveragePileup.process(CoveragePileup.java:351)
        at jgi.CoveragePileup.main(CoveragePileup.java:49)

C:\Users\yang\Documents\Downloads\samtools>

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Well... it looks like, for whatever reason, Pileup can't find Samtools. I suggest you first convert the bam to sam and then run Pileup on the sam file.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you,

Then I will try with sam.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Yes Brian,

I tried with sam file and worked well.

Thank you

How can I show the result in a graph?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Paste bincov.txt into Excel, and make a scatter plot...

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you very much

Sorry Brian,

Can I use BBDuk to convert sra, wig or txt to fasta or fastq? Even though I could not find them in reformat's accepted formats in your thread in SEQanswer.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Nope, the only thing I know of that will convert sra files is the sra toolkit.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

2

Entering edit mode

Sarah,

Look at these:

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by Deepak Tanwar ★ 4.2k

1

Entering edit mode

Thanks Deepak

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Sorry Brian,

Can I use reformat to convert txt.gz or wig.gz files to fasta or fastq?

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Reformat will only convert reads; the only formats it will accept are fasta, fastq, fasta+qual, scarf, sam, and bam (if samtools is in the path). It will accept gzipped or not gzipped. But there's nothing that can convert a wig file to reads, because the data is no longer there. As for .txt, that's a generic extension that could mean anything, so it depends on the contents, but probably not.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by zizigolu ★ 4.3k