Adding suffix /1 & /2 to PE data - ABySS input data
9.2 years ago

Hi Abyssers,

I have run Trimmomatic on my PE data, which generates R1 & R2 files whose read names do not have any suffixes. It is not obvious to me whether I must add /1 & /2 to each read name, or whether simply telling ABySS that the reads are pairs using pe='r1.fastq r2.fastq' is enough for it to recognise the pairs and carry out the assembly correctly.

/SB

abyss assembly • 5.6k views
9.2 years ago

Alternatively, with BBMap:

reformat.sh in=r1.fq in2=r2.fq out1=renamed1.fq out2=renamed2.fq addslash

This did not work for me because my read names were of the type

@HISEQ:149:C76YNACXX:3:1101:1159:2191 1:N:0:GCCAAT

and BBMap appended the /1 after the comment field, so the header looked like this

@HISEQ:149:C76YNACXX:3:1101:1159:2191 1:N:0:GCCAAT /1

instead of producing

@HISEQ:149:C76YNACXX:3:1101:1159:2191/1

For me, the solution (thanks @Salim!) was to use sed. The leading space in each pattern below is intentional, and the barcode in the pattern (ATCACG here) has to match the one in your own headers.

sed -e 's, 1:N:0:ATCACG,/1,g' reads_1.fq > corrected1.fq

sed -e 's, 2:N:0:ATCACG,/2,g' reads_2.fq > corrected2.fq
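
If you would rather not hard-code the barcode, here is a minimal sketch of the same idea with awk. The function name add_pair_suffix is made up for illustration. It assumes plain four-line FASTQ records and headers that do not already end in /1 or /2. It keeps only the first word of each header line and appends the mate number, so the comment field is dropped whatever the barcode is.

```shell
# Hypothetical helper (not part of ABySS or BBMap): rewrite each FASTQ header
# as "<first word>/<mate>", dropping the " 1:N:0:BARCODE" style comment.
# Assumes strict four-line records, so only lines 1, 5, 9, ... are headers.
add_pair_suffix() {
    # $1 = mate number (1 or 2); FASTQ is read from stdin and written to stdout
    awk -v mate="$1" 'NR % 4 == 1 { print $1 "/" mate; next } { print }'
}
```

Usage would be something like add_pair_suffix 1 < reads_1.fq > corrected1.fq, and the same with 2 for the second file.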

9.2 years ago
SES 8.6k

Yes, I believe that ABySS requires the pair information to be present (either 1/2, forward/reverse or A/B); the files may be separate or interleaved. You can add the pair information back with Pairfq. Here is an example (requires curl and perl):

curl -sL git.io/pairfq_lite | perl - addinfo -i R1.fq -o R1_info.fq -p 1
curl -sL git.io/pairfq_lite | perl - addinfo -i R2.fq -o R2_info.fq -p 2

That should go pretty fast and the input can be fasta or fastq (compressed is fine also I believe).

9.2 years ago
Adrian Pelin ★ 2.6k

The solutions posted so far should work well. I just remembered an old blog post with a one-liner that converts the new Illumina naming scheme to the old one:

cat new-style_.fastq | awk '{if (NR % 4 == 1) {split($1, arr, ":"); printf "%s_%s:%s:%s:%s:%s#0/%s (%s)\n", arr[1], arr[3], arr[4], arr[5], arr[6], arr[7], substr($2, 1, 1), $0} else if (NR % 4 == 3){print "+"} else {print $0} }' > old-style.fastq

https://contig.wordpress.com/2011/09/01/newbler-input-iii-a-quick-fix-for-the-new-illumina-fastq-header/

It's kinda nice since you really are not relying on any other tools, just bash and good ol' awk. I think this will only work if you do have the new header (something like 1:N:0 and 2:N:0), it may not if you have no info about pairs in your header.

9.2 years ago

Thank you lads. I went ahead and assumed that the suffixes are important. Thank you for confirming that.

I used BBMap's reformatter script for this.

I have a related question, though. I ran abyss-2fastq on some data I was analysing a week ago, and it added /1 and /2 not at the very end of the header but before the trailing comment field, as in the example below. Is this recognised by ABySS?

@HISEQ:149:C76YNACXX:3:1101:1159:2191/1 1:N:0:GCCAAT
--------------------------------------^
ATAATTAAAGCAGGAATAGTAAAAAAACGTCCCTTAAAACGTATCAAGAAATCCGACCCAGACTGGGATTACGCAACCTGCGACGGCCCGTTGTGCCTGCG
+
BBBFFFFFFFFFFIBFIFIIIIIIIIIIIIIIIFIFFIIIFFFIBFIIIIIIFFFIFFFFFFFFFFFBBFFFFBFFFFFFBBFFFFFFF<BFFBFBBFFFF
@HISEQ:149:C76YNACXX:3:1101:1159:2191/2 2:N:0:GCCAAT
--------------------------------------^
AACCTTGCGACGACCTGAAGGACGGACCGTCGCAGGCACAACGGGCCGTCGCAGGTTGCGTAATCCCAGTCTGGGTCGGATTTCTTGATACGTTTTAAGGG
+
BBBFFFFFFFFFFFIIIIFFIFIIIFFFFIFFFFFFFFFFFFFBFBBFFF7<77B<BB<BBBBFFFFBBBFBFFF<BBF7BBBBFFB<BBFBBFFF<<BBF

hard to tell, but these can be easily removed....

sed -i 's, 1:N:0:GCCAAT,,g' file_r1.fastq

Hi Salim,

ABySS treats the first whitespace-separated word in the line as the read ID, so there is no need to remove the 1:N:0:GCCAAT or 2:N:0:GCCAAT. Everything after the first space is considered to be a comment/description.
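
To see what that means for a given header, here is a small sketch that mimics the rule described above. The helper name read_id is hypothetical, and this is only an illustration with awk, not ABySS's own parser. It prints the first whitespace-separated word of a header line, minus the leading "@" record marker.

```shell
# Hypothetical helper: print the read ID of one FASTQ header line, taken as
# the first whitespace-separated word with the leading "@" removed.
read_id() {
    printf '%s\n' "$1" | awk '{ sub(/^@/, "", $1); print $1 }'
}

# The comment field after the space is ignored:
read_id '@HISEQ:149:C76YNACXX:3:1101:1159:2191/1 1:N:0:GCCAAT'
# prints HISEQ:149:C76YNACXX:3:1101:1159:2191/1
```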

9.2 years ago

Hello again,

I received a reply from Ben Vandervalk, who is one of the authors of ABySS. It reads as follows:

pe="r1.fastq r2.fastq" should suffice.

ABySS requires that either:

(i) the read names for both reads are identical, OR (ii) the read names have an identical prefix, followed by "/1" and "/2", respectively.

- Ben
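
For anyone who wants to check their files against those two rules before starting an assembly, here is a rough sketch. The function count_bad_pairs is a made-up name and not an ABySS tool. It assumes uncompressed four-line FASTQ, reads in the same order in both files, and no tab characters inside the headers. It compares the first word of each pair of headers and counts the pairs that satisfy neither (i) nor (ii).

```shell
# Hypothetical pre-flight check (not part of ABySS): count read pairs whose
# names are neither identical nor an identical prefix followed by /1 and /2.
count_bad_pairs() {
    # $1 = forward reads file, $2 = reverse reads file
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {
            split($1, fwd, " "); split($2, rev, " ")   # first word = read name
            a = fwd[1]; b = rev[1]
            same = (a == b)                            # rule (i)
            p1 = a; p2 = b
            # rule (ii): strip /1 and /2, then the prefixes must match
            slash = (sub(/\/1$/, "", p1) && sub(/\/2$/, "", p2) && p1 == p2)
            if (!same && !slash) bad++
        }
        END { print bad + 0 }
    '
}
```

Running count_bad_pairs r1.fastq r2.fastq prints 0 when every pair passes one of the two rules.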
