Question

Renaming reads in fastq with names of another fastq

8.5 years ago

CAnna ▴ 20

Hi,

I have a fastq file in which the read names are correct but the sequences are truncated (file1). I have a second fastq file in which the sequences are full length, but its read names do not match the names in my read 2 fastq file, which is a problem for downstream analysis (fastqToSam).

So I need to replace the names in file2 with the read names from file1. Is there any way to do this?

I know it sounds twisted, but I am using the recent UMI-tools for a specific downstream analysis, and the tool removes the UMI from the sequence and appends it to the read name.
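To make that transformation concrete, here is a minimal Python sketch of what "cut the UMI and append it to the read name" does to a single record. It is not UMI-tools' own code. The function name `move_umi_to_name`, the `_` separator, and the UMI position (bases 7-16, taken from the example reads posted further down in this thread) are assumptions for illustration, and the quality string is a placeholder.

```python
def move_umi_to_name(header, seq, qual, start=6, end=16):
    """Cut seq[start:end] (0-based, end-exclusive) out of a read and
    append it to the read ID, leaving the header comment untouched.

    Hypothetical helper for illustration; bases 7-16 (1-based) are assumed
    to hold the UMI, as in the example reads from this thread.
    """
    read_id, _, comment = header.partition(" ")
    umi = seq[start:end]
    new_header = f"{read_id}_{umi}"
    if comment:
        new_header += " " + comment
    # Remove the UMI bases and their qualities from the read itself.
    return new_header, seq[:start] + seq[end:], qual[:start] + qual[end:]


header, seq, qual = move_umi_to_name(
    "@HISEQ:229:C81CCANXX:1:1101:3574:18567 1:N:0:",
    "TGAATCGCGAGTGGTCGGCA",
    "IIIIIIIIIIIIIIIIIIII",  # placeholder qualities, not the real data
)
print(header)  # @HISEQ:229:C81CCANXX:1:1101:3574:18567_GCGAGTGGTC 1:N:0:
print(seq)     # TGAATCGGCA
```

This is why the processed read 1 is 10 bases shorter than the original, and why its name no longer matches the original file.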

Thank you, CAnna

RNA-Seq • 2.7k views

So the number of records is the same in both files? Do you really need to replace them, or would some simple formatting do? Could you please show the first few sequences of both files?

Reply • 8.5 years ago by Biomonika (Noolean) 3.2k

Yes, there is the same number of sequences in both files. The thing is that I am using UMI-tools, which cuts the UMI and appends it to the read names in both the read 1 and read 2 files (I need this for my analysis).

Then I want to use my ORIGINAL read 1 file (with the entire sequence, not truncated) to make a BAM file with both read 1 and read 2 (for subsequent preprocessing). The problem is that the ORIGINAL read 1 file does not have the UMI appended to the read names, while the read 2 file does. So what I need is for the names in the original read 1 file to be replaced by the corresponding names from the read 2 file; otherwise I cannot make this BAM file, since the reads are not recognised as being paired.

Below are the heads of the original read 1 file, the transformed read 1 file, and the transformed read 2 file.

READ 1 ORIGINAL
@HISEQ:229:C81CCANXX:1:1101:3574:18567 1:N:0:
TGAATCGCGAGTGGTCGGCA
+
<>@B@G####
@HISEQ:229:C81CCANXX:1:1101:19291:58450 1:N:0:
TTAACCTGACTATTCCACTG
+
BBCBBGGGGG
@HISEQ:229:C81CCANXX:1:1101:21124:77971 1:N:0:
TTAAACTTCTTAGACGAATC

READ 1 AFTER UMI tools (UMI, bases 7-16, cut and appended to read name)
@HISEQ:229:C81CCANXX:1:1101:3574:18567_GCGAGTGGTC 1:N:0:
TGAATCGGCA
+
<>@B@G####
@HISEQ:229:C81CCANXX:1:1101:19291:58450_TGACTATTCC 1:N:0:
TTAACCACTG
+
BBCBBGGGGG
@HISEQ:229:C81CCANXX:1:1101:21124:77971_TTCTTAGACG 1:N:0:
TTAAACAATC

READ 2 AFTER UMI tools
@HISEQ:229:C81CCANXX:1:1101:3574:18567_GCGAGTGGTC 3:N:0:
CTCTTGCGCTTGTTCGGTTTCCGCCTGCTGCGACTAAAGAGATTCA
+
3<<:A11=/=/===1////=F1//=E0=1==/>E/9111111:>F0
@HISEQ:229:C81CCANXX:1:1101:19291:58450_TGACTATTCC 3:N:0:
ATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTG
+
3ABBAFGFFGGGGFGGGG/=CFGGGGGGGG=EFGEF:FGGFG>CGG
@HISEQ:229:C81CCANXX:1:1101:21124:77971_TTCTTAGACG 3:N:0:
TACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGA

Thank you, Camille

Reply • 8.5 years ago by CAnna ▴ 20
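One possible way to do the renaming described above is to walk both files in parallel and copy each read ID from the read 2 file onto the matching record of the original read 1 file. The following is a minimal Python sketch, not a tested pipeline step. It assumes uncompressed FASTQ with exactly four lines per record, the same record order in both files, and `_` as the separator between the original read ID and the UMI (as in the headers shown above). The function names `read_fastq` and `rename_reads` are made up for this example.

```python
import itertools


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:  # end of file (or a trailing blank line)
            return
        yield tuple(lines)


def rename_reads(full_path, names_path, out_path):
    """Write the records of full_path to out_path, using the read IDs found
    in names_path. Returns the number of records written."""
    count = 0
    with open(full_path) as full, open(names_path) as names, \
            open(out_path, "w") as out:
        pairs = itertools.zip_longest(read_fastq(full), read_fastq(names))
        for full_rec, name_rec in pairs:
            if full_rec is None or name_rec is None:
                raise ValueError("files contain different numbers of records")
            header, seq, plus, qual = full_rec
            old_id, _, comment = header.partition(" ")
            new_id = name_rec[0].split(" ", 1)[0]
            # Sanity check: the new ID minus its "_UMI" suffix must be the old ID.
            if new_id.rsplit("_", 1)[0] != old_id:
                raise ValueError(f"read order differs: {old_id} vs {new_id}")
            # Keep the comment of the full-length read (e.g. "1:N:0:").
            new_header = f"{new_id} {comment}" if comment else new_id
            out.write("\n".join((new_header, seq, plus, qual)) + "\n")
            count += 1
    return count
```

A call would look like `rename_reads("read1_original.fastq", "read2_processed.fastq", "read1_renamed.fastq")`, where the file names are placeholders. The sanity check stops at the first pair whose IDs disagree, so a difference in read order is reported as an error and does not silently produce mismatched pairs.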