Hi all, I used the following Python script to add a sample identifier (S1R1, S1R2, S2R1, S2R2, etc.) based on the adaptor combination. It added the sample name as I needed, but when I used the output files in Bowtie to get BAM files, the sample identifier at the end doesn't show up. (I used Geneious and IGV to visualize the BAM files.) Can anybody tell me what has gone wrong? My intention is to easily identify which sample each read belongs to. Thanks!!
replace_ids.tsv file:
1:N:0:ATTACTCG+TATAGCCT S1R1
2:N:0:ATTACTCG+TATAGCCT S1R2
1:N:0:TCCGGAGA+TATAGCCT S2R1
2:N:0:TCCGGAGA+TATAGCCT S2R2
1:N:0:CGCTCATT+TATAGCCT S3R1
2:N:0:CGCTCATT+TATAGCCT S3R2
1:N:0:GAGATTCC+TATAGCCT S4R1
2:N:0:GAGATTCC+TATAGCCT S4R2
1:N:0:ATTACTCG+ATAGAGGC S5R1
2:N:0:ATTACTCG+ATAGAGGC S5R2
1:N:0:TCCGGAGA+ATAGAGGC S6R1
2:N:0:TCCGGAGA+ATAGAGGC S6R2
1:N:0:CGCTCATT+ATAGAGGC S7R1
2:N:0:CGCTCATT+ATAGAGGC S7R2
1:N:0:GAGATTCC+ATAGAGGC S8R1
2:N:0:GAGATTCC+ATAGAGGC S8R2
# Dictionary with strings to replace and what to replace them with
replace_strings = {}
with open("replace_ids.tsv", "r") as id_file:
    # Read file line-by-line
    for line in id_file.readlines():
        # Split line on TAB
        ids = line.strip().split("\t")
        # First entry is the original ID
        original_id = ids[0]
        # Second entry is your ID
        my_id = ids[1]
        # Add both to our dictionary of strings to replace
        replace_strings[original_id] = my_id

# Read the FASTQ file with the sequences, here "S8_R2_p.fastq"
with open("S8_R2_p.fastq", "r") as infile:
    # Read each line of file into a list
    content = infile.readlines()

# Keep a list of the lines with the replaced strings
new_content = []
# Loop lines in the file content
for line in content:
    new_line = line
    # Find and replace any original_id with your own ID in the line and add it to our list of replaced lines
    for original_id, my_id in replace_strings.items():
        new_line = new_line.replace(original_id, my_id)
    new_content.append(new_line)

# Write replaced content to a new file, here "outfileS8R2.fastq"
with open("outfileS8R2.fastq", "w") as outfile:
    for line in new_content:
        outfile.write(line)
Hi finswimmer,
Thanks for trying to help. Maybe my question was not clear. Anyway, I don't have different samples in one FASTQ file. I got demultiplexed FASTQ files for each sample, so I didn't have to demultiplex them myself. But when I assembled all the samples together using Bowtie2 and visualized them in either IGV or Geneious, I couldn't identify which sample each read belongs to. The read identifier has only the X and Y coordinates, from which I can't directly identify the sample (@M04503:27:000000000-G2K2K:1:1101:14373:1561). Therefore, I thought I would add the sample number to the end of each read's identifier using the script above. It worked, and now it looks like this, with S1R2 at the end (@M04503:27:000000000-G2K2K:1:1101:14373:1561 S1R2).
My question is: even though I did this, I still can't see the S1R2 part when I visualize the alignment/assembly.
Thanks!!
Hello again,
The cleanest way is still to run the alignment separately for each sample and add a read group containing the sample name. That way, the read group can be used by various tools in further analyses.
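A minimal sketch of what that could look like, assuming bowtie2 is run once per sample with its --rg-id and --rg options. The index name "ref_index", the sample list, and the R1/R2 file naming pattern are hypothetical placeholders, so adjust them to your data. The script only builds and prints the command lines; it does not run bowtie2.

```python
import shlex


def bowtie2_command(index, fastq_r1, fastq_r2, sample, out_sam):
    """Build a bowtie2 command that tags every read with a read group."""
    return [
        "bowtie2",
        "-x", index,              # hypothetical bowtie2 index prefix
        "-1", fastq_r1,           # forward reads of this sample
        "-2", fastq_r2,           # reverse reads of this sample
        "--rg-id", sample,        # read group ID, written to the @RG header and the RG:Z: tag
        "--rg", f"SM:{sample}",   # sample name field of the read group
        "-S", out_sam,            # one SAM output file per sample
    ]


# Hypothetical sample names; the file naming pattern is an assumption.
samples = ["S1", "S2"]
for sample in samples:
    cmd = bowtie2_command(
        "ref_index",
        f"{sample}_R1_p.fastq",
        f"{sample}_R2_p.fastq",
        sample,
        f"{sample}.sam",
    )
    # To actually run the alignment, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Each alignment record then carries the read group of its sample, so the per-sample files can be merged afterwards and the sample should still be recoverable from the read group rather than from the read name.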
I see that there is also a --sam-no-qname-trunc option in bowtie2, which is documented as: "Suppress standard behavior of truncating readname at first whitespace at the expense of generating non-standard SAM". But again: I strongly recommend using read groups to distinguish the samples.
fin swimmer
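To illustrate the truncation described above, here is a small sketch that mimics the default behaviour on the read header from the question. The function name qname_from_header is hypothetical and this is not bowtie2's actual implementation; it only shows why a suffix added after a space does not reach the alignment file.

```python
def qname_from_header(header, truncate=True):
    """Return the read name as it would appear in the SAM QNAME column.

    With truncate=True the name is cut at the first whitespace, which is the
    default behaviour described above; with truncate=False the full header is
    kept, similar in spirit to --sam-no-qname-trunc.
    """
    name = header.rstrip("\n")
    if name.startswith("@"):
        name = name[1:]  # drop the FASTQ header marker
    if truncate:
        return name.split(None, 1)[0]
    return name


header = "@M04503:27:000000000-G2K2K:1:1101:14373:1561 S1R2"
print(qname_from_header(header))                  # sample suffix is lost
print(qname_from_header(header, truncate=False))  # sample suffix is kept
```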
Alrighty, I'll try that. Thanks again!!