5.0 years ago
Dave Th
Hi all,
I'm trying to split my contigs dataset into different files by length, such as 500 bp, 1 kb, 2 kb... I'm using the code below to produce my output.
from Bio import SeqIO

def contigs_filter_by_length(fasta_input, size, fasta_output):
    long_contigs = []  # Create an empty list
    for record in SeqIO.parse(fasta_input, "fasta"):
        if len(record.seq) >= size:  # Keep contigs of at least `size` bp
            long_contigs.append(record)
    print("Found %i contigs" % len(long_contigs))
    SeqIO.write(long_contigs, fasta_output, "fasta")
The problem is that when I cross-checked the output from the code against the QUAST report for my input file, there was a big difference between them. QUAST reported 119787 contigs >= 500 bp, while the FASTA output from the code contained 122046 contigs >= 500 bp.
Is there anything wrong in my code that would lead to this difference?
I don't see anything wrong in your code. Have you compared the results directly? You could pick out some contigs that your Python code reports but QUAST does not, to see what causes the difference.
I think this might be the key.
QUAST may be doing some additional filtering of 'junk' sequences that are obvious misassembly artefacts, or some deduplication.
Not 100% certain, but that would be my immediate guess.
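One way to follow up on the suggestions above is to count the contigs under a few different definitions and see whether any of them lands closer to the QUAST figure. The sketch below uses only the standard library; `read_fasta` and `summarize` are hypothetical helper names, and the two alternative counts (unique sequences, length excluding N) are only guesses at what a tool might do, not a description of how QUAST actually counts.

```python
from collections import Counter

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def summarize(path, size):
    """Count contigs >= size under a few different (assumed) definitions."""
    records = [(h, s) for h, s in read_fasta(path) if len(s) >= size]
    # Exact duplicate sequences collapse to a single key here.
    unique = Counter(s.upper() for _, s in records)
    # Length after ignoring N bases (an assumption, for comparison only).
    no_n = sum(1 for _, s in records if len(s) - s.upper().count("N") >= size)
    return {
        "raw": len(records),
        "unique_sequences": len(unique),
        "excluding_N": no_n,
    }
```

Calling `summarize("contigs.fasta", 500)` (the file name is a placeholder) returns the three counts side by side. If "raw" matches the 122046 from the Biopython script, the script is counting correctly and the gap comes from how the other tool defines or filters contigs.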
What does "SeqIO.parse" stand for? (I'm trying to understand the command.) I'm trying to filter contigs, so this code could help me.
That is the standard SeqIO interface included in Biopython (LINK).
Hello Dave, I'm trying to use your code to filter some contigs, but I got an indentation error message:
so I suppose that I must add something in the double brackets?
Regards :)
IndentationError: expected an indented block
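That error means Python found a statement such as def, for, or if without an indented body. The code in the first post was posted with incorrect indentation levels for Python, so you should not copy it verbatim.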
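To make the error concrete, here is a minimal sketch of what triggers it. The two source strings and the `compiles` helper are made up for illustration; the only difference between the strings is whether the function body is indented.

```python
# Same function twice: once with the body flush left, once indented.
flat_source = "def add_one(x):\nreturn x + 1\n"
indented_source = "def add_one(x):\n    return x + 1\n"

def compiles(source):
    """Return True if Python can compile the source, False on IndentationError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError as err:
        # The exact wording of the message varies between Python versions.
        print("IndentationError:", err.msg)
        return False

print(compiles(flat_source))
print(compiles(indented_source))
```

The flat version fails and the indented one compiles. So there is nothing to add inside the brackets: each line under a def, for, or if needs to be indented one level further than the line that opens the block.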