Counting reads and bases from a list of fastq files
0
0
Entering edit mode
4.9 years ago
caro-ca ▴ 20

Hi! I trimmed my Illumina short reads, forward and reverse, by using Trimmomatic. The Trimmomatic's outputs were: paired_1 - unpaired_1, and paired_2 - unpaired_2.fastq.gz files. I want to know how big was the impact of trimming by counting the number of reads and bases of each file in my directory. I had made a script to count the number of bases and reads for each file in my directory; however, I have problems in if __name__=='__main__'. When I do the for loop I don't know the order of the files that will be run, how can I make it to call the files by the order I see from the screen? Additionally, I also need help with correcting the script as I don't get any stdout. Thank you in advance for your help.

#!/usr/bin/env python

from sys import argv 
import os 

def get_num_bases(file_content):
    total = []
    for linenumber, line in enumerate(file_content):
        mod=linenumber%4

        if mod==0: 
            ID = line.strip()[1:]
            #print(ID)

        if mod==1: 

            seq = line.strip()
            counting = 0
            counting += seq.count("T")+ seq.count("A") + seq.count("C") + seq.count("G")
            total.append(counting)
            allbases = sum(total)
            print("Number of bases are: " , allbases)

def get_num_reads(file_content):

    total_1 = []

    for line in file_content: 
        num_reads = 0 
        num_reads += content.count(line)  
        total_1.append(num_reads)
        print("Number of reads are: ", sum(total_1)/int(4))


if __name__=='__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    list_files = []
    for file in dir_files:
        if file.endswith("fastq.gz"):
            if file not in list_files:
                file_content = open(file, "r").readlines() 
                list_files.append(file)
                print("This is the filename: ", file, get_num_bases(file_content), get_num_reads(file_content))
python count_reads count_bases • 972 views
ADD COMMENT
1
Entering edit mode

Use seqkit stats. It's easy to get stats on each fastq file. Biopython libraries might be there for fastq stats. Use them.

ADD REPLY

Login before adding your answer.

Traffic: 2035 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6