Strange error when using gzip.open with a VCF file
2.8 years ago
ManuelDB ▴ 110

This code works perfectly:

    import io
    import pandas as pd

    def read_vcf(file_path):
        # Drop the '##' meta lines; the '#CHROM' header line is kept.
        with open(file_path, 'r') as f:
            lines = [l for l in f if not l.startswith('##')]
        return pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
                   'QUAL': str, 'FILTER': str, 'INFO': str},
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

However, after reading about how to open gzip files, I came up with this:

    def read_vcf(file_path):
        with io.TextIOWrapper(gzip.open(file_path, 'r')) as f:
            lines = [l for l in f if not l.startswith('##')]
        return pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
                   'QUAL': str, 'FILTER': str, 'INFO': str},
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

According to this post, io.TextIOWrapper is needed to avoid the error described there, which I also got.

But now I get an error I don't understand:

[Screenshot of the error message, not reproduced here]

I have printed the results of both functions and they look the same.

Why am I getting this error?

vcf gzip pandas • 1.7k views

Hmm, not sure at all, but I never need the TextIOWrapper stuff; I use gzip.open(args.vcf, 'rt') to read the file as text.
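A minimal sketch of what the mode string changes, using only the standard library. The file name and its contents are made up for illustration. It suggests that gzip.open with 'rt' and io.TextIOWrapper around a binary gzip handle should yield the same text lines, while plain 'r' yields bytes:

```python
import gzip
import io
import os
import tempfile

# Write a tiny gzip-compressed, VCF-like file (illustrative content only).
content = "##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\n"
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.vcf.gz")
with gzip.open(path, "wt") as handle:
    handle.write(content)

# For gzip.open, mode 'r' means binary, so each line is a bytes object.
with gzip.open(path, "r") as handle:
    binary_lines = list(handle)

# Mode 'rt' decodes to text, so each line is a str.
with gzip.open(path, "rt") as handle:
    text_lines = list(handle)

# Wrapping the binary handle in io.TextIOWrapper also yields str lines.
with io.TextIOWrapper(gzip.open(path, "r")) as handle:
    wrapped_lines = list(handle)

print(type(binary_lines[0]), type(text_lines[0]), text_lines == wrapped_lines)
```

If the two text variants really are equivalent on the file in question, the mode alone is unlikely to explain the error, and the content of the offending line is worth checking.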


What are the contents of line 2654, please?

2.8 years ago
sbstevenlee ▴ 480

If you are looking for a way to import a VCF file, compressed or uncompressed, into a pandas.DataFrame object, you don't need to reinvent the wheel! Check out the pyvcf submodule I wrote (part of the fuc package):

$ cat example.vcf 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Steven
chr1    100 .   G   A   .   .   .   GT  0/1
chr1    101 .   T   C   .   .   .   GT  0/1
chr1    102 .   A   T   .   .   .   GT  0/1
>>> from fuc import pyvcf
>>> vf = pyvcf.VcfFrame.from_file('example.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1
2  chr1  102  .   A   T    .      .    .     GT    0/1

Thanks for this. The thing is that I am working in the research environment of the 100,000 Genomes Project, and I think (I haven't checked yet) that this module is not available there. If it is, I will try your suggestion.

2.6 years ago
IkramInf ▴ 20

Use mode 'rt' instead of 'r' in gzip.open(file_path, 'r'). Hope it will be helpful for you.
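A rough sketch of how that suggestion might look in practice, using only the standard library. The helper name vcf_text_buffer is hypothetical, and the gzip check by magic bytes is an added assumption so that the same function handles compressed and uncompressed files:

```python
import gzip
import io


def vcf_text_buffer(file_path):
    """Return an io.StringIO holding the VCF without its '##' meta lines.

    Hypothetical helper, not part of the original post.
    """
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(file_path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"

    # 'rt' makes gzip.open yield str lines, just like the builtin open.
    opener = gzip.open if is_gzip else open
    with opener(file_path, "rt") as handle:
        lines = [line for line in handle if not line.startswith("##")]
    return io.StringIO("".join(lines))
```

The returned buffer can then be passed to pd.read_csv with sep='\t' and the same dtype mapping as in the question.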
