Question

Strange error when using gzip.open with a VCF file

2.8 years ago

ManuelDB ▴ 110

This code works perfectly:

    import io
    import pandas as pd

    def read_vcf(file_path):
        # Drop the '##' meta lines; the '#CHROM' header line is kept.
        with open(file_path, 'r') as f:
            lines = [l for l in f if not l.startswith('##')]
        return pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
                   'QUAL': str, 'FILTER': str, 'INFO': str},
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

However, after reading up on how to open gzip files, I came up with this:

    import gzip
    import io
    import pandas as pd

    def read_vcf(file_path):
        # gzip.open(..., 'r') opens in binary mode; TextIOWrapper decodes it to text.
        with io.TextIOWrapper(gzip.open(file_path, 'r')) as f:
            lines = [l for l in f if not l.startswith('##')]
        return pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
                   'QUAL': str, 'FILTER': str, 'INFO': str},
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

According to this post, io.TextIOWrapper is needed to avoid the error described there, which I also got.

But now I get an error I don't understand:

[Image: screenshot of the error message]

I have printed the output of both versions of the function, and the results look the same.

Why does this error occur?

vcf gzip pandas • 1.7k views

updated 2.6 years ago by IkramInf ▴ 20 • written 2.8 years ago by ManuelDB ▴ 110

Hmm, not sure at all, but I never need the TextIOWrapper stuff; I just use gzip.open(args.vcf, 'rt') to read the file as text.
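
A minimal sketch of what the mode changes, assuming a small made-up file (demo.vcf.gz, written only for this demonstration). With gzip.open, 'r' is binary mode, so lines come back as bytes; 'rt' returns str, which should be equivalent to wrapping the binary handle in io.TextIOWrapper:

```python
import gzip
import io
import os
import tempfile

# Hypothetical two-line file, written only for this demonstration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.vcf.gz")
with gzip.open(path, "wt") as f:
    f.write("##fileformat=VCFv4.2\n#CHROM\tPOS\n")

# 'r' is binary mode for gzip.open (same as 'rb'): lines are bytes.
with gzip.open(path, "r") as f:
    binary_lines = list(f)

# 'rt' is text mode: lines are str.
with gzip.open(path, "rt") as f:
    text_lines = list(f)

# Wrapping the binary handle in TextIOWrapper also yields str lines.
with io.TextIOWrapper(gzip.open(path, "r")) as f:
    wrapped_lines = list(f)

print(type(binary_lines[0]).__name__)  # bytes
print(type(text_lines[0]).__name__)    # str
print(text_lines == wrapped_lines)     # True
```

If that holds for the real file too, the two ways of opening it give pandas the same text, so the error may come from the file contents rather than from how it is opened.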

2.8 years ago by WouterDeCoster 47k

What are the contents of line 2654, please?
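
One way to pull out a single line of a gzipped file without loading all of it is sketched below. The helper name get_line and its skip_meta flag are made up for illustration. Since the screenshot is not available here, it is an assumption that the number refers to a line of the text handed to the parser; in that case the '##' lines have already been dropped and skip_meta=True would be the relevant setting.

```python
import gzip
from itertools import islice

def get_line(path, lineno, skip_meta=False):
    """Return line `lineno` (1-based) of a gzipped text file, or None if it is too short.

    Hypothetical helper. With skip_meta=True, lines starting with '##' are
    not counted, mirroring the filtering done in read_vcf.
    """
    with gzip.open(path, "rt") as f:
        lines = (l for l in f if not (skip_meta and l.startswith("##")))
        # islice skips ahead lazily; next() returns None past the end.
        return next(islice(lines, lineno - 1, lineno), None)
```

Printing repr(get_line(path, 2654)) and the length of its split on '\t' should show whether that line has an unexpected number of tab-separated fields or stray characters.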

2.8 years ago by LauferVA 4.5k

2.6 years ago

IkramInf ▴ 20

Use mode 'rt' instead of 'r' in gzip.open(file_path, 'r'). With gzip.open, 'r' means binary mode (the same as 'rb'), so the file yields bytes; 'rt' opens it as text. I hope this helps.
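
A rough sketch of how this could look in practice, using only the standard library. The helper name vcf_body is made up for illustration, and detecting compression from the first two bytes (gzip streams start with 0x1f 0x8b) is a convenience chosen here, not something taken from the question:

```python
import gzip
import io

def vcf_body(file_path):
    """Return an io.StringIO holding the VCF text without the '##' meta lines.

    Hypothetical helper: accepts both plain and gzip-compressed files.
    """
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(file_path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    # 'rt' makes gzip.open return str lines, so startswith('##') works.
    with opener(file_path, "rt") as f:
        return io.StringIO("".join(l for l in f if not l.startswith("##")))
```

The returned buffer can then be passed to pd.read_csv with sep='\t' and the same dtype mapping as in the question.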

score 2 · Accepted Answer · 2022-02-10

If you are looking for a way to import a VCF file, compressed or uncompressed, into a pandas.DataFrame object, you don't need to reinvent the wheel! Check out the pyvcf submodule I wrote:

$ cat example.vcf 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Steven
chr1    100 .   G   A   .   .   .   GT  0/1
chr1    101 .   T   C   .   .   .   GT  0/1
chr1    102 .   A   T   .   .   .   GT  0/1

>>> from fuc import pyvcf
>>> vf = pyvcf.VcfFrame.from_file('example.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr1  101  .   T   C    .      .    .     GT    0/1
2  chr1  102  .   A   T    .      .    .     GT    0/1