Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expected unicode, got quoted_name #810

Open
TrentGlover opened this issue Sep 28, 2022 · 8 comments
Open

Expected unicode, got quoted_name #810

TrentGlover opened this issue Sep 28, 2022 · 8 comments

Comments

@TrentGlover
Copy link

TrentGlover commented Sep 28, 2022

Error returned when calling parquetobj.table_to_parquet(parq_file_name, row['schema_name'], row['table_name'] , sql, db_engine, batch_size)

error is

Traceback (most recent call last):
  File "C:\Users\a093789\AppData\Roaming\Python\Python310\site-packages\fastparquet\writer.py", line 1499, in write_thrift
    return f.write(obj.to_bytes())
TypeError: Expected unicode, got quoted_name
TypeError: Expected unicode, got quoted_name
Exception ignored in: 'fastparquet.cencoding.write_list'

Code:

from tempfile import NamedTemporaryFile
import pyodbc, fastparquet
from datetime import date
import pandas as pd
from sqlalchemy import Table, create_engine, schema

if name == 'main':

sender = '[email protected]'
receivers = ['[email protected]', '[email protected]']
port = 25

constr = "DRIVER={SQL Server Native Client 11.0};SERVER=bi-prtr;DATABASE=archive;Trusted_Connection=yes;"

parqfilepath = 'C:\\appl\Python\Data\Parquet\\'
parqfilename = 'raw-cna-mwia-archive-archive_{}-interval_{}_{}_dumpedon_{}.parquet'
parqerrorfilepath = '\\\\cake\\DWBackup\EAP-Archive\\data\\parquet\\'
batch_size = 1000000

tempfile = NamedTemporaryFile(mode='w', delete=False, newline='', dir='./')
number_of_rows = 0

continue_processing: bool = True
todays_date = date.today()

line_count = 1
from_date = date.today()
to_date = date.today()

try:
    conn = pyodbc.connect(constr)
except:
    print("{c} is not working".format(c=constr))

cursor = conn.cursor()
db_engine = create_engine('mssql+pyodbc:///?odbc_connect=' + constr)
sql = 'select * from archive.Ibs.PHOALD WITH (NOLOCK) Where Convert(date, Transaction_time) > Convert(date, \'2019-05-20\') and Convert(date, Transaction_time) <= Convert(date, \'2019-06-20\')'
print(f"{sql}")

parq_file_name = 'testfile.parquet'

db_metadata = schema.MetaData(bind=db_engine, schema='IBS')

db_table = Table('PHOALD', db_metadata, autoload=True)

column_dict = dict()
for column in db_table.columns:

     column_dict[column.name] = str

result = db_engine.execute(sql)

row_batch = result.fetchmany(size=batch_size)
iteration = 1
append = False

while (len(row_batch) > 0):

     b_df = pd.DataFrame(row_batch, columns=column_dict, dtype=str)

     b_df = b_df.astype(dtype=column_dict)

     if not b_df.empty:
         fastparquet.write(parqfilename, b_df, write_index=False, compression='GZIP', append=append)
         append = True
         row_batch = result.fetchmany(size=batch_size)
         iteration = iteration + 1
  • Dask version: 2022.9.1
  • Python version: 3.10
  • Operating System: Windows 10 Enterprise version 10.0.19041 Build 19041
  • Install method (conda, pip, source): pip
@martindurant
Copy link
Member

That's a lot of code and stuff I'm not familiar with, it would be really useful to try to pare down the reproducer to the minimum required to show the problem.

At a complete guess columns=column_dict in pd.DataFrame() is passing some non-string items derived from the database stuff. You may need explicit str() around these.

@TrentGlover
Copy link
Author

TrentGlover commented Sep 29, 2022 via email

@TrentGlover
Copy link
Author

same error when attempting to turn all columns into strings...

data_query: str = "Select {} from archive.Ibs.PHOALD WITH (NOLOCK) Where Convert(date, Transaction_time) > Convert("
"date, '2019-05-20') and Convert(date, Transaction_time) <= Convert(date, '2019-06-20') "

...

cursor = conn.cursor()
metadata_sql = 'select top 1 * from archive.Ibs.PHOALD WITH (NOLOCK)'
print(metadata_sql)
cursor.execute(metadata_sql)
column_names = ''
field_names = [i[0] for i in cursor.description]
for j in field_names:
    j = 'convert(nvarchar, ' + j + ') as ' + j
    column_names = column_names + ',' + j
column_names = column_names[1:]
print(f"{column_names}")

sql = data_query.format(column_names)
print(f"{sql}")

@martindurant
Copy link
Member

A "quoted_name" is entering into your dataframe metadata somewhere, so the task is to find out where. Maybe fastparquet should be converting it to a string internally, but I would bet that requiring a real string is a reasonable expectation in this case.

I'm afraid I don't really know what your code it doing, I don't normally work with DB cursor objects. You might find it easier to use pd.from_sql_query instead, as I suspect it might do the correct conversions for you.

@TrentGlover
Copy link
Author

TrentGlover commented Sep 30, 2022

the problem definitely looks like it is native to the fastparquet code. i've called it without modification and it is still throwing the error.

@martindurant
Copy link
Member

i've called it without modification and it is still throwing the error.

What did you call exactly? Did you manage to make a dataframe showing this problem without SQL?

@TrentGlover
Copy link
Author

TrentGlover commented Oct 3, 2022 via email

@martindurant
Copy link
Member

I mean:

     b_df = pd.DataFrame(row_batch, columns=column_dict, dtype=str)
     b_df = b_df.astype(dtype=column_dict)

here both the original columns and the ones supplied in dtype= have keys that came from the SQL driver, and probably responsible for the "quoted_name" in the error. Try converting them to actual str before doing anything with them:

columns = {str(k): v for k, v in columns_dict.items()}

Are the dtypes really all str?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants