Add support for reading and writing avro files #367
base: develop
This pull request introduces 2 alerts when merging 078fccc into 274058d - view on LGTM.com new alerts:
|
This pull request introduces 1 alert when merging d62465a into 274058d - view on LGTM.com new alerts:
|
Hello, this should be an optional dependency; I don't think 90% of users want to install this dependency if they are not going to use it. Thanks a lot for your contribution. What's the status of the code? Should I test it already? |
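One common way to keep a package optional is to import it only when the feature is actually used. The sketch below is a minimal illustration of that idea, not the code from this pull request: `require_optional` and `write_avro` are hypothetical names, and only `fastavro.writer(fo, schema, records)` is a real fastavro call.

```python
import importlib


def require_optional(module_name, feature):
    """Import module_name on demand and fail with a readable hint if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'; "
            f"install it with: pip install {module_name}"
        ) from exc


def write_avro(path, schema, records):
    """Hypothetical helper: fastavro is imported only when Avro output is requested."""
    fastavro = require_optional("fastavro", "Avro support")
    with open(path, "wb") as out:
        fastavro.writer(out, schema, records)
```

Users who never call `write_avro` never need fastavro installed, and those who do get an error message that names the missing package.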
This pull request introduces 1 alert when merging adc4431 into 274058d - view on LGTM.com new alerts:
|
Regarding status of code:
Regarding optional dependency:
If you want testing, just What do you think? Thanks |
Bonobo uses python's type system, which does not allow implicit conversions.
Not sure what "code" you're talking about, this is not far from what bonobo does (see NodeExecutionContext). It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something. I'll have a try as soon as possible. Thanks ! |
Let me explain in more detail what I tried to tackle. To write to an Avro file, one should define a schema like this:
schema = {
'doc': 'A weather reading.',
'name': 'Weather',
'namespace': 'test',
'type': 'record',
'fields': [
{'name': 'station', 'type': 'string'},
{'name': 'day', 'type': 'int', 'logicalType': 'date'},
{'name': 'time', 'type': 'long', 'logicalType': 'time-micros'},
{'name': 'temp', 'type': 'int'},
{'name': 'umidity', 'type': 'bytes', 'logicalType': 'decimal', 'precision': 4, 'scale': 2},
],
}
The workflow that I was trying to handle is the most common in ETL processing:
However, in step 3, there wasn't type information for creating the required typed schema. The problem happens because Python's type system may not be enough to represent the richness of types of an RDBMS database:
Using the value types currently given by bonobo, I got the following types working:
But I struggled with the following types:
I tried to solve this in two ways:
I'm looking for the best way to handle this issue. |
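For context on why these types are awkward: the Avro specification stores the logical types used in the schema above as plain primitives (`date` as an int counting days since 1970-01-01, `time-micros` as a long counting microseconds after midnight, `decimal` as big-endian two's-complement bytes of the unscaled value). The following is a minimal sketch of those conversions using only the standard library. The function names are hypothetical, and Avro libraries normally do this internally.

```python
import datetime
from decimal import Decimal

EPOCH = datetime.date(1970, 1, 1)


def date_to_avro(day):
    """Avro 'date': number of days since the Unix epoch."""
    return (day - EPOCH).days


def time_to_avro_micros(t):
    """Avro 'time-micros': microseconds after midnight."""
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1_000_000 + t.microsecond


def decimal_to_avro_bytes(value, scale):
    """Avro 'decimal': big-endian two's-complement bytes of the unscaled integer."""
    unscaled = int(value.scaleb(scale).to_integral_value())
    length = (unscaled.bit_length() + 8) // 8  # leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


print(date_to_avro(datetime.date(2019, 12, 25)))
print(time_to_avro_micros(datetime.time(21, 22, 23)))
print(decimal_to_avro_bytes(Decimal("23.45"), 2))
```

The writer needs the schema to know that a given int is a date or that a given bytes value is a decimal with a particular scale, which is exactly the information that plain Python values do not carry.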
I agree with that. Thanks |
To reproduce the issue I did the following:
Notice that most types mapped to int, float, string, bytes, and datetime.datetime.
create table testing (
fboolean BOOLEAN,
fchar CHAR(10),
fvarchar VARCHAR(10),
ftext TEXT(10),
ftinyint TINYINT,
fsmallint SMALLINT,
fmediumint MEDIUMINT,
fint INT,
fbigint BIGINT,
fdecimal DECIMAL(9,2),
ffloat FLOAT,
fdouble DOUBLE,
fbit BIT,
fdate DATE,
ftime TIME,
fdatetime DATETIME,
ftimestamp TIMESTAMP,
fyear YEAR
);
insert into testing values(
TRUE,
'abcdef', 'ghijkl', 'mnopqr',
1, 123, 32000, 66000, 1234567890,
123.456, 456.789, 123.789, 1,
'2019-12-25', '21:22:23', '2019-12-25 21:22:23', '2019-10-25 17:22:23',
2019
);
insert into testing values(
false,
'vidi', 'vini', 'vinci',
2, 121, 32023, 66066, 9876543210,
234.567, 567.890, 234.890, 0,
'2019-12-15', '15:22:23', '2019-12-15 16:22:23', '2019-10-15 17:15:23',
2018
); |
I understand your point. It should be possible to use types other than builtins; for example, one could use decimals (https://docs.python.org/3/library/decimal.html) to avoid wrong numbers on a paycheck, or numpy types to have correctly sized variables. There are two ways to do so, and I think (but I may be wrong) it's not bonobo's job to handle this (or at least, not more than providing a way to let the user do it). Either your data producer already knows how to produce those types as an output (a db driver that would yield numpy integers, for example). In that case, the job's already done, and bonobo will just pass those values through. Or your data producer produces other types (assuming they do not contain unwanted approximations) and you can have a node in charge of casting things. This is of course less efficient, but may still work in certain situations, as it will free up wasted memory for further processing, and there should be a limited number of rows waiting to be converted. This is already something you can do in a node. So as I see it (but let me know if I'm wrong, you may have thought more about this), there is one "correct" way, which is the responsibility of whatever talks with the data source, and one workaround, which is possible. Am I missing something? Or are you suggesting that you would need some metadata information storage about columns? |
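The casting workaround described above can be sketched with two plain generator functions, which is the shape bonobo nodes usually take. This is an illustration under assumptions: the column layout, the `extract` data, and the two-decimal quantization are invented for the example, and the producer is assumed to yield exact strings rather than floats.

```python
from decimal import Decimal


def extract():
    """Stand-in producer: yields (station, amount) with the amount as an exact string."""
    yield "station-1", "12.34"
    yield "station-2", "56.7"


def cast_amount(station, amount):
    """Casting node: turns the string amount into a Decimal with two decimal places."""
    yield station, Decimal(amount).quantize(Decimal("0.01"))


# Outside bonobo, the same chain can be exercised by hand:
rows = [row for args in extract() for row in cast_amount(*args)]
print(rows)
```

In a bonobo graph the two functions would simply be chained, for example `bonobo.Graph(extract, cast_amount, ...)`, so that downstream nodes only ever see `Decimal` values.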
Thanks for summarizing the implications of my point. After reading that, I've found some new things about the issues:
|
Regarding the metadata information storage about columns, I think that the importance of preserving the column types depends on the type of output and on the level of effort and complexity of development and use. So the source column type information will or won't matter depending on the use case or scenario. For instance, considering only the output/destination:
For instance considering only the effort and complexity of the translation:
The most common ETL use cases I know are:
Basically we have a combination of:
Considering all this, it's possible to make some decisions like:
One could think that these solutions:
What do you think? |
Hey. Thanks for the detailed analysis/explanation. From what I understand, I think that all use cases are already handled by ... python type system :) Let's say we have int16 and int32 types understood by some consumer. Then the only need is to have some type (as in python type) that contains enough metadata for the consumer to output the correct type. There are two things that are not done by bonobo (but I pretty much think it's not its responsibility, although you're welcome to prove me wrong) :
Do you have concrete cases that cannot be handled this way? Also, I think you should focus on avro-only test cases, as if we are able to produce whatever the avro-related nodes expect and we ensure the said nodes are indeed working correctly, it does not matter to know what kind of node (or graph) produced the data. Not sure this is something you tried to do in your tests but as you're describing cases using remote databases, I prefer to state this, sorry if it's useless. Sorry I still did not find the time to actually test what your code does but I'm on it as soon as possible, if you focus the merge request on having avro readers/writers using optional dependency (and with tests :p), I think we can integrate it pretty soon. If you think that from the discussion another topic is worth considering, maybe we should open specific issues to discuss it ? Thanks |
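The idea of a Python type that "contains enough metadata for the consumer" could look like the sketch below. `Int16`, `Int32`, the `bits` attribute, and `encode` are hypothetical names chosen for illustration; nothing here is part of bonobo.

```python
import struct


class Int16(int):
    """An int that remembers it should be stored in 16 bits."""
    bits = 16


class Int32(int):
    """An int that remembers it should be stored in 32 bits."""
    bits = 32


# Consumer side: choose the storage format from the value's type.
_FORMATS = {16: "<h", 32: "<i"}


def encode(value):
    """Pack the value using the width declared on its type (plain ints default to 32 bits)."""
    bits = getattr(type(value), "bits", 32)
    return struct.pack(_FORMATS[bits], value)


print(len(encode(Int16(7))), len(encode(Int32(7))))
```

One caveat of this approach is that arithmetic on such subclasses returns plain `int` values, so the metadata survives only as long as nodes pass the values through unchanged or re-wrap their results.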
Hi, regarding focus, I am planning to continue in the following way:
Also I need some help regarding the best way of:
Resuming the discussion, regarding type mapping I pretty much agree with your conclusions:
What I think is worth exploring in bonobo for the type mapping is a simpler solution like:
For instance, if I have a table in MySQL created like this:
create table wrong_types (
ftinyint TINYINT,
fsmallint SMALLINT,
fmediumint MEDIUMINT,
fint INT,
fdecimal DECIMAL(9,2),
ffloat FLOAT,
fbit BIT
);
Today, without extra type information besides Python types, it's only possible to create a schema like:
schema = {
'fields': [
{'name': 'ftinyint', 'type': 'long'},
{'name': 'fsmallint', 'type': 'long'},
{'name': 'fmediumint', 'type': 'long'},
{'name': 'fint', 'type': 'long'},
{'name': 'fdecimal', 'type': 'double'},
{'name': 'ffloat', 'type': 'double'},
{'name': 'fbit', 'type': 'bytes'},
],
...
}
But knowing the type information, one could create a better, smaller and faster schema like:
schema = {
'fields': [
{'name': 'ftinyint', 'type': 'int'},
{'name': 'fsmallint', 'type': 'int'},
{'name': 'fmediumint', 'type': 'int'},
{'name': 'fint', 'type': 'int'},
{'name': 'fdecimal', 'type': 'bytes', 'logicalType': 'decimal', 'precision': 9, 'scale': 2},
{'name': 'ffloat', 'type': 'float'},
{'name': 'fbit', 'type': 'boolean'},
],
...
}
The biggest offender there is the mapping of
Should we continue this discussion in a separate issue? |
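If column type names were available alongside the rows, the second schema could be derived mechanically. The sketch below is one hedged way to do it, not something bonobo or this pull request provides: `avro_field` and the lookup table are hypothetical, `BIT` is mapped to `boolean` only because the example above does so (which assumes `BIT(1)`), and the `DECIMAL` fallback of precision 10 and scale 0 follows MySQL's default. The sketch nests `logicalType` inside the type object, which is where the Avro specification defines it.

```python
import re

# Assumed mapping, mirroring the "better" schema in the comment above.
_SIMPLE = {
    "TINYINT": "int", "SMALLINT": "int", "MEDIUMINT": "int", "INT": "int",
    "BIGINT": "long", "FLOAT": "float", "DOUBLE": "double", "BIT": "boolean",
}

_SQL_TYPE = re.compile(r"\s*(\w+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*")


def avro_field(name, sql_type):
    """Build one Avro field definition from a column name and its SQL type."""
    match = _SQL_TYPE.fullmatch(sql_type)
    if match is None:
        raise ValueError(f"cannot parse SQL type: {sql_type!r}")
    base = match.group(1).upper()
    if base == "DECIMAL":
        decimal_type = {
            "type": "bytes",
            "logicalType": "decimal",
            "precision": int(match.group(2) or 10),
            "scale": int(match.group(3) or 0),
        }
        return {"name": name, "type": decimal_type}
    if base in _SIMPLE:
        return {"name": name, "type": _SIMPLE[base]}
    raise ValueError(f"no Avro mapping for SQL type: {sql_type!r}")


columns = [("ftinyint", "TINYINT"), ("fdecimal", "DECIMAL(9,2)"), ("ffloat", "FLOAT")]
fields = [avro_field(name, sql_type) for name, sql_type in columns]
print(fields)
```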
This pull request introduces 3 alerts when merging 145de5a into 274058d - view on LGTM.com new alerts:
|
This pull request introduces 3 alerts when merging 165575d into 274058d - view on LGTM.com new alerts:
|
Hi, Can you review this pull request? I think that I reached the point for starting the integration. thanks |
This pull request introduces 2 alerts when merging 3fd0c88 into 274058d - view on LGTM.com new alerts:
|
Any news about this PR? |
Add support for reading and writing Avro files using FastAvro.
Avro is faster and safer than other formats such as CSV, JSON or XML.
As Avro is typed, the field types are detected from values. Once bonobo starts preserving types, they could be used for determining field types.
Tested with the workflow mysql -> sqlalchemy -> bonobo -> avro.
Publishing now to gather suggestions.
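Detecting field types from values can be pictured with the minimal sketch below. It is an assumption about the general approach, not the code in this pull request: `infer_avro_type` and `infer_schema` are hypothetical names, and only the Avro primitive type names are taken from the specification.

```python
def infer_avro_type(value):
    """Map a Python value to the widest matching Avro primitive type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    raise TypeError(f"no Avro primitive for {type(value).__name__}")


def infer_schema(name, record):
    """Build a record schema from a single sample row given as a dict."""
    fields = [{"name": key, "type": infer_avro_type(val)} for key, val in record.items()]
    return {"type": "record", "name": name, "fields": fields}


sample = {"station": "abc", "temp": 21, "ratio": 0.5, "ok": True}
print(infer_schema("Weather", sample))
```

Because every Python `int` becomes `long` and every `float` becomes `double`, this kind of inference cannot recover narrower or logical types, which is the limitation discussed in the conversation above.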