Support ffill and bfill operations while remaining sparse #285
Comments
Hi! Would making the "fill-value" (as we call it) a reverse-broadcasted version of the "dense" part help? This seems to be a very niche feature.
Hi @hameerabbasi, I have to confess, I am not sure what you mean by a reverse-broadcasted version of the dense part. The request is related to this xarray discussion. The request is really to generalise the current Sparse class to represent data where (non-fill-value) values are repeated consecutively. Such data often arises from ffill / fillna type operations. I agree the feature is in danger of being somewhat niche unless it finds wider support. The use case is where we have sparse observations of some data that we want to aggregate along a given dimension. Because the data is sparse, it is difficult to aggregate unless an ffill / fillna type operation is applied first, but these operations tend to produce data that is no longer sparse and that uses more memory (avoiding which was the purpose of using sparse to begin with).
I believe in that case making |
Wait, I just read the documentation for |
Sorry, I mean ffill as opposed to fillna (a red herring there). By ffill I mean the operation of that name in Pandas or xarray. That is, copying each point of non-NaN data forward along a given dimension (or down the rows of a dataframe) a set number of times, or until it collides with another data point. The data then consists of regions in which the same value is repeated; each region may take on a different value.
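To make the behaviour concrete, here is a minimal pure-Python sketch of a one-dimensional forward fill with a limit. The helper `ffill` and its `limit` parameter are written for illustration only (they are meant to mirror the idea of the `limit` argument of the Pandas method, not to reproduce its implementation), and none of this is part of sparse.

```python
import math

def ffill(values, limit=None):
    """Forward-fill NaN entries of a 1-D list.

    Each non-NaN value is copied forward at most `limit` times
    (without bound if `limit` is None) or until the next non-NaN value.
    Hypothetical helper for illustration only.
    """
    out = list(values)
    last = math.nan   # most recent non-NaN value seen
    copies = 0        # how many times `last` has been copied so far
    for i, v in enumerate(out):
        if not math.isnan(v):
            last, copies = v, 0
        elif not math.isnan(last) and (limit is None or copies < limit):
            out[i] = last
            copies += 1
    return out

nan = math.nan
row = [nan, 1.0, nan, nan, nan, 5.0, nan, nan]
filled = ffill(row, limit=2)

# Fraction of entries that are not NaN, before and after the fill.
density_before = sum(not math.isnan(v) for v in row) / len(row)
density_after = sum(not math.isnan(v) for v in filled) / len(filled)
print(filled, density_before, density_after)
```

In this small example two stored points become six filled entries, which is the loss of sparsity being discussed.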
This is a lot more niche than I initially assumed... I really doubt this is in scope here.
On further consideration, what I want to do might not fit well in this project. I would like to replicate something I built in the past, but it probably takes a different, more specialised form than what sparse is aiming for.
I would like to add a request for sparse: Support ffill and bfill operations while maintaining the sparse level of data density.
The challenge to overcome is that performing ffill operations on sparse data quickly creates data that is no longer "sparse" in practice and makes dealing with the data challenging.
My suggested implementation (and the way I have previously done this in another programming environment) is to represent the data as rows of contiguous regions, each holding a single (non-sparse) value, rather than rows of single points. That is, the data is then represented as a list of values + coordinate ranges rather than a list of values + coordinates. This request might make more sense in the particular context of the sparse value being NaN.
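A rough one-dimensional sketch of the values + coordinate ranges idea follows. The `Run` type, the half-open `[start, stop)` convention, and the `ffill_runs` helper are all assumptions made for illustration; they are not an existing sparse API and not necessarily how the earlier implementation worked.

```python
from typing import List, NamedTuple, Optional

class Run(NamedTuple):
    """One contiguous region holding a single value (hypothetical type)."""
    start: int    # first index covered (inclusive)
    stop: int     # one past the last index covered (exclusive)
    value: float

def ffill_runs(runs: List[Run], length: int, limit: Optional[int] = None) -> List[Run]:
    """Forward-fill runs that are sorted by start and do not overlap.

    Each run is extended by up to `limit` positions, stopping at the next
    run or at the end of the axis. The number of stored runs is unchanged.
    """
    out = []
    for i, r in enumerate(runs):
        # The fill may not cross into the next run or past the axis length.
        next_start = runs[i + 1].start if i + 1 < len(runs) else length
        stop = next_start if limit is None else min(r.stop + limit, next_start)
        out.append(Run(r.start, stop, r.value))
    return out

# Two isolated points at indices 1 and 5 on an axis of length 8.
points = [Run(1, 2, 1.0), Run(5, 6, 5.0)]
filled = ffill_runs(points, length=8, limit=2)
print(filled)
```

Under these assumptions the fill only moves the `stop` of each run, so the storage cost stays proportional to the number of original points rather than the number of filled cells.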
The idea is that you can easily compute operations like ffill without changing the sparsity of the matrix, and thus support typical aggregating functions you might like to apply to the data before you collapse the data and convert to a non-sparse form (e.g. perform a lag difference or a cross-sectional mean). These types of operations can be more useful when the data is "fuller" such as after a forward fill, but often not useful when the data is very sparsely populated (as the cross-sectional operations are unlikely to hit the sparse data among the different dimensions).
Care must be taken to avoid "collisions" between sparse blocks of data, that is, to ensure that the blocks in the list never accidentally overlap. The implementation can get tricky, but I believe the goal to be worthwhile. It may be a large enough change to make it a separate class, at least initially.
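As one possible way to guard against such collisions along a single dimension, the blocks can be sorted by start and each one compared with the furthest extent covered so far. This is only a sketch: blocks are assumed to be `(start, stop, value)` tuples with a half-open `[start, stop)` range, and `colliding_blocks` is a hypothetical helper name.

```python
def colliding_blocks(blocks):
    """Return the blocks that overlap some earlier block.

    `blocks` is an iterable of (start, stop, value) tuples with half-open
    ranges [start, stop). Hypothetical helper for illustration only.
    """
    collisions = []
    covered_until = None  # furthest `stop` seen among earlier blocks
    for start, stop, value in sorted(blocks):
        if covered_until is not None and start < covered_until:
            collisions.append((start, stop, value))
        covered_until = stop if covered_until is None else max(covered_until, stop)
    return collisions

ok = colliding_blocks([(1, 4, 1.0), (5, 8, 5.0)])
bad = colliding_blocks([(1, 6, 1.0), (5, 8, 5.0)])
print(ok, bad)
```

Blocks that merely touch (one stops where the next starts) are not reported, since with half-open ranges they share no index.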