Support ffill and bfill operations while remaining sparse #285

Closed
p-d-moore opened this issue Sep 4, 2019 · 7 comments
Labels
enhancement Indicates new feature requests

Comments

@p-d-moore

I would like to add a request for sparse: Support ffill and bfill operations while maintaining the sparse level of data density.

The challenge to overcome is that performing ffill operations on sparse data quickly creates data that is no longer "sparse" in practice and makes dealing with the data challenging.

My suggested implementation (and the way I have previously done this in another programming environment) is to represent the data as rows of contiguous regions with a single (non-sparse) value rather than rows of single points. That is, the data then is represented as a list of values + coordinate ranges rather than a list of values + coordinates. This request might make more sense in the particular context of the sparse value being NaN.

The idea is that you can easily compute operations like ffill without changing the sparsity of the matrix, and thus support typical aggregating functions you might like to apply to the data before you collapse the data and convert to a non-sparse form (e.g. perform a lag difference or a cross-sectional mean). These types of operations can be more useful when the data is "fuller" such as after a forward fill, but often not useful when the data is very sparsely populated (as the cross-sectional operations are unlikely to hit the sparse data among the different dimensions).

Care must be taken to avoid "collisions" between sparse blocks of data, that is, to ensure that blocks in the list never accidentally overlap. The implementation can get tricky, but I believe the goal is worthwhile. It may be a large enough change to warrant a separate class, at least initially.
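
As a rough illustration of the block representation described above, here is a minimal pure-Python sketch for a single axis. The names `Run`, `ffill_runs` and `nanmean_runs` are hypothetical and are not part of sparse; the sketch assumes the input runs do not already overlap and that positions outside any run are NaN.

```python
import math
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    """One contiguous block: `value` occupies indices start..stop-1."""
    start: int
    stop: int  # exclusive
    value: float


def ffill_runs(runs: List[Run], length: int, limit: Optional[int] = None) -> List[Run]:
    """Forward-fill by stretching each run; the number of runs never grows."""
    runs = sorted(runs)
    filled = []
    for i, run in enumerate(runs):
        # A run may grow only up to the next run's start (or the end of the
        # axis), which is what keeps blocks from colliding.
        boundary = runs[i + 1].start if i + 1 < len(runs) else length
        stop = boundary if limit is None else min(boundary, run.stop + limit)
        filled.append(Run(run.start, stop, run.value))
    return filled


def nanmean_runs(runs: List[Run]) -> float:
    """Mean over covered positions only, computed without densifying."""
    count = sum(r.stop - r.start for r in runs)
    total = sum(r.value * (r.stop - r.start) for r in runs)
    return total / count if count else math.nan


# Three observations on an axis of length 1000; everything else is NaN.
observed = [Run(2, 3, 10.0), Run(500, 501, 20.0), Run(900, 901, 40.0)]
filled = ffill_runs(observed, length=1000)
print(filled)                # still three runs, now covering indices 2..999
print(nanmean_runs(filled))  # aggregate taken directly on the runs
```

After the fill, 998 of the 1000 positions hold a value, yet storage is still three entries. Extending this to several dimensions, where runs along one axis must be tracked per coordinate of the other axes, is where the bookkeeping would likely get tricky.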

@hameerabbasi
Collaborator

Hi! Would making the "fill-value" (as we call it) a reverse-broadcasted version of the "dense" part help? This seems to be a very niche feature.

@hameerabbasi added the enhancement label on Sep 4, 2019
@p-d-moore
Author

Hi @hameerabbasi, I have to confess, I am not sure what you mean by a reverse-broadcasted version of the dense part? The request is related to this xarray discussion

The request is really to generalise the current Sparse class to represent data where (non fill-value) values are repeated consecutively. Such data often arises from ffill / fillna type operations.

I agree the feature is in danger of being somewhat niche unless it finds wider support. The use case is where we have sparse observations of some data that we want to aggregate along a given dimension. Because the data is sparse, it is difficult to aggregate unless a ffill / fillna type operation is applied first, but performing these operations tends to produce data that is no longer sparse and increases memory usage, which defeats the purpose of using sparse to begin with.

@hameerabbasi
Collaborator

I believe in that case making broadcast_to a view would suit your needs, and be of more use generally.

@hameerabbasi
Collaborator

Wait, I just read the documentation for ffill/fillna. This should be possible by initially keeping the fill-value as NaN and later changing the fill-value to whatever is suitable via (for example) np.where(np.isnan(arr), value_to_replace_nan_with, arr).
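
To make the fill-value idea concrete, here is a toy pure-Python sketch. `TinySparse1D` is a hypothetical class invented for illustration and does not reflect how sparse is implemented; it only shows why replacing NaN with a constant leaves the number of stored entries unchanged.

```python
import math
from typing import Dict, List


class TinySparse1D:
    """Hypothetical 1-D sparse vector: stored entries plus one fill value."""

    def __init__(self, length: int, data: Dict[int, float], fill_value: float = math.nan):
        self.length = length
        self.data = dict(data)
        self.fill_value = fill_value

    def fillna(self, value: float) -> "TinySparse1D":
        # Replace NaN among the stored entries, then swap the fill value if
        # it is NaN. No new entries are stored.
        data = {i: (value if math.isnan(v) else v) for i, v in self.data.items()}
        fill = value if math.isnan(self.fill_value) else self.fill_value
        return TinySparse1D(self.length, data, fill)

    def todense(self) -> List[float]:
        return [self.data.get(i, self.fill_value) for i in range(self.length)]


arr = TinySparse1D(6, {1: 3.0, 4: 7.0})  # fill value defaults to NaN
filled = arr.fillna(0.0)
print(filled.todense())  # [0.0, 3.0, 0.0, 0.0, 7.0, 0.0]
print(len(filled.data))  # 2
```

This covers only replacement by a single constant, since one fill value is shared by every unstored position.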

@p-d-moore
Author

Sorry, I meant ffill as opposed to fillna (fillna was a red herring there).

By ffill I mean the operation as found in Pandas or xarray.

That is, copying each non-NaN data point forward along a given dimension (or down the rows of a dataframe), either a set number of times or until it collides with another data point. The data then consists of regions in which the same value is repeated, and each region may take on a different value.
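
For reference, the forward-fill semantics described here can be sketched on a plain dense list as follows. This is an illustrative stand-in written for this thread, not the Pandas or xarray implementation; the `limit` parameter plays the role of the "set number of times".

```python
import math
from typing import List, Optional


def ffill(values: List[float], limit: Optional[int] = None) -> List[float]:
    """Copy each non-NaN value forward over the NaNs that follow it."""
    out = []
    last = math.nan  # nothing to copy until the first observation
    gap = 0          # consecutive NaNs seen since the last observation
    for v in values:
        if not math.isnan(v):
            last, gap = v, 0
            out.append(v)
        else:
            gap += 1
            # Fill only while within the limit; the fill also stops
            # naturally when the next observation is reached.
            out.append(last if limit is None or gap <= limit else math.nan)
    return out


nan = math.nan
row = [nan, 1.0, nan, nan, nan, 5.0, nan]
print(ffill(row))           # [nan, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0]
print(ffill(row, limit=1))  # [nan, 1.0, 1.0, nan, nan, 5.0, 5.0]
```

Two observations become six non-NaN entries after an unlimited fill, which is the loss of sparsity under discussion.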

@hameerabbasi
Collaborator

This is a lot more niche than I initially assumed... I really doubt this is in scope here.

@p-d-moore
Author

On further consideration, what I want to do might not fit well in this project. I would like to replicate something I built in the past, but it probably takes a different, more specialised form than what sparse aims to provide.

2 participants