"Numpy-isation" of irradiance decomposition functions #1455
Hi @alexisstgc thanks for bringing this up. It's a tricky and important problem.
The linked PR currently only shows a single line changed with a comment.
A small number of pvlib functions return OrderedDict if the user provides higher-dimensional input. The idea was that results could still be accessed through key lookup. I've gone back and forth on whether I prefer labeled results or just a tuple. There was an issue with some discussion of this a year or two ago, but I don't recall where.
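For example, a minimal sketch of the key-lookup pattern (the input arrays are made up for illustration):

```python
import numpy as np
import pvlib

# same-shape example inputs (assumed)
ghi = np.array([0., 250., 500.])
zenith = np.array([100., 70., 40.])
doy = np.array([100, 100, 100])

out = pvlib.irradiance.disc(ghi, zenith, doy)  # OrderedDict of arrays
dni = out['dni']  # key lookup works whether out is an OrderedDict or a DataFrame
```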
I am reluctant to endorse xarray because I worry that another user is going to come along and ask for support for something else (maybe something that hasn't been invented yet!). I'd prefer to maintain type consistency with inputs whenever possible.
I am generally in favor of pvlib functions accepting N-dimensional data, broadcasting as appropriate, and outputting something sensible. There was also some related discussion regarding solar position calculations in Unidata/MetPy#1440 (comment).

---
It is a bit clunky, but the general task of applying a function only vectorized across time to data with other dimensions (e.g. spatial) can often be achieved by reshaping the data into a very long 1-D representation, running the function, and then reshaping the result back to N-D.
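A minimal sketch of that workaround (all data here is hypothetical; `disc` stands in for any function vectorized only over a 1-D time-like dimension):

```python
import numpy as np
import pvlib

# hypothetical (time, lat, lon) data
ghi3d = np.random.rand(24, 5, 5) * 1000
zen3d = np.random.rand(24, 5, 5) * 90
doy = np.arange(1, 25)  # one value per time step

# flatten to 1-D, tiling per-time quantities across the spatial dims first
flat_ghi = ghi3d.reshape(-1)
flat_zen = zen3d.reshape(-1)
flat_doy = np.broadcast_to(doy[:, None, None], ghi3d.shape).reshape(-1)

out = pvlib.irradiance.disc(flat_ghi, flat_zen, flat_doy)  # 1-D in, 1-D out
dni3d = out['dni'].reshape(ghi3d.shape)  # reshape the result back to N-D
```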
Maybe #959.

---
Hello @wholmgren and @kanderso-nrel! Thank you for your kind replies. I forgot to mention that this is my first time contributing to a public library, so don't hesitate to tell me if I do something wrong or forget something 🤷♂️ 😅
Yes, I pushed my changes just now, but I still have some work to do to pass the tests and the code-quality checks 😉
Of course, how could I not think of this... The only problem I foresee is the `times` DatetimeIndex you have to provide along with the multidimensional GHI data... But I will try as soon as I can.
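One way to sidestep the DatetimeIndex requirement, at least for functions like `disc` that also accept day-of-year input, is to convert the index to an array that broadcasts over the spatial dimensions. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

times = pd.date_range('2019-06-01', periods=24, freq='1h')
ghi = np.random.rand(24, 5, 5) * 1000  # hypothetical (time, lat, lon) data

# convert the index to a day-of-year array shaped (24, 1, 1),
# which broadcasts against the (time, lat, lon) GHI array
doy = times.dayofyear.to_numpy()[:, None, None]
```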
BTW, in my PR I modified some additional points, and I would also like some feedback on those: …

---
I had a couple of thoughts about this topic: …

---
Wondering what this would look like in practice... Let's start with the simple case of `disc` (pvlib-python/pvlib/irradiance.py, lines 1400 to 1423 at 5047b26).
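Roughly, the tail of `disc` packages its outputs like this (a from-memory paraphrase, not the verbatim linked code):

```python
# paraphrased sketch of how disc packages its results
output = OrderedDict()
output['dni'] = dni
output['kt'] = kt
output['airmass'] = am

# only a DatetimeIndex input triggers pandas bundling
if isinstance(datetime_or_doy, pd.DatetimeIndex):
    output = pd.DataFrame(output, index=datetime_or_doy)

return output
```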
I think the OP's problem is well-posed. And this is not the first time we've seen users struggle against multi-dimensional input with pvlib functions. The array/Series changes in the linked PR seemed reasonable to me, though I didn't give it a full review. As @kanderso-nrel said, reshaping numpy arrays to work with pvlib functions is clunky - maybe workable in isolation, but probably not if you want to stitch steps together.

---
Many core functions might have to be refactored. Maybe even
I was hoping that this might be true, by some dumb luck and good architecture, but we'll see.
No, a pure NumPy version of
Yes. Here's a quick cheat to broadcast the output into a structured array:

```python
import numpy as np

# dni, kt, am are assumed to be the outputs of a disc-like computation
dtype = np.dtype([('dni', np.float64), ('kt', np.float64), ('airmass', np.float64)])
shape = dni.shape
output = np.empty(shape, dtype)  # instead of an OrderedDict()
output['dni'] = dni
output['kt'] = kt
output['airmass'] = am
```

This works as long as `kt` and `am` broadcast to `dni`'s shape.

---
Good initiative and ideas. I like the idea of core numpy-only functions.

---
The structured array documentation seems to suggest this is a bad fit for our use case. I've not used structured arrays except for tinkering, so I don't have any personal experience to add to that. I am still not understanding what the API is for users that want a DataFrame with columns `dni`, `kt`, `airmass`.
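One possibility (an assumption on my part, not a settled design): core functions return plain arrays/dicts, and users who want pandas wrap the result themselves, assuming 1-D inputs:

```python
import pandas as pd

# hypothetical 1-D inputs ghi_1d, zenith_1d and a DatetimeIndex `times`
out = pvlib.irradiance.disc(ghi_1d, zenith_1d, times.dayofyear.to_numpy())
df = pd.DataFrame(out, index=times)  # columns: dni, kt, airmass
```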
---

Haha 🤣 I must be missing something totally obvious. I use struct arrays all the time, and I believe they are by far the fastest. See this gist. By my test, structured arrays are at least 200x faster than xarray and 13x faster than pandas. Also, as in my quick-cheat example above, broadcasting becomes trivial. For very simple struct arrays, I don't think the syntax is that hard; you just need to define the `dtype`. The API would be something like:
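A minimal sketch of what such an API could look like (the function name and the `_disc_core` helper are hypothetical, following the quick-cheat pattern above):

```python
import numpy as np

DISC_DTYPE = np.dtype([('dni', float), ('kt', float), ('airmass', float)])

def disc_struct(ghi, solar_zenith, doy):
    # hypothetical: a numpy-only core returns three broadcastable arrays
    dni, kt, airmass = _disc_core(ghi, solar_zenith, doy)
    # pack the broadcast outputs into a single structured array
    out = np.empty(np.broadcast(ghi, solar_zenith, doy).shape, DISC_DTYPE)
    out['dni'], out['kt'], out['airmass'] = dni, kt, airmass
    return out
```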
---
I was surprised to find that I already wrote 2D array tests for clearness index. Airmass functions are tested with 1-D numpy arrays.

```python
In [43]: pvlib.irradiance.disc(np.array([[0, 250], [500, 750]], dtype=np.float32),
    ...:                       np.array([[100, 70], [40, 20]], dtype=np.float32),
    ...:                       np.array([[100, 100], [100, 100]], dtype=np.float32))
Out[43]:
OrderedDict([('dni',
              array([[  0.        , 414.87658824],
                     [165.11445531, 307.28344503]])),
             ('kt',
              array([[0.        , 0.53562295],
                     [0.47828513, 0.58485246]], dtype=float32)),
             ('airmass',
              array([[        nan, 2.89994597],
                     [1.30367947, 1.0634042 ]]))])
```

Why is float32 preserved for `kt` but not the others? Turns out that the type promotion happens when `pressure` is left at its default; passing it explicitly keeps everything float32:

```python
In [64]: pvlib.irradiance.disc(np.array([[0, 250], [500, 750]], dtype=np.float32),
    ...:                       np.array([[100, 70], [40, 20]], dtype=np.float32),
    ...:                       np.array([[100, 100], [100, 100]], dtype=np.float32),
    ...:                       pressure=101325.)
Out[64]:
OrderedDict([('dni',
              array([[  0.     , 414.87656],
                     [165.1144 , 307.28342]], dtype=float32)),
             ('kt',
              array([[0.        , 0.53562295],
                     [0.47828513, 0.58485246]], dtype=float32)),
             ('airmass',
              array([[      nan, 2.899946 ],
                     [1.3036796, 1.0634042]], dtype=float32))])
```

Ok, now I want to use dask arrays...

```python
In [21]: out_da = irradiance.disc(da.from_array(ghi), da.from_array(solar_zenith),
    ...:                          da.from_array(doy), pressure=101325.)

In [22]: out_da
Out[22]:
OrderedDict([('dni',
              dask.array<where, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>),
             ('kt',
              dask.array<minimum, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>),
             ('airmass',
              dask.array<minimum, shape=(2, 2), dtype=float32, chunksize=(2, 2), chunktype=numpy.ndarray>)])

In [23]: out_da['dni'].compute()
Out[23]:
array([[  0.     , 414.87656],
       [165.1144 , 307.28342]], dtype=float32)
```

It just works! What if we used a structured array instead of a dict? Following @mikofski's example of a structured array...

```python
In [43]: output = np.empty(shape, dtype)

In [44]: output['dni'] = out_da['dni']

In [45]: output['kt'] = out_da['kt']

In [46]: output['airmass'] = out_da['airmass']

In [47]: output
Out[47]:
array([[(  0.        , 0.        ,        nan),
        (414.8765564 , 0.53562295, 2.89994597)],
       [(165.11439514, 0.47828513, 1.30367959),
        (307.28341675, 0.58485246, 1.0634042 )]],
      dtype=[('dni', '<f8'), ('kt', '<f8'), ('airmass', '<f8')])
```

It appears that assigning the dask arrays into the structured array computes them eagerly (the result is a plain float64 numpy array).

So at least for this function, we mostly have a documentation and testing problem. Not sure what to do about the API, but it seems a little silly to copy all those parameters and all that documentation just so that we can be explicit about "function …".

---
This is great! Good architecture always pays off with serendipity!
I'm not sure I follow what this means. What is …?

```python
import numpy as np
import pandas as pd
import pvlib as pvl

ghi = np.clip(np.exp(np.random.randn(10000, 3)) * 100, a_max=1000, a_min=0)
ze = np.clip(np.exp(np.random.randn(10000, 3)) * 10, a_max=80, a_min=0)
doy = np.random.randint(364, size=(10000, 3))

x = pvl.irradiance.disc(ghi, ze, doy)

y = np.empty((10000, 3), [('dni', float), ('kt', float), ('airmass', float)])
y['dni'] = x['dni']
y['kt'] = x['kt']
y['airmass'] = x['airmass']

%timeit x['dni']
# 39.6 ns ± 6.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit y['dni']
# 93.6 ns ± 1.8 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```

But can an ordered dict...

```python
# do this
y[['dni', 'kt']]  # all dni and kt
# or this
y[:100, 2]  # all fields, but only column 2 up to row 100
# or this?
pd.DataFrame(y[:, 1])  # make a dataframe from all fields for column 1
```

Anyway, like I said, I don't really see the need for structured arrays unless we switch to pure numpy functions that don't use anything else, only numpy.

---
Do I understand correctly that if there is a single return "value", then there is agreement that returning a normal numpy array is fine? And the primary motivator for the structured arrays is the case of multiple return "values", so that they can be accessed by name? I would be in favor of using tuples and forgetting about the names.

---
Not quite. I am interpreting the request for "numpy-isation" as a request for "de-pandas-isation" and for array duck typing. Maybe I am not listening well! Anyway, I am advocating for core functions that respect the user's choice of input type. So in the case of a single return value, if I provide, say, a dask array in, then I'll get a dask array out. Same for xarray or cupy. Or pandas (but not explicitly bundled)! Core functions that strictly return numpy would be great for pure numpy users, but I think we can do better. The main thing is that core pvlib should not attempt to bundle things up into pandas objects (or anything else). The array function protocol is not perfect (NEP-37), but I think if we write non-opinionated functions then more and more things will just work over time. More recent developments: …
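To make the duck-typing idea concrete, a minimal sketch (a hypothetical function loosely modeled on `clearness_index`, not pvlib's actual code):

```python
import numpy as np

def clearness_index_core(ghi, cos_zenith, dni_extra,
                         min_cos_zenith=0.065, max_clearness_index=2.0):
    # only numpy ufuncs/functions appear here, so __array_function__/ufunc
    # dispatch lets dask, cupy, etc. inputs come back out as the same type
    kt = ghi / (dni_extra * np.maximum(cos_zenith, min_cos_zenith))
    return np.clip(kt, 0, max_clearness_index)
```

Call it with numpy arrays and you get numpy back; call it with dask arrays and you get lazy dask arrays back, without this code ever mentioning dask.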
I am coming around to this. Lots of numpy and scipy functions return tuples of arrays. My hesitancy is about how to make this work for most of core pvlib without doubling the scope of the core API or going through a painful deprecation process.

Like I said above, maybe I am not listening. Do you really want functions that always return numpy no matter what? Or do you only care that functions (1) work with numpy inputs and (2) return a tuple/dict of numpy outputs?

---
I'd have to spend a lot of time researching Python topics to understand everything discussed and mentioned in this issue, so I am not really qualified to comment, but anyway... I do like it when a function that receives Series arguments returns a Series, but typically I'm building a DataFrame as I go, and assigning an array to a column works just as well.

---
You've opened my eyes. I just want …
Up to now I've been assuming numpy was the only way to get all 3 of these features, but now I realize there could be other backend tools that could handle these dispatches even better. I'm all for that.
---

I'm okay with either. I'm not married to numpy structured arrays. Maybe I was wrong about their use; everyone seems to keep burying them now that other tools are here. I thought they were faster, but maybe not in the way that's most needed, or it might not be as important. For grabbing a field of data, it seems like a tuple or dictionary is just as good, and more generic. Some APIs automatically return numpy structured arrays, HDF5 for example. Also pandas has a …

---
@mikofski Beta was also better than VHS. :)

---
Problem

I generate GHI data in maps and I want to run it through irradiance decomposition models (specifically `dirindex`) in order to obtain DNI and DHI data. My data is in three dimensions (time, latitude, longitude), so passing it through pvlib's built-in functions, which require a `pandas.DataFrame`, is neither easy nor very efficient.

Solution

I have already adapted all the necessary functions to accept numpy arrays of any dimension, with the time dimension last (see the linked PR). Results are consistent with the output of pvlib's existing functions, and the computation is extremely fast compared to the alternative below. However, I am looking for some advice: …

Alternatives I've considered

I tried looping over all the lat/lon points and calling `dirindex` for each of them, with multiprocessing, but it took too long / was too computationally heavy for the volume of data I was using.