EOFError when multi-processing and orbit is variable #308

Open
eas342 opened this issue Apr 28, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@eas342
Contributor

eas342 commented Apr 28, 2023

Describe the bug
I had an issue with sampling in starry on multiple cores when the orbit is variable. This had me running everything more slowly on one core for a while. The error traceback I got was:

Error traceback
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Input In [24], in ()
      1 with model:
      2     #trace = pmx.sample(
----> 3     trace = pm.sample(
      4         tune=250,
      5         draws=500,
      6         start=map_soln,
      7         chains=4,
      8         cores=4,
      9         target_accept=0.9,
     10     )

File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/sampling.py:559, in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
557 _print_step_hierarchy(step)
558 try:
--> 559 trace = _mp_sample(**sample_args, **parallel_args)
560 except pickle.PickleError:
561 _log.warning("Could not pickle model, sampling singlethreaded.")

File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/sampling.py:1477, in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, callback, discard_tuned_samples, mp_ctx, pickle_backend, **kwargs)
1475 try:
1476 with sampler:
-> 1477 for draw in sampler:
1478 trace = traces[draw.chain - chain]
1479 if trace.supports_sampler_stats and draw.stats is not None:

File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/parallel_sampling.py:479, in ParallelSampler.__iter__(self)
476 self._progress.update(self._total_draws)
478 while self._active:
--> 479 draw = ProcessAdapter.recv_draw(self._active)
480 proc, is_last, draw, tuning, stats, warns = draw
481 self._total_draws += 1

File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/parallel_sampling.py:351, in ProcessAdapter.recv_draw(processes, timeout)
349 idxs = {id(proc._msg_pipe): proc for proc in processes}
350 proc = idxs[id(ready[0])]
--> 351 msg = ready[0].recv()
353 if msg[0] == "error":
354 warns, old_error = msg[1:]

File ~/miniconda3/envs/ABATE/lib/python3.9/multiprocessing/connection.py:255, in _ConnectionBase.recv(self)
253 self._check_closed()
254 self._check_readable()
--> 255 buf = self._recv_bytes()
256 return _ForkingPickler.loads(buf.getbuffer())

File ~/miniconda3/envs/ABATE/lib/python3.9/multiprocessing/connection.py:419, in Connection._recv_bytes(self, maxsize)
418 def _recv_bytes(self, maxsize=None):
--> 419 buf = self._recv(4)
420 size, = struct.unpack("!i", buf.getvalue())
421 if size == -1:

File ~/miniconda3/envs/ABATE/lib/python3.9/multiprocessing/connection.py:388, in Connection._recv(self, size, read)
386 if n == 0:
387 if remaining == size:
--> 388 raise EOFError
389 else:
390 raise OSError("got end of file during message")

EOFError:
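
A note for anyone reading this traceback: the last frames are all inside Python's multiprocessing module, where Connection.recv() raises EOFError when the other end of the pipe is closed before any bytes arrive. In a parallel sampler that usually points at a worker process that exited without reporting an error, although that is an inference and not something the traceback proves. The stdlib-only sketch below reproduces just that mechanism. The name `recv_after_sender_closed` is illustrative, and closing the sender by hand is an assumption standing in for a crashed worker.

```python
from multiprocessing import Pipe


def recv_after_sender_closed():
    """Return the name of the error raised when the sending end closes early."""
    # One-way pipe: `receiver` can only read, `sender` can only write.
    receiver, sender = Pipe(duplex=False)

    # Assumption for illustration: closing the sender without sending
    # anything stands in for a worker process that died before replying.
    sender.close()

    try:
        receiver.recv()
    except EOFError:
        return "EOFError"
    finally:
        receiver.close()
    return "no error"
```

If this reading is right, the interesting question is why the worker exits, not the EOFError itself, so running with cores=1 to surface the underlying error may be more informative.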

To Reproduce

Minimal-ish example adapted from the "Hot jupiter phase curve example"
import starry
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import pymc3_ext as pmx
import exoplanet

starry.config.quiet = True
np.random.seed(1)


# In[7]:


A = starry.Primary(starry.Map(ydeg=0, udeg=2, amp=1.0), m=1.0, r=1.0, prot=1.0)
A.map[1] = 0.4
A.map[2] = 0.2


# In[8]:


# These are the parameters we're going to try to infer
log_amp_true = -3.0
offset_true = 30.0

b = starry.Secondary(
    starry.Map(ydeg=1, udeg=0, amp=10 ** log_amp_true, inc=90.0, obl=0.0),
    m=0.0,
    r=0.1,
    inc=90.0,
    prot=1.0,
    porb=1.0,
)
b.map[1, 0] = 0.5
b.theta0 = 180.0 + offset_true


# In[9]:


sys = starry.System(A, b)


# In[10]:


t = np.linspace(-0.3, 1.3, 1000)
flux_true = sys.flux(t).eval()
ferr = 1e-4
flux = flux_true + ferr * np.random.randn(len(t))
plt.figure(figsize=(12, 5))
plt.plot(t, flux, "k.", alpha=0.3, ms=3)
plt.plot(t, flux_true)
plt.xlabel("Time [days]", fontsize=24)
plt.ylabel("Flux [normalized]", fontsize=24);


# In[21]:


with pm.Model() as model:

    # These are the variables we're solving for;
    # here we're placing wide Gaussian priors on them.
    #offset = pm.Normal("offset", 0.0, 50.0, testval=0.11)
    offset=offset_true
    log_amp = pm.Normal("log_amp", -4.0, 2.0, testval=-3.91)

    porb = pm.Normal('porb',mu=1.0,sigma=0.02)
    #porb = 1.0

    # Instantiate the star; all its parameters are assumed
    # to be known exactly
    A = starry.Primary(
        starry.Map(ydeg=0, udeg=2, amp=1.0, inc=90.0, obl=0.0), m=1.0, r=1.0, prot=1.0
    )
    A.map[1] = 0.4
    A.map[2] = 0.2

    # Instantiate the planet. Everything is fixed except for
    # its luminosity and its orbital period (the hot spot offset is held fixed here).
    b = starry.Secondary(
        starry.Map(ydeg=1, udeg=0, amp=10 ** log_amp, inc=90.0, obl=0.0),
        m=0.0,
        r=0.1,
        prot=1.0,
        porb=porb,
    )
    b.map[1, 0] = 0.5
    b.theta0 = 180.0 + offset

    # Instantiate the system as before
    sys = starry.System(A, b)

    # Our model for the flux
    flux_model = pm.Deterministic("flux_model", sys.flux(t))

    # This is how we tell `pymc3` about our observations;
    # we are assuming they are normally distributed about
    # the true model. This line effectively defines our
    # likelihood function.
    pm.Normal("obs", flux_model, sd=ferr, observed=flux)


# In[22]:


with model:
    map_soln = pmx.optimize()


# In[23]:


plt.figure(figsize=(12, 5))
plt.plot(t, flux, "k.", alpha=0.3, ms=3)
plt.plot(t, map_soln["flux_model"])
plt.xlabel("Time [days]", fontsize=24)
plt.ylabel("Flux [normalized]", fontsize=24);


# In[24]:


with model:
    trace = pm.sample(
        tune=250,
        draws=500,
        start=map_soln,
        chains=4,
        cores=4,
        target_accept=0.9,
    )

Expected behavior
Sampling should run on all four cores and return a trace object. Instead, I get the EOFError above.

Your setup (please complete the following information):

  • Version of starry: 1.2.0
  • Operating system: Mac OS X 12.0.1
  • Python version & installation method (pip, conda, etc.): Python 3.9.12, pip 21.2.4, conda 4.10.1, M1 Pro chip

Additional context
This happened to me in a fairly specific set of circumstances:

  • when the number of cores > 1
  • when the orbital parameters are variable

However, this is the set of circumstances in which I primarily use starry.

I have found a solution/workaround and wanted to share it in case anyone else gets EOFError.

@eas342 eas342 added the bug Something isn't working label Apr 28, 2023
@eas342
Contributor Author

eas342 commented Apr 28, 2023

The error goes away if I change the sampling from pymc3 to pymc3_ext. I wish I'd noticed that sooner!

This gives an EOFError:

import pymc3_ext as pmx
import pymc3 as pm

# `model` and `map_soln` are defined in the example above
with model:
    trace = pm.sample(
        tune=250,
        draws=500,
        start=map_soln,
        chains=4,
        cores=4,
        target_accept=0.9,
    )

This works with no problem. Notice the x in pmx:

import pymc3_ext as pmx
import pymc3 as pm

# `model` and `map_soln` are defined in the example above
with model:
    trace = pmx.sample(
        tune=250,
        draws=500,
        start=map_soln,
        chains=4,
        cores=4,
        target_accept=0.9,
    )

@dfm
Collaborator

dfm commented Apr 28, 2023

Good find! I expect this has something to do with these multiprocessing hacks in pymc3-ext:

https://github.com/exoplanet-dev/pymc-ext/blob/ef9bc12b8b16bd0991af70f23c59f1e32d9944be/src/pymc3_ext/sampling/sampling.py#LL100C1-L107C30

PyMC3 has always had some serious issues with multiprocessing on Macs, and I never loved this "solution", but it typically seems to work!
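
Since the suspicion here is about how worker processes get started, it may be worth checking which multiprocessing start methods the platform offers and which one is the default. The following is a stdlib-only sketch. `describe_start_methods` is a made-up helper name and is not part of pymc3 or pymc3-ext.

```python
import multiprocessing as mp


def describe_start_methods():
    """Report the start methods this platform supports and its default."""
    return {
        # e.g. some subset of "fork", "spawn", "forkserver"
        "available": mp.get_all_start_methods(),
        # Note: this call also fixes the default for the current process.
        "default": mp.get_start_method(),
    }


if __name__ == "__main__":
    print(describe_start_methods())
```

The pm.sample signature in the traceback above includes an mp_ctx argument, so passing a context such as multiprocessing.get_context("forkserver") there could be an experiment worth trying. Whether that avoids the EOFError is untested.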

@catrionamurray

catrionamurray commented May 17, 2023

I still get this bug on an older Mac (Intel, 2019, macOS 13.3.1) even when using pymc3_ext.sample with more than one core. I believe this isn't an issue on my M1 Mac.

starry v1.2.0
Python v3.9.7 (default, Sep 16 2021, 08:50:36)
[Clang 10.0.0 ]
numpy v1.21.6
chromatic v0.4.5
pymc3 v3.11.4
pymc3_ext v0.1.1

@dfm
Collaborator

dfm commented May 17, 2023

@catrionamurray — Bummer! I'm not too sure what to recommend and I don't have an Intel mac to test this locally. One option for the short term might be to just run multiple copies of your script (each with 1 CPU) and then combine the chains after... Unfortunately I'm about to go on leave and working at limited capacity so I can't be super helpful in the short term - sorry!!
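
The "multiple copies of your script" workaround could be driven by a small launcher like the sketch below. It assumes a hypothetical script, here called run_one_chain.py, that accepts hypothetical --seed and --out flags, samples one chain on one core, and saves its own trace. None of those names come from starry or pymc3.

```python
import subprocess
import sys


def build_commands(script, n_chains, base_seed=0):
    """One command per chain, each with its own seed and output file."""
    return [
        [sys.executable, script,
         "--seed", str(base_seed + i),   # hypothetical flag: different seed per chain
         "--out", f"chain_{i}.nc"]       # hypothetical flag: where the run saves its trace
        for i in range(n_chains)
    ]


def run_all(commands):
    """Start every command at once, wait for all of them, return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Usage would be something like run_all(build_commands("run_one_chain.py", 4)). Inside the script, the sampling call would use chains=1 and cores=1 with random_seed set from --seed, so that no multiprocessing happens within pymc3 itself. If each run is saved as ArviZ InferenceData, the results can probably be merged afterwards with arviz.concat(..., dim="chain").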

@catrionamurray

Actually it seems I still get this issue on my M1 chip, and changing to pmx.sample doesn't seem to solve it for me...

3 participants