Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not crash when locations are null(!) #14

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AmlingPalantir
Copy link
Contributor

We have observed null locations (specifically write, flush, and replay) in the field crashing manatee-adm pg-status.

@davepacheco
Copy link
Contributor

Thanks for this. Under what conditions are the locations null though?

@AmlingPalantir
Copy link
Contributor Author

Let's see, IIRC this was right after I had restarted the sitter after rebuilding a deposed primary. For some time the row for sync (from which the rebuilt deposed primary would soon be replicating) would show "0/0" under "SENT" and then crash. Inspection of internals showed WRITE, FLUSH, and REPLAY were all null. My guess is this was as the new box was starting up replication, or something. Sorry I don't really have much but I am able to reproduce this very easily (every failover) if you need to know more.

@davepacheco
Copy link
Contributor

@AmlingPalantir Given that it's easy to reproduce, could you post the output that came back from Postgres? If it were me, I'd just run with --abort-on-uncaught-exception and look at the actual value in the core file when it crashes. If it's easier, maybe you could console.log() it instead?

If it's not clear, the context is that I don't like to make changes that just assume any arbitrary value can be null. I'd rather have a concrete understanding of exactly what it means when various fields are null, and make sure that's documented and dealt with everywhere it matters.

@AmlingPalantir
Copy link
Contributor Author

Whoops, here's the row from pg_stat_replication that it is barfing on:

{ pid: 61060,
  usesysid: 10,
  usename: 'palantir',
  application_name: 'pg_basebackup',
  client_addr: '<ip>',
  client_hostname: null,
  client_port: 47944,
  backend_start: Tue Nov 17 2015 15:36:08 GMT-0500 (EST),
  backend_xmin: null,
  state: 'backup',
  sent_location: '0/0',
  write_location: null,
  flush_location: null,
  replay_location: null,
  sync_priority: 0,
  sync_state: 'async' }

Our fork uses pg_basebackup in place of ZFS snapshots and this looks to be
[more] fallout from that.

@davepacheco
Copy link
Contributor

Given that context, I'm not sure it's sufficient to just set the locations to "-". Wouldn't it make more sense to label the replication kind as "pgbackup" or something to indicate that pg_basebackup is being used? Between that and talking with David yesterday, I wonder if it doesn't make more sense to fold this into the other changes you all have and get them all upstreamed together. Let me know what you think. If for whatever reason it would be helpful to get this change in sooner or separately, we should figure out how we actually want this tool to behave on this kind of replication row. I don't think presenting the locations as "-" is quite right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants