Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(repr): add show_count option to interactive repr #10518

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Nov 21, 2024

Resolves #10231

Doesn't have tests yet. I want to get feedback on the semantics before I do that. I assume CI is going to fail all over the place because all the docstrings etc are going to change.

CHANGELOG

  1. left-justified the dimensions per review suggestion

Currently looks like

import ibis

ibis.options.repr.interactive.max_rows = 3
ibis.options.repr.interactive.show_count
# True

t = ibis.memtable({"foo": range(1_000), "bar": range(1_000), "baz": range(1_000)})
t
# 3 cols by 1_000 rows
# ┏━━━━━━━┳━━━━━━━┳━━━━━━━┓
# ┃ foo   ┃ bar   ┃ baz   ┃
# ┡━━━━━━━╇━━━━━━━╇━━━━━━━┩
# │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┤
# │     0 │     0 │     0 │
# │     1 │     1 │     1 │
# │     2 │     2 │     2 │
# │     … │     … │     … │
# └───────┴───────┴───────┘
t.preview(show_count=False)
# 3 cols by … rows
# ┏━━━━━━━┳━━━━━━━┳━━━━━━━┓
# ┃ foo   ┃ bar   ┃ baz   ┃
# ┡━━━━━━━╇━━━━━━━╇━━━━━━━┩
# │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┤
# │     0 │     0 │     0 │
# │     1 │     1 │     1 │
# │     2 │     2 │     2 │
# │     … │     … │     … │
# └───────┴───────┴───────┘
t.foo
# 1_000
# rows
# ┏━━━━━━━┓
# ┃ foo   ┃
# ┡━━━━━━━┩
# │ int64 │
# ├───────┤
# │     0 │
# │     1 │
# │     2 │
# │     … │
# └───────┘
# t.foo.preview(show_count=False)
# … rows
# ┏━━━━━━━┓
# ┃ foo   ┃
# ┡━━━━━━━┩
# │ int64 │
# ├───────┤
# │     0 │
# │     1 │
# │     2 │
# │     … │
# └───────┘

I made the config default be show_count=True. I can change this with pushback, but this is what I would start with. I would love to brainstorm a few benchmark test cases to run to see the perf difference. Some ideas:

  1. ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)
  2. ibis.duckdb.connect().read_parquet("t.pq") (I expect~0 difference)
  3. ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)
  4. ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)
  5. ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive computation() (I expect a difference, size depends on semantics of the function)

I considered adding a .repr_options attribute to expressions as described in #10231 (comment), but I decided that was too complicated.

I considered showing the table name in the repr, eg with table.get_name(), but that is a separate question.

Currently, the option has the semantics of show_count: bool, and we ALWAYS show the column count. I considered other encodings such as show_shape: Literal["rows", "cols", "both", None], but I thought that was overkill.

I considered adding the row count to the bottom of the table, eg something like

┏━━━━━━━┓
┃ foo   ┃
┡━━━━━━━┩
│ int64 │
├───────┤
│     0 │
│     1 │
│     2 │
│     … │
│ 1_000 total rows│
└───────┘

but then if you set a high max_rows it would be hard to see. Plus then this info would need to be repeated in every column. Anyway, if you have other ideas on the graphic design of where to present the counts, I'm all ears.

@lboller-pwbm
Copy link

lboller-pwbm commented Nov 21, 2024

This looks fantastic!

I tried it out in a Jupyter kernel with a fairly wide table, which appears as a horizontally scrollable output using VSCode's interactive window. The dimensions appear but are center-justified, so the user has to scroll horizontally quite a bit to see them. I would suggest making it left-justified.

ibis/expr/types/pretty.py Outdated Show resolved Hide resolved
@@ -76,6 +79,7 @@ class Interactive(Config):
max_string: int = 80
max_depth: int = 1
show_types: bool = True
show_count: bool = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a nice feature to have. I do not think it should default to true.

Interactive mode is supposed to be interactive, and something that can be computationally expensive isn't something we want to turn on by default.

The main failure case I can think of here is reading from CSVs in a blob store. With show_count=False, this can be done with a range request and only download the first N rows (ish), except if there's an order-by. show_count=True would mean any operation would trigger a download of the entire file.

And yes, that is one more reason not to use CSVs, but 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I am OK with defaulting to False. Let's just wait for Philip to chime in to verify, but I assume he is going to have the same opinion as you.

I really don't think it is worth the complexity/cleverness, but throwing it out there that we could make each relation have its own config, and that could be auto-determined based on if it's a CSV/parquet/physical table, local/remote, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume he is going to have the same opinion as you.

We are slowly merging into the same person. The transformation is nearly complete.

I really don't think it is worth the complexity/cleverness, but throwing it out there that we could make each relation have its own config, and that could be auto-determined based on if it's a CSV/parquet/physical table, local/remote, etc.

I mean, it's definitely a reliably cheap(er) operation on anything that isn't a CSV, but I agree this seems like too much complexity. I think a better move is your suggestion of persistent default configuration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's leave the default to False, otherwise it's a huge performance footgun/bug magnet.

else:
nrows = "…"
if isinstance(tablish, ir.Table):
dims = f"{orig_ncols:_} cols by {nrows} rows"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this follow the long-standing convention of "rows by columns"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: show df.count(), len(df.schema()) when displaying interactive = True tables
5 participants