Make queries fast, filter all flexible attributes #5240

Merged: 14 commits merged into master on Jun 17, 2024

Conversation

@snejus (Member) commented May 9, 2024

Description

Another and (hopefully) final attempt to improve querying speed.

Fixes #4360
Fixes #3515
and possibly other issues related to slow queries.

This PR supersedes #4746.

What's been done

The album and item tables are joined, and the corresponding data from item_attributes and album_attributes is merged and made available for filtering. This makes it possible to have:

  • Faster album path queries, beet list -a path::some/path
  • Faster flexible attributes queries, both albums and tracks, beet list play_count:10
  • (New) Ability to filter albums with track-level (and vice versa) database field queries, beet list -a title:something
  • (New) Ability to filter tracks with album-level flexible field queries, beet list artpath:cover
  • (New) Ability to filter albums with track-level flexible field queries, beet list -a art_source:something
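
The idea behind the joined tables can be sketched with plain SQLite. This is a minimal illustration with made-up, simplified table and column names, not beets' actual schema or the SQL this PR generates: flexible attributes live in a key/value table, and joining it into the main query lets a single SQL statement filter on any attribute instead of fetching every row and filtering in Python.

```python
import sqlite3

# Hypothetical, simplified schema: `items` plus a key/value table of
# flexible attributes, mirroring the items/item_attributes split.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE item_attributes (entity_id INTEGER, key TEXT, value TEXT);
    INSERT INTO items VALUES (1, 'one'), (2, 'two');
    INSERT INTO item_attributes VALUES
        (1, 'play_count', '10'), (2, 'play_count', '3');
    """
)

# Joining the attribute table makes the flexible field filterable in SQL.
rows = con.execute(
    """
    SELECT i.title FROM items i
    JOIN item_attributes a ON a.entity_id = i.id
    WHERE a.key = 'play_count' AND CAST(a.value AS INTEGER) >= 10
    """
).fetchall()
print(rows)  # [('one',)]
```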

Benchmarks

[image: benchmark results]

You can see that querying speed is now more or less constant regardless of the query; the time taken is mostly determined by how many results need to be printed.

[image: benchmark results with this branch]

Compare this with what we had previously:

[image: benchmark results on master]

To Do

  • Documentation. (If you've added a new command-line flag, for example, find the appropriate page under docs/ to describe it.)
  • Changelog. (Add an entry to docs/changelog.rst near the top of the document.)
  • Tests. (Encouraged but not strictly required.)

Later

#5318
#5319

@snejus snejus self-assigned this May 9, 2024
@snejus snejus requested a review from wisp3rwind May 9, 2024 11:36
github-actions bot commented May 9, 2024

Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.

@snejus (Member, Author) commented May 9, 2024

I'm using the test-aura branch as the base since this PR depends on it being merged first. For now, though, I will change the base to master in order to run the tests.

@snejus snejus changed the base branch from test-aura to master May 9, 2024 21:05
@snejus snejus force-pushed the only-fast-filtering branch from 3c293fd to bdb7fd9 Compare May 9, 2024 21:06
@@ -1019,7 +1019,7 @@ def find_duplicates(self, lib):
# temporary `Item` object to generate any computed fields.
tmp_item = library.Item(lib, **info)
keys = config["import"]["duplicate_keys"]["item"].as_str_seq()
- dup_query = library.Album.all_fields_query(
+ dup_query = library.Item.match_all_query(
snejus (Member, Author):

It seemed to me this was supposed to be Item instead of Album?

Contributor:

Album and Item don't redefine all_fields_query, so calling it on either an album or an item object is effectively the same, since it just uses LibModel's implementation. It's definitely a bit confusing though, so it's worth fixing!

snejus (Member, Author):

I don't think it's the same: it uses Album._fields or Item._fields respectively, and these are different. See:

    @classmethod
    def field_query(
        cls,
        field,
        pattern,
        query_cls: Type[FieldQuery] = MatchQuery,
    ) -> FieldQuery:
        """Get a `FieldQuery` for this model."""
        return query_cls(field, pattern, field in cls._fields)

    @classmethod
    def all_fields_query(
        cls: Type["Model"],
        pats: Mapping,
        query_cls: Type[FieldQuery] = MatchQuery,
    ):
        """Get a query that matches many fields with different patterns.

        `pats` should be a mapping from field names to patterns. The
        resulting query is a conjunction ("and") of per-field queries
        for all of these field/pattern pairs.
        """
        subqueries = [cls.field_query(k, v, query_cls) for k, v in pats.items()]
        return AndQuery(subqueries)
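
To illustrate the point with toy stand-in classes (not beets' real Model/Album/Item), the last argument produced by field_query depends on cls._fields, so the class the query is built from does matter:

```python
# Toy sketch: a bare-bones stand-in for the Model hierarchy above.
class Model:
    _fields: set = set()

    @classmethod
    def field_query(cls, field, pattern):
        # Stands in for FieldQuery(field, pattern, fast); the last element
        # reflects whether `field` is a fixed field of this class.
        return (field, pattern, field in cls._fields)


class Album(Model):
    _fields = {"albumartist", "artpath"}  # hypothetical fixed fields


class Item(Model):
    _fields = {"title", "path"}  # hypothetical fixed fields


print(Album.field_query("title", "x"))  # ('title', 'x', False)
print(Item.field_query("title", "x"))   # ('title', 'x', True)
```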

@Serene-Arc (Contributor):

I'll be able to review in a week or two; I'm just in the end-of-semester push at the moment.

@snejus snejus force-pushed the only-fast-filtering branch from bdb7fd9 to 070c87f Compare May 10, 2024 08:58
@wisp3rwind (Member):

Hey, also just chiming in to say that it will take some time for me to go through the current batch of PRs. I won't be able to keep up the response times I had last week, but I'll slowly work through all of them.

@snejus snejus force-pushed the only-fast-filtering branch from 070c87f to e2ffa2d Compare June 3, 2024 12:17
@snejus snejus force-pushed the only-fast-filtering branch from e2ffa2d to 1ab1abd Compare June 10, 2024 13:34
@Serene-Arc (Contributor):

I've left some comments on specific lines, but here are some general points. I noticed that you changed field to col_name in the definition of FieldQuery, which is reflected in bareasc. Do you think this will have an impact on compatibility with plugins more broadly? Going through the plugins with grep, I couldn't find any other instances, but it is something to keep in mind. You haven't altered any of the interfaces in our docs, so it isn't too much of an issue, but it will definitely have to be mentioned in the changelog.

I think it's quite clever how you merged all of the flexible attributes and optimised the SQL queries! This is a great PR; it will really improve how beets works.

Btw, for those later PRs to do, I'd open bug/enhancement issues so we don't forget!

Also, those benchmarks are great. We should add them as a poetry command, I think; another PR would probably be best. What did you use to make them?

@snejus (Member, Author) commented Jun 13, 2024

I've left some comments on specific lines and things

Hm, I'm not able to see them! Did you submit the review?

beets/dbcore/db.py (review thread resolved)
beets/dbcore/query.py (review thread resolved)
beets/library.py (outdated; review thread resolved)
@Serene-Arc (Contributor):

...I did not! I have now.

@snejus (Member, Author) commented Jun 14, 2024

I noticed that you changed field to col_name in the definition of FieldQuery which is reflected in bareasc. Do you think that this will have an impact on the compatibility with plugins more broadly? Going through the plugins with grep, I couldn't find any other instances but it is something to keep in mind.

Well noted, @Serene-Arc. I think I adjusted every internal plugin that used this functionality. Beyond that, I haven't seen any external plugins relying on this interface.

You haven't altered any of the interfaces in our docs so it isn't too much of an issue, but it will definitely have to be mentioned in the changelog.

You're right, this definitely needs a mention in the changelog!

@snejus snejus force-pushed the only-fast-filtering branch from 1ab1abd to a8c0cf4 Compare June 14, 2024 12:25
@snejus (Member, Author) commented Jun 16, 2024

Also, those benchmarks are great. We should add them as a poetry command I think, in another PR would probably be best. What have you used to make those?

This one's a bit complicated. 😅 In short,

  • used hyperfine to compare the performance of the two versions and save the results as JSON
  • jq to handle the JSON data
  • rich-tables to render it
  • zsh as the glue between these parts

In detail

1. Prepare each executable for testing

  • Install beets from the master branch as beetmaster (run in the repository): pipx install . --suffix master
  • Install beets from my branch as beetfast: pipx install -e . --suffix fast

Now one can run beetmaster and beetfast to test each version respectively.

2. Prepare some queries for testing

I added the following contents to a file ./queries.txt, one query per line:

artpath::a
artpath::a -a
path::a
path::a -a
art_source::a
art_source::a -a
title:a
title:a -a
play_count:5..
play_count:5.. -a
play_count:50..
play_count:50.. -a
art_source::a play_count::5 city:Be
art_source::a play_count::5 city:Be -a
art_source::. play_count:..5 city::.*
art_source::. play_count:..5 city::.* -a

3. Use hyperfine to compare the performance of the two

compare_performance() {
  # args: <queries-file> <results-json-file>
  # info: read queries from _queries-file, compare performance and save results to _results-json-file
  # read queries
  local queries_file=$1
  local results_json_target=$2

  local -a queries
  queries=(${(f@)"$(grep -v "^#" $queries_file)"})

  # compare performance of beetmaster and beetfast using hyperfine.
  # For each of the given queries, hyperfine runs each command 5 times
  # and exports the results to a json file.
  hyperfine --parameter-list query ${(j:,:)queries} --runs 5 --ignore-failure \
    --export-json $results_json_target \
    'beetmaster -l /home/sarunas/.music/beets/library.db list {query}' \
    'beetfast -l /home/sarunas/.music/beets/library.db list {query}'
}

4. Count how many items each command returns and insert it into the results

add_results_count() {
  # args: <performance-results-json> <performance-with-results-count-json-target>
  # info: count the number of results each command returns for each query, add
  # this to the performance results json file and save it in the given target file
  local performance_json=$1
  local target_json=$2
  local -a runs
  runs=(${(f@)"$(jq '.results[].command' < $performance_json -r)"})

  local run
  rm -f results_count.txt  # -f: don't fail if the file doesn't exist yet
  # save each command (which includes the executable and query) and the number of results it returned
  for run in $runs; do
    print "$run $(eval $run | wc -l)" >> results_count.txt  # eval: $run holds a full command line
  done

  # parse the results to JSON objects
  jq '
    capture("(?<command>.*) (?<results>[^ ]+)$")
    | {(.command): (.results | tonumber)}
  ' -R <results_count.txt |
    jq add -s > results_count.json

  # add the results count to the performance results json
  jq '.results |= map(.results_count = $index[.command])' \
    --argjson index "$(<results_count.json)" \
    <$performance_json >$target_json
}

5. Get relevant data from the results JSON and display them with rich-tables

show_hyperfine_results() {
  # args: <results-json-file>
  # info: print hyperfine results nicely
  local -a json
  zparseopts -D -E j=json
  jq '
    def mv(before; after): .[after] = .[before] | del(.[before]);
    def rm(regex): gsub(regex; "");
    def round(num): . * pow(10; num) | round / pow(10; num);
    def group_map(field): 
        (field | split(",") | map(rm("^ *| *$"))) as $key
        | sort_by(.[$key[]])
        | group_by(.[$key[]]) 
        | sort_by(length)
        | (map({([.[0][$key[]]| if type == "array" then join(", ") end|tostring] | join(" | ")): map(del(.[$key[]]))}) | add);

    def group_map_dict(field): 
        (field | split(",")) as $key
        | sort_by(.[$key[]])
        | group_by(.[$key[]]) 
        | sort_by(length)
        | map((.[0] as $item | $key | map({(.): $item[.]}) | add) + {"items": map(del(.[$key[]]) | select(length > 0))});

    .results
    | map(
      del(.times, .stddev, .median, .system, .user, .exit_codes)
      | (.mean, .min, .max) |= round(2)
      | mv("mean"; "duration")
      | mv("results_count"; "results")
      | .command |= (rm("BEETSDIR= ") | rm(" .*"))
      | .parameters |= .query
      | if (.parameters | test(" -a")) then .type = "album" else .type = "item" end
      | {type, command, parameters, results, duration}
    )
    | group_map("type")
    # | map_values(
      # group_map_dict("parameters")
    # )
    # | sort_by(.items[0].mean - .items[1].mean)
  ' $1 | {
    if (( $#json )); then
      jq
    else
      table
    fi
  }
}

Steps 3-5 combined:

benchmark_beets() {
  # args:
  # info: benchmark beetmaster and beetfast
  local queries_file=queries.txt
  local performance_json=master-vs-beetfast-counts.json
  local results_json=benchmark_with_counts.json

  compare_performance $queries_file $performance_json
  add_results_count $performance_json $results_json
  show_hyperfine_results $results_json
}

@snejus snejus force-pushed the only-fast-filtering branch 4 times, most recently from c7924ba to f253d4d Compare June 16, 2024 13:08
@snejus snejus merged commit 143b920 into master Jun 17, 2024
12 checks passed
@snejus snejus deleted the only-fast-filtering branch June 17, 2024 08:10
@arsaboo (Contributor) commented Jun 18, 2024

@snejus I just updated after this PR, and it seems that everything is painfully slow. Even a simple beet update takes significantly longer. Are we supposed to do anything else?

@Serene-Arc (Contributor):

@arsaboo what kind of requests are taking longer?

@arsaboo (Contributor) commented Jun 19, 2024

beet update seems the worst hit; even beet import seems slow. Let me know what additional information you would like.

@snejus (Member, Author) commented Jun 19, 2024

Hmm, interesting. @Serene-Arc, do not release a new version just yet; I will investigate this over the coming days.

@snejus (Member, Author) commented Jun 19, 2024

@arsaboo, can you provide the output of beet stats here?

Also, what's your system, Python and SQLite versions?

@arsaboo (Contributor) commented Jun 19, 2024

I added time to the commands... even stats takes a while.

time beet stats
Tracks: 55012
Total time: 27.3 weeks
Approximate total size: 1.2 TiB
Artists: 16399
Albums: 11414
Album artists: 2361

real    0m27.519s
user    0m24.341s
sys     0m3.152s

Here's another output, from beet import, even after I specified the ID (it took over a minute after I accepted the prompt immediately). This used to be almost instantaneous:

$ time beet import -m -I -t ~/shared/music/ --set genre="Filmi" --search-id 0vGMpTlGXYZ
deezer: Deezer API error: no data

/home/arsaboo/shared/music/Jubin Nautiyal/REDACTED (2024) (1 items)

  Match (88.2%):
  Jubin Nautiyal - REDACTED
  ≠ album, tracks
  Spotify, None, 2024, None, Tips Industries Ltd, None, None
  https://open.spotify.com/album/0vGMpTlGXYZ
  * Artist: Jubin Nautiyal
...
➜ [A]pply, More candidates, Skip, Use as-is, as Tracks, Group albums,
Enter search, enter Id, aBort, Print tracks, Open files with Picard,
eDit, edit Candidates?

real    1m15.144s
user    1m1.506s
sys     0m10.224s
~$ sqlite3 --version
3.37.2 2022-01-06 13:25:41 872ba256cbf61d9290b571c0e6d82a20c224ca3ad82971edc46b29818d5dalt1
$ python3 --version
Python 3.10.12

@snejus (Member, Author) commented Jun 19, 2024

I see. You've got an old SQLite version which doesn't have built-in JSON functionality. Could you try upgrading your SQLite to version 3.38.0 or higher?

You can see in the code that we define the required JSON functions for SQLite < 3.38.0, and these are definitely going to be slower than the built-in ones. I didn't realise they were this slow, though.
[image: screenshot of the fallback JSON function definitions]
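
An illustrative sketch (not beets' actual code) of why such fallbacks are slow: on an older SQLite, a Python implementation can be registered via sqlite3's create_function, but it is then invoked once per row through Python, which is far costlier than the built-in C json_extract. The function name and path handling here are simplified assumptions.

```python
import json
import sqlite3


def py_json_extract(doc, path):
    # Simplified: only handles flat '$.key' paths, unlike real json_extract.
    return json.loads(doc).get(path.lstrip("$."))


con = sqlite3.connect(":memory:")
# Register the Python fallback as a SQL function taking two arguments.
con.create_function("json_extract_py", 2, py_json_extract)
row = con.execute(
    "SELECT json_extract_py('{\"play_count\": 5}', '$.play_count')"
).fetchone()
print(row)  # (5,)
```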

@snejus (Member, Author) commented Jun 19, 2024

@arsaboo can you also confirm you're using Ubuntu?

@arsaboo (Contributor) commented Jun 19, 2024

yes, I am on Ubuntu

@arsaboo (Contributor) commented Jun 19, 2024

Shouldn't SQLite be automatically updated?

@snejus (Member, Author) commented Jun 19, 2024

I am reverting this change; look for an incoming PR.

snejus added a commit that referenced this pull request Jun 19, 2024
@snejus (Member, Author) commented Jun 19, 2024

See #5326

@snejus snejus restored the only-fast-filtering branch June 19, 2024 20:59
@snejus (Member, Author) commented Jun 19, 2024

Shouldn't sqlite be automatically updated?

Unfortunately, Ubuntu does not keep your packages up to date :(. In any case, even with an up-to-date SQLite, you will see that these commands take about twice as long to execute; thus the revert, at least for now!
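
To check which SQLite your Python links against (standard library only; 3.38.0 is the version from which SQLite ships its JSON functions built in):

```python
import sqlite3

# Version of the SQLite library Python is linked against,
# e.g. '3.37.2' on Ubuntu 22.04.
print(sqlite3.sqlite_version)

# JSON functions are compiled in by default from SQLite 3.38.0 onwards.
version = tuple(map(int, sqlite3.sqlite_version.split(".")))
print(version >= (3, 38, 0))
```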

@Serene-Arc (Contributor):

How unfortunate! This PR was very well done. It does highlight, though, that we should have benchmark testing in place, at least where these types of PRs are concerned. And perhaps a wider range of machines on CI for benchmarking, such as specific Ubuntu releases, rather than just the latest.

snejus added a commit that referenced this pull request Jun 25, 2024
Fixes #4360

This PR enables querying albums by track fields and tracks by album
fields, and speeds up querying albums by `path` field.

It was originally part of #5240; however, we found that the changes related to the flexible attributes caused a performance regression. So this PR contains the first part of #5240, which joined the `items` and `albums` tables in queries.
@snejus (Member, Author) commented Dec 3, 2024

@arsaboo have you updated to Ubuntu 24?

@snejus (Member, Author) commented Dec 3, 2024

How unfortunate! This PR was very well done. It does highlight that we should have testing in place though for benchmarking purposes, at least where these types of PRs are concerned. And perhaps a wider range of machines on CI for benchmarking, such as specific ubuntu distros and so on, rather than just the latest.

Agreed! I was aware that it might be painfully slow on older versions of SQLite; it's just that I didn't realise that, at that point, Ubuntu still shipped with such an old version of SQLite.

@arsaboo (Contributor) commented Dec 3, 2024

Yes, updated now (after the initial feedback). Does it work better on 24?

@snejus (Member, Author) commented Dec 28, 2024

The SQLite version on 22 did not yet include the JSON extension, so it was painfully slow.

Try checking out the only-fast-filtering branch: how fast are queries? Is the speed comparable to master?

@arsaboo (Contributor) commented Dec 28, 2024

@snejus can you rebase the only-fast-filtering branch so I can try it again?
