[QUESTION] How should I paginate my data using DataLoader? #231
---

One question @lffg: are you using DataLoader already in your project? DataLoader does not have any concept of a database. It's only a utility that maps many calls to a batch function. Therefore, for …
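To make that concrete, here is a minimal sketch of the batching behavior (the `batchFn` below is a stand-in, not from this thread): all `.load()` calls made in the same tick are coalesced into a single call to the batch function.

```js
import DataLoader from 'dataloader';

// A batch function receives every key requested in the same tick.
// What it does internally (database, HTTP, anything) is up to you.
async function batchFn(keys) {
  console.log('batchFn called once with', keys); // -> [1, 2]
  // Must return one result per key, in the same order as `keys`.
  return keys.map((key) => ({ id: key }));
}

const loader = new DataLoader(batchFn);

// Two .load() calls, but only one batchFn call.
const [a, b] = await Promise.all([loader.load(1), loader.load(2)]);
```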
---

@lffg Most likely there won't be any UI screens where you would need to fetch the list of users alongside their comments (thinking of a user profile page, for example, where you would need to fetch comments just for that specific user). If that's true, you may want to move the "comments" field away from the "user" object, up to the top-level (root) object type. You may want to introduce a convention in your API design stating that top-level fields must fetch data directly (from the database), instead of using data loaders. The GraphQL query for fetching a user plus his comments may look like this:

```graphql
query {
  user(username: "john_wick") { ... }
  comments(author: "john_wick", first: 50, after: 100) { ... }
}

# OR, if you like to fetch data by user ID
query {
  user: node(id: "123") { ... on User { ... } }
  comments(authorId: "123", first: 50, after: 100) { ... }
}
```

Note that the same top-level `comments` field can serve other screens as well:

```graphql
query {
  story(slug: "hello-world-ae3c21") { ... }
  comments(story: "hello-world-aw3c21") { ... }
}
```
And here is an example of the `comments` field implementation:

```js
import { GraphQLNonNull, GraphQLString } from 'graphql';
import {
  connectionDefinitions,
  forwardConnectionArgs,
  connectionFromArraySlice,
  cursorToOffset,
} from 'graphql-relay';

import db from '../db';
import { countField } from '../utils';
import { CommentType } from '../types';

export const comments = {
  type: connectionDefinitions({
    name: 'Comment',
    nodeType: CommentType,
    connectionFields: { totalCount: countField },
  }).connectionType,

  args: {
    ...forwardConnectionArgs,
    story: { type: GraphQLString },
    author: { type: GraphQLString },
  },

  async resolve(root, args, ctx) {
    const query = db.table('comments');

    // Optionally, filter by story slug
    if (args.story) {
      query.leftJoin('stories', 'stories.id', 'comments.story_id');
      query.where('stories.slug', '=', args.story);
    }

    // Optionally, filter by author
    if (args.author) {
      query.leftJoin('users', 'users.id', 'comments.author_id');
      query.where('users.username', '=', args.author);
    }

    const limit = args.first === undefined ? 20 : args.first;
    const offset = args.after ? cursorToOffset(args.after) + 1 : 0;

    const data = await query
      .clone()
      .limit(limit)
      .offset(offset)
      .select('comments.*');

    return {
      ...connectionFromArraySlice(data, args, {
        sliceStart: offset,
        arrayLength: offset + data.length,
      }),
      query,
    };
  },
};
```

See also https://github.com/kriasoft/relay-starter-kit -> …
---

That query was just an example... I don't have any page that needs it. I am simply trying to experiment with pagination and DataLoader, still with no success. Nevertheless, thanks @koistya; changing the query might help, although I think I lose some of GraphQL's power by doing so. If anybody can help me with further information or advice, I would appreciate it.
---

I read through the discussion, but it did not become clear to me what the preferred practice would be. I read through the readme carefully and found this section, so it appears as though we are meant to (I do realize that this is not a "you can only implement it like so" scenario) fetch the user IDs, order them accordingly, and then fire another request for the actual records with the corresponding IDs, which plays well with their Ent pattern. I do not see, however, how the …
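For reference, a sketch of that two-request pattern under assumed names (`searchUserIds` and the table layout are illustrative, not from the readme): the first query fetches only ordered IDs, and the records come from a keyed DataLoader.

```js
import DataLoader from 'dataloader';
import db from '../db'; // assumed Knex-style instance, as elsewhere in this thread

// Keyed fetch: batch-loads full user records by ID, preserving key order.
const userLoader = new DataLoader(async (ids) => {
  const rows = await db.table('users').whereIn('id', [...ids]).select('*');
  const byId = new Map(rows.map((row) => [row.id, row]));
  return ids.map((id) => byId.get(id) ?? null);
});

// Search query: returns only the IDs, already ordered and paginated.
async function searchUserIds({ limit, offset }) {
  const rows = await db
    .table('users')
    .orderBy('created_at', 'desc')
    .limit(limit)
    .offset(offset)
    .select('id');
  return rows.map((row) => row.id);
}

// Resolver: IDs come from the search query, records from the loader, so other
// fields in the same request that load these users hit the loader cache.
async function resolveUsers(args) {
  const ids = await searchUserIds({ limit: args.first ?? 20, offset: 0 });
  return userLoader.loadMany(ids);
}
```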
---

I have a little solution for pagination with DataLoader. It's not that advanced, but it gets the job done for me for now.

So, let's say that we have a "Feeds" page like Twitter. It loads posts, shows the comment count of each post, maybe likes/dislikes etc. too. To simplify, let's just say that we need paginated comments for each post. For each set of loader arguments, I hash the args and select that hash as a `cacheKey` column:

```js
argsArray.map(args => {
  const sqlQuery = sql`
    SELECT *, ${objectHash(args)} AS cacheKey
    ...
  `
})
```

So in the end, we have a query like this:

```sql
SELECT *, 'qwerty' as cacheKey
....
UNION ALL
SELECT *, 'asdfg' as cacheKey
.....
UNION ALL
....
```

When we get the result, I just map the rows to their cacheKeys and return the result to DataLoader. This method helps me reduce db hits: 1 for posts, 1 for the comment counts of each post, and 1 for the paginated comments. I couldn't find an ORM that supports a query like this; I ended up using raw SQL after all my trials. That is a little downside of this method. And even though this is a simple scenario, it can get messy for complex ones.

I'm not a deep-dive GraphQL expert, but this should be a pretty common problem. Yet, let alone answers, there are nearly no questions around; only a few topics and issues. So I don't know if people just don't care about how many times they query the db for a simple list, or if they use another caching/batching/query method. This issue is one of the biggest problems with GraphQL APIs for me.
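A self-contained sketch of how such a loader could be wired up (assumptions: a Knex-style `db`, the `object-hash` package, and illustrative table/column names; this is a reconstruction, not the commenter's exact code):

```js
import DataLoader from 'dataloader';
import objectHash from 'object-hash';
import db from '../db'; // assumed Knex instance

// Batches { postId, limit, offset } argument objects into one UNION ALL query,
// tagging each branch with a hash of its args so rows can be mapped back to
// the .load() call that asked for them.
const paginatedCommentsLoader = new DataLoader(
  async (argsArray) => {
    const branches = argsArray.map((args) =>
      db
        .table('comments')
        .select('*', db.raw('? AS "cacheKey"', [objectHash(args)]))
        .where('post_id', args.postId)
        .orderBy('created_at', 'desc')
        .limit(args.limit)
        .offset(args.offset)
    );
    // One round trip: (branch1) UNION ALL (branch2) UNION ALL ...
    const { rows } = await db.raw(
      branches.map(() => '(?)').join(' UNION ALL '),
      branches
    );
    return argsArray.map((args) => {
      const key = objectHash(args);
      return rows.filter((row) => row.cacheKey === key);
    });
  },
  { cacheKeyFn: objectHash } // dedupe by hashed args, not object identity
);

// Usage: both calls below end up in a single SQL statement.
// paginatedCommentsLoader.load({ postId: 1, limit: 10, offset: 0 });
// paginatedCommentsLoader.load({ postId: 2, limit: 10, offset: 0 });
```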
---

A little update: using a separate count field next to the connection works well.

```graphql
{
  commentsCount # or commentCount, that's up to you
  comments {
    edges { ... }
    pageInfo { .... }
  }
}
```

If you look at the schema of …
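For illustration, one way such a `commentsCount` field could be resolved without an N+1 (a sketch assuming a Knex-style `db` and a `comments.post_id` column; these names are not from the comment above):

```js
import DataLoader from 'dataloader';
import db from '../db'; // assumed Knex instance

// Batches every commentsCount lookup in a tick into one GROUP BY query.
const commentsCountLoader = new DataLoader(async (postIds) => {
  const rows = await db
    .table('comments')
    .whereIn('post_id', [...postIds])
    .groupBy('post_id')
    .select('post_id')
    .count({ count: '*' });
  const byPostId = new Map(rows.map((row) => [row.post_id, Number(row.count)]));
  // Posts without comments get 0 instead of undefined.
  return postIds.map((id) => byPostId.get(id) ?? 0);
});

const resolvers = {
  Post: {
    commentsCount: (post) => commentsCountLoader.load(post.id),
  },
};
```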
---

Note: consider this more of a brainstorming idea as opposed to an actual solution, and do not use it in your own project before investigating the performance impact of the queries. I would also suggest that you read the previous comment: after some digging into this issue, I found that the best solution is to not have to do this at all. Still, should you end up in a position where you need to accomplish it, the information below might be of some value to you.

You may be able to build the query through an ORM or a query builder (I'm using Sequelize, and its syntax for this query is more complex than the raw SQL counterpart), and the example assumes some SQL familiarity. The example assumes that you want to fetch and limit/order the comments for a post. Do also note that the example doesn't really solve any real-world scenario.

```graphql
post(id: 1) {
title
content
comments(first: 24, orderBy: "createdAt", sortOrder: "desc") {
text
author {
name
}
}
}
```

We will be using two database queries to retrieve the data: query 1 loads the IDs of the comments for each post, and query 2 loads the actual comments by those IDs. I will not be covering DataLoader factory patterns (there are multiple ways of abstracting this, but we will be using the simplest way possible).

This example assumes the following database tables:

posts - id, title, content
postComments - id, postId, "order", …

```js
const commentLoader = new Dataloader(...) // Fetches comments by their IDs

const postCommentsLoader = new Dataloader(async postIds => {
const { rows } = await knex.raw(`
SELECT
ARRAY (
SELECT pc.id FROM "postComments" pc
WHERE pc."postId" = p.id
ORDER BY pc."order" DESC -- Custom ordering
LIMIT 24 -- Custom limiting
) AS postCommentIds,
p.id AS "postId"
FROM "posts" p
WHERE p.id IN(${postIds.join(',')}) -- Use variables instead, this is prone to sql injection
`)
/**
 * Objects in `rows` have this format:
 * [
 *   {
 *     postId: 1,
 *     postCommentIds: [3, 2, 1] // comment IDs
 *   },
 *   ...
 * ]
 */
const commentIdsByPostId = rows.reduce((acc, { postId, postCommentIds }) => {
acc[postId] = postCommentIds
return acc
}, {})
return postIds.map(postId => commentIdsByPostId[postId])
})
const resolvers = {
Post: {
async comments(post, args, context) {
const postCommentIds = await context.postCommentsLoader.load(post.id)
const comments = await context.commentLoader.loadMany(postCommentIds)
return comments
}
}
}
```

If you think that this is quite some boilerplate, you are correct. Most of it can be abstracted with smarter code-organization patterns, but it's still not ideal. For example, in my application this translates to something similar to this:

```js
const commentsById = createDataloader("postComments")
const postCommentsByPostId = createDataloader({
queryFn: postIds => knex.raw(`sql`).then(({ rows }) => rows),
key: "postId",
rowToValue: ({ postComments }) => postComments,
pipeThrough: postCommentIds => commentsById.loadMany(postCommentIds)
})
const resolvers = {
Post: {
async comments(post, args, context) {
// The chaining is done via the `pipeThrough`, so we must issue a single call here
const comments = await context.postCommentsByPostId.load(post.id)
return comments
}
}
}
```

If you read carefully, however, you'd see that this example does not use runtime ordering or filtering: the ordering and the limit are hard-coded in the SQL. After the digging mentioned in the beginning, I have come to the conclusion (like others) that it would be better to provide a specialized query (like the one mentioned in that Twitter post from the comment above) instead, and that specialized query is the one in my example.

So, what would the solution be, and what exactly does a specialized query mean? I interpreted it like so (albeit the chosen example is not the best fit): if you have two (or more) use cases in your application where you need to fetch a different number of comments, you can create different fields for them, with different resolvers that have hard-coded arguments. So, instead of our initial example with last/order/sortOrder variables, we would have:

```graphql
Post {
comments: [Comment!]! # last: 24 orderBy: "createdAt" sortOrder: "DESC"
recentComments: [Comment!] # last: 12 orderBy: "createdAt" sortOrder: "DESC"
}
```

That way, instead of one field with dynamic variables, you have two fields with hard-coded variables.

If you still need the single-field method with dynamic variables, according to my findings you can pass them through the loader keys. Since the only way to provide variables to a Dataloader instance is through the arguments of `.load()`, call it with an object:

```js
postCommentsByPostId.load({
postId: 1,
order: `${args.orderBy} ${args.sortOrder}`,
limit: args.first,
})
```

Then, inside the Dataloader batch function, you can grab the first key and construct the query based on its properties:

```js
const postCommentsByPostId = new Dataloader(async postIds => {
// postIds argument contains objects; we only care about the ids
const ids = postIds.map(({ postId }) => postId)
// We could ( not always ) assume that order/limit will be the same across all `postIds` objects
const { order, limit } = postIds[0]
// Use sql variables instead, this is also prone to sql injection
const { rows } = await knex.raw(`
SELECT
ARRAY (
SELECT pc.id FROM "postComments" pc
WHERE pc."postId" = p.id
ORDER BY pc."${order}" -- Custom ordering
LIMIT ${limit} -- Custom limiting
) AS postCommentIds,
p.id AS "postId"
FROM "posts" p
WHERE p.id IN(${ids.join(',')})
`)
...
})
const resolvers = {
Post: {
async comments(post, args, context) {
// This is why we would assume all properties would be the same,
// Because they are called from the same resolver and collected in a single tick
const postCommentIds = await context.postCommentsByPostId.load({
postId: post.id,
limit: args.first,
order: `${args.orderBy} ${args.sortOrder}`
})
const comments = await context.commentLoader.loadMany(postCommentIds)
return comments
}
}
}
```

Because Dataloader collects all `.load()` calls made in the same tick, the key objects passed from one resolver will all carry the same order/limit values.

There is one better solution that was mentioned in the beginning, and it is to design your schemas in a way that you don't end up in a situation where you need to do this.
---

If you are using a newer DB, like MySQL 8.0+, then you could try window functions such as `ROW_NUMBER()`:
https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-ver15
---

My brainstorming here: how about `union all`? I didn't test performance on huge data, but it looks like it works. Complete example:

```ts
import 'dotenv/config'
import Dataloader from 'dataloader'
import knex from 'knex'
const db = knex({
client: 'pg',
connection: process.env.DATABASE_URL,
useNullAsDefault: false,
debug: true,
})
interface PostsByUserIdParams {
limit: number
id: number
}
const postsByUserId = new Dataloader(
(params: readonly PostsByUserIdParams[]) => {
// Assuming all params limits are the same value
const limit = params[0].limit
// Get ids to load
const ids = params.map((param) => param.id)
// Here is the magic: this SQL loads each id with its own limit.
// Note this is POC code and is open to SQL injection.
const query = ids
.map((id) => `(select * from posts where user_id = ${id} limit ${limit})`)
.join(' union all ')
return db
.raw(query)
.then(({ rows }) =>
ids.map((id) => rows.filter((x: any) => x.user_id === id)),
)
},
)
async function main() {
await db.schema.dropTableIfExists('posts')
await db.schema.createTable('posts', (t) => {
t.increments('id')
t.integer('user_id')
})
await db('posts').insert([
{ user_id: 1 }, { user_id: 1 },
{ user_id: 2 }
])
const user1 = postsByUserId.load({ limit: 10, id: 1 })
const user2 = postsByUserId.load({ limit: 10, id: 2 })
const user3 = postsByUserId.load({ limit: 10, id: 3 })
const user4 = postsByUserId.load({ limit: 10, id: 4 })
const user5 = postsByUserId.load({ limit: 10, id: 5 })
console.log('user1', await user1)
console.log('user2', await user2)
}
main()
```
---

After nearly 1 year, I can agree on using …
---

@onderonur Thanks for sharing your opinion. Yes, it's pretty complicated and hard to implement.
---

UNION works alright, but its performance is worse than a PARTITION BY query. P.S. UNIONs are not well supported by Sequelize; use a query builder instead.
---

@KDKHD Yeah, ORM support is limited for this operation. Do you have good example code for …?
---

Hi, this is a quick example. Imagine your database stores information about different sensors in your house. Each row in your sensors table represents a sensor and has a location column. Let's say you want to get the first 5 sensors for each location (this is where you paginate). Your PARTITION BY query would look like the sketch below; when you want to get the next 5, you just change the row-number range you select. In TypeORM you can add helpers for this as prototypes. This is a pretty vague description; I might make a proper writeup about it, since this seems like a big issue.
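A reconstructed sketch of the kind of PARTITION BY query being described (assuming a `sensors` table with `id` and `location` columns, and using Knex for brevity; the comment above used TypeORM):

```js
import db from '../db'; // assumed Knex instance

// Page 0 returns row numbers 1..5 per location; page 1 returns 6..10, etc.
async function sensorsPageByLocation(page = 0, pageSize = 5) {
  const { rows } = await db.raw(
    `
    SELECT * FROM (
      SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY s.location ORDER BY s.id) AS rn
      FROM sensors s
    ) ranked
    WHERE ranked.rn > :from AND ranked.rn <= :to
    `,
    { from: page * pageSize, to: (page + 1) * pageSize }
  );
  return rows;
}
```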
---

@KDKHD Thank you for the code. The solution to use …
---

For these cases I am using `LEFT JOIN LATERAL`:

```sql
SELECT e0.id, e1.* FROM gallery AS e0 LEFT JOIN LATERAL (
  SELECT e2.* FROM photo AS e2 WHERE e2.gallery_id = e0.id LIMIT ?
) AS e1 ON e0.id = e1.gallery_id WHERE e1.gallery_id IN (?)
```

I am using MikroORM, which internally uses Knex, but unfortunately the latter doesn't support it, so I have to write the queries in SQL.
---

For my data model, the right thing seems to be to split all queries into two types: search queries that fetch the IDs of entities, and entity-fetch queries that fetch the actual entities. This allows implementing search-style GraphQL fields with multiple predicates and paging support, while still getting the benefit of DataLoaders for further queries down the graph that may fetch the same items when going back up the graph. Example using C# graphql-dotnet:

```csharp
this.Field<PagedCollectionType<SensorGroup, SensorGroupGraphType>>()
    .Name("sensorGroups")
    .PaginationArguments()
    .ResolveAsync(async context =>
    {
        // Get a paginated list of SensorGroup IDs
        var sensorGroupIds = await sensorGroupService
            .GetPagedSensorGroupIdsAsync(
                context.PaginationOptions());

        // Fetch the sensor groups by IDs with the data loader
        return sensorGroupDataLoader
            .LoadAsync(sensorGroupIds)
            .Then(sensorGroups => sensorGroups.AsPagedCollection(
                sensorGroupIds.AsOptions(),
                sensorGroupIds.Count));
    });
```

The search isn't cached by a DataLoader because it's very unlikely that you'd have two exactly identical search fields in the same query. The DB query for the search can return data straight from the index.
---

For data whose DDL looks like this:

```sql
CREATE TABLE author (
  id SERIAL PRIMARY KEY,
  name VARCHAR NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE book (
  id SERIAL PRIMARY KEY,
  name VARCHAR NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  author_id INTEGER NOT NULL REFERENCES author(id) ON DELETE CASCADE
);
```

I can come up with 3 solutions: …
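One of the simpler options along the lines discussed in this thread is to batch-load all books for the requested authors in one query and slice per author in application code, which is acceptable when per-author book counts are small. A sketch, assuming a Knex-style `db` (an illustration, not the commenter's solution list):

```js
import DataLoader from 'dataloader';
import db from '../db'; // assumed Knex instance

// Loads up to `limit` most recent books per author. All authors requested in
// the same tick share one query; the per-author slicing happens in JS.
const createBooksByAuthorIdLoader = (limit = 10) =>
  new DataLoader(async (authorIds) => {
    const rows = await db
      .table('book')
      .whereIn('author_id', [...authorIds])
      .orderBy('created_at', 'desc')
      .select('*');
    return authorIds.map((id) =>
      rows.filter((row) => row.author_id === id).slice(0, limit)
    );
  });
```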
---

I've been checking out PostGraphile, which builds pretty compact queries. This is my GraphQL query:

…

This is the PSQL query it generates (I don't know exactly how this query works, or its efficiency vs. the PARTITION BY query that I shared earlier):

…

And this is the PSQL result:

…

which I guess gets converted through some post-processing into the expected GraphQL response. I wish I had known about PostGraphile while I was trying to build my own GraphQL + DataLoader + pagination implementation, which is why I am sharing this here. Also, this query could maybe give some inspiration on how to use DataLoader + pagination better.
---

Here is an example Prisma implementation of the custom PARTITION BY helper function I mentioned above: https://github.com/KDKHD/GraphQL/blob/44185c080e3c54e5cdda9bdc2be92e24424ea3ac/utils/queryHelpers.ts#L30
---

I have the following query:

…

How can I handle pagination in those cases? Should I use DataLoader? Until now, everything that I found on the web suggests using the `splice` array method, but if I use it, my server will run out of memory; I should paginate the data in the SQL query, am I right? So how am I able to query my database with such a GraphQL query? Do you handle those cases in any of your projects? I am really struggling with this. Is it even possible to use DataLoader in this case? And if I don't use it, my server will suffer a lot from the N+1 problem...