feat: limiting scrape to 100 urls per message #619
Conversation
Signed-off-by: Sahil Silare <[email protected]>
This PR will trigger a minor release when merged.
Signed-off-by: Sahil Silare <[email protected]>
Please provide context on why this change is needed, and involve the other stakeholders of the scraper. cc @blefebvre

We found an issue with the scraper: the Lambda time limit is 15 minutes, and scraping is taking longer than that.
Signed-off-by: Sahil Silare <[email protected]>
Signed-off-by: Sahil Silare <[email protected]>
```js
    context,
  ),
];
await Promise.all(promises).then(() => say(`:adobe-run: Triggered scrape run for site \`${baseURL}\``));
```
Should this `say(":adobe-run: Triggered scrape run for site ${baseURL}")` happen before the `await Promise.all(promises)`?
Since we are saying "Completed triggering scrape runs ..." just after this, can we say `:adobe-run: Triggering scrape run for site ${baseURL}` here, before awaiting the promises?
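A minimal sketch of that reordering, assuming the `say` helper, `promises` array, and `baseURL` from the diff context above:

```js
// Announce the run first, then wait for all scrape messages to be sent.
// (This only sketches the proposed ordering; it is not the merged code.)
await say(`:adobe-run: Triggering scrape run for site \`${baseURL}\``);
await Promise.all(promises);
```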
```js
const message = `:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`;

await say(message);
const half = Math.ceil(urls.length / 2);
```
For handling a single URL: does `triggerScraperRun` handle an empty URLs array gracefully?
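To illustrate the edge case (the `slice`-based batching below is an assumption inferred from the `half` computation in the diff):

```js
// With a single URL, halving produces an empty second batch.
const urls = ['https://example.com/'];
const half = Math.ceil(urls.length / 2); // Math.ceil(1 / 2) === 1
const firstBatch = urls.slice(0, half);  // ['https://example.com/']
const secondBatch = urls.slice(half);    // [] -> would triggerScraperRun accept this?
```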
Scraping 100 URLs still seems like a big task and can exceed the time limit in cases where the site is slow to respond. Should we go with ~50? A lower number is safer, with almost no extra COGS.
```js
const message = `:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`;

await say(message);
const half = Math.ceil(urls.length / 2);
```
I think we should not depend on the count of the URLs. Say in the future we start saving more than 200 pages; we would then be scraping more than 100 pages per message, and the timeout issue could arise again.

Also, if the total number of top pages is small (IIRC we have 17 for Sunstar), we would divide that into batches of 9 and 8 and send two scrape messages, which would be unnecessary.

To avoid these issues, let's process batches of a fixed number of URLs, 50 maybe; see the sketch below.
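A sketch of the fixed-size batching being proposed. `BATCH_SIZE`, the chunking loop, and the `triggerScraperRun(batch, context)` call shape are assumptions for illustration, not the final implementation:

```js
const BATCH_SIZE = 50;

// Split the top pages into fixed-size chunks so each scrape message stays
// within the Lambda time limit, no matter how many pages we save.
const batches = [];
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
  batches.push(urls.slice(i, i + BATCH_SIZE));
}

// 17 top pages  -> one batch of 17 (a single scrape message)
// 230 top pages -> five batches: 50, 50, 50, 50, 30
await Promise.all(batches.map((batch) => triggerScraperRun(batch, context)));
```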
Signed-off-by: Sahil Silare <[email protected]>
Force-pushed from b5497f3 to 4f7a570
Signed-off-by: Sahil Silare <[email protected]>
Signed-off-by: Sahil Silare <[email protected]>
Please ensure your pull request adheres to the following guidelines:

- Describe here the problem you're solving.
- If the PR is changing the API specification: … Ideally, return a 501 status code with a message explaining the feature is not implemented yet.
- If the PR is changing the API implementation or an entity exposed through the API: …
- If the PR is introducing a new audit type: …

Related Issues

[SITES-27245] Content scraper exceeding Lambda time limit to scrape

Thanks for contributing!