
feat: limiting scrape to 100 urls per message #619

Open · wants to merge 10 commits into main from SITES-27245
Conversation

@ssilare-adobe (Contributor) commented Dec 4, 2024:

[SITES-27245] Content scraper exceeding Lambda time limit to scrape

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description, or, if no issue exists, describe the problem you're solving here.
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes.

If the PR is changing the API specification:

  • make sure you add a "Not implemented yet" note to the endpoint description if the implementation is not ready yet. Ideally, return a 501 status code with a message explaining the feature is not implemented yet.
  • make sure you add at least one example of the request and response.

If the PR is changing the API implementation or an entity exposed through the API:

  • make sure you update the API specification and the examples to reflect the changes.

If the PR is introducing a new audit type:

  • make sure you update the API specification with the type, the schema of the audit result, and an example.

Related Issues

Thanks for contributing!


github-actions bot commented Dec 4, 2024

This PR will trigger a minor release when merged.

@solaris007 (Member) left a comment:

Please provide context on why this change is needed, and involve the other stakeholders of the scraper. cc @blefebvre

@ssilare-adobe (Contributor, Author) replied:

> Please provide context on why this change is needed, and involve the other stakeholders of the scraper. cc @blefebvre

Found an issue with the scraper: the Lambda time limit is 15 minutes, and scraping is taking longer than that.
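For context, the fix splits the scraped URL list across multiple scrape messages. A minimal sketch of the halving approach, reconstructed from the diff hunks quoted below (`triggerScraperRun` is named later in this review; `jobId` and `slackContext` are assumed parameter names):

```js
// Sketch only: split the URL list in half and trigger one scrape run per half,
// so each message stays within the Lambda time limit.
const half = Math.ceil(urls.length / 2);
const promises = [
  triggerScraperRun(jobId, urls.slice(0, half), slackContext, context),
  triggerScraperRun(jobId, urls.slice(half), slackContext, context),
];
await Promise.all(promises).then(() => say(`:adobe-run: Triggered scrape run for site \`${baseURL}\``));
```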

@habansal requested a review from dipratap on December 5, 2024 at 04:51
context,
),
];
await Promise.all(promises).then(() => say(`:adobe-run: Triggered scrape run for site \`${baseURL}\``));
A contributor commented on this hunk:

> :adobe-run: Triggered scrape run for site `${baseURL}`

Should this `say` happen before the `await Promise.all(promises)`?

@dipratap replied on Dec 9, 2024:

Since we are saying "Completed triggering scrape runs ..." just after this, can we say `:adobe-run: Triggering scrape run for site ${baseURL}` here, before awaiting the promises?
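A minimal sketch of the suggested ordering, using the `say`, `promises`, `urls`, and `baseURL` names from the surrounding diff (the exact wiring is assumed):

```js
// Announce the run before awaiting, per the suggestion above.
await say(`:adobe-run: Triggering scrape run for site \`${baseURL}\``);

// Wait for all scrape messages to be sent.
await Promise.all(promises);

// Then report completion, as the diff below already does.
await say(`:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`);
```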

const message = `:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`;

await say(message);
const half = Math.ceil(urls.length / 2);
A contributor commented on this hunk:

For handling a single URL: Does triggerScraperRun handle an empty URLs array gracefully?
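A guard like the following would make the empty case graceful; the real signature of `triggerScraperRun` is not shown in this thread, so the parameters here are hypothetical:

```js
// Hypothetical sketch; triggerScraperRun's actual signature may differ.
async function triggerScraperRun(jobId, urls, slackContext, context) {
  if (!Array.isArray(urls) || urls.length === 0) {
    context.log.warn(`No URLs to scrape for job ${jobId}; skipping trigger`);
    return;
  }
  // ...queue the scrape message as before
}
```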

@dipratap commented Dec 9, 2024:

Scraping 100 URLs still seems like a big task and can exceed the time limit in cases where the site is slow to respond; should we go with ~50? A lower number is safer, with almost no extra COGS.

const message = `:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`;

await say(message);
const half = Math.ceil(urls.length / 2);
A reviewer commented on this hunk:

I think we should not depend on the count of the URLs. Say in the future we start saving more than 200 pages; we would then start scraping more than 100 pages per message, and the timeout issue could arise again.
Also, if the total number of top pages is small (IIRC we have 17 for Sunstar), we would split that into 9 and 8 and send 2 scrape messages, which would be unnecessary.
To avoid both issues, let's process batches of a fixed number of URLs, maybe 50.
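A sketch of that fixed-batch approach; the batch size and the shape of the `triggerScraperRun` call are illustrative assumptions rather than the PR's final code:

```js
const BATCH_SIZE = 50; // fixed batch size, per the suggestion above

// Chunk the URL list into batches of at most BATCH_SIZE URLs.
const batches = [];
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
  batches.push(urls.slice(i, i + BATCH_SIZE));
}

// One scrape message per batch: a 17-URL site yields a single message,
// while 200 saved pages would yield four messages of 50.
await Promise.all(
  batches.map((batch) => triggerScraperRun(jobId, batch, slackContext, context)),
);
```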


Signed-off-by: Sahil Silare <[email protected]>
@ssilare-adobe force-pushed the SITES-27245 branch 2 times, most recently from b5497f3 to 4f7a570 on December 13, 2024 at 11:14