
feat: limiting scrape to 100 urls per message #619

Open · wants to merge 10 commits into main from SITES-27245
Conversation

@ssilare-adobe (Contributor) commented Dec 4, 2024:

[SITES-27245] Content scraper exceeding Lambda time limit to scrape

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description, or, if no issue exists, describe the problem you're solving here.
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes.

If the PR is changing the API specification:

  • make sure you add a "Not implemented yet" note to the endpoint description if the implementation is not ready yet. Ideally, return a 501 status code with a message explaining the feature is not implemented yet.
  • make sure you add at least one example of the request and response.

If the PR is changing the API implementation or an entity exposed through the API:

  • make sure you update the API specification and the examples to reflect the changes.

If the PR is introducing a new audit type:

  • make sure you update the API specification with the type, the schema of the audit result, and an example.

Related Issues

Thanks for contributing!


github-actions bot commented Dec 4, 2024

This PR will trigger a minor release when merged.

@solaris007 (Member) left a comment:

Please provide context on why this change is needed, and involve the other stakeholders of the scraper. cc @blefebvre

@ssilare-adobe (Contributor, Author) replied:

> Please provide context on why this change is needed, and involve the other stakeholders of the scraper. cc @blefebvre

Found an issue with the scraper: the Lambda time limit is 15 minutes, and scraping is taking longer than that.
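For context, the fix splits the scraped URL list across multiple scrape messages. A minimal sketch of the halving approach, reconstructed from the diff hunks quoted below (`triggerScraperRun` is named later in this review; `jobId` and `slackContext` are assumed parameter names):

```js
// Sketch only: split the URL list in half and trigger one scrape run per half,
// so each message stays within the Lambda time limit.
const half = Math.ceil(urls.length / 2);
const promises = [
  triggerScraperRun(jobId, urls.slice(0, half), slackContext, context),
  triggerScraperRun(jobId, urls.slice(half), slackContext, context),
];
await Promise.all(promises).then(() => say(`:adobe-run: Triggered scrape run for site \`${baseURL}\``));
```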

@habansal requested a review from dipratap on December 5, 2024 at 04:51
context,
),
];
await Promise.all(promises).then(() => say(`:adobe-run: Triggered scrape run for site \`${baseURL}\``));
A contributor commented on this hunk:

> :adobe-run: Triggered scrape run for site `${baseURL}`

Should this `say` happen before the `await Promise.all(promises)`?

@dipratap replied on Dec 9, 2024:

Since we are saying "Completed triggering scrape runs ..." just after this, can we say `:adobe-run: Triggering scrape run for site ${baseURL}` here, before awaiting the promises?
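A minimal sketch of the suggested ordering, using the `say`, `promises`, `urls`, and `baseURL` names from the surrounding diff (the exact wiring is assumed):

```js
// Announce the run before awaiting, per the suggestion above.
await say(`:adobe-run: Triggering scrape run for site \`${baseURL}\``);

// Wait for all scrape messages to be sent.
await Promise.all(promises);

// Then report completion, as the diff below already does.
await say(`:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`);
```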

const message = `:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`;

await say(message);
const half = Math.ceil(urls.length / 2);
A contributor commented on this hunk:

For handling a single URL: Does triggerScraperRun handle an empty URLs array gracefully?
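A guard like the following would make the empty case graceful; the real signature of `triggerScraperRun` is not shown in this thread, so the parameters here are hypothetical:

```js
// Hypothetical sketch; triggerScraperRun's actual signature may differ.
async function triggerScraperRun(jobId, urls, slackContext, context) {
  if (!Array.isArray(urls) || urls.length === 0) {
    context.log.warn(`No URLs to scrape for job ${jobId}; skipping trigger`);
    return;
  }
  // ...queue the scrape message as before
}
```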

@dipratap commented Dec 9, 2024:

Scraping 100 URLs still seems like a big task and can exceed the time limit in cases where the site is slow to respond; should we go with ~50? A lower number is safer, with almost no extra COGS.

const message = `:white_check_mark: Completed triggering scrape runs for site \`${baseURL}\` — Total URLs: ${urls.length}`;

await say(message);
const half = Math.ceil(urls.length / 2);
A reviewer commented on this hunk:

I think we should not depend on the count of the URLs. Say in the future we start saving more than 200 pages; we would then start scraping more than 100 pages per message, and the timeout issue could arise again.
Also, if the total number of top pages is small (IIRC we have 17 for Sunstar), we would split that into 9 and 8 and send 2 scrape messages, which would be unnecessary.
To avoid both issues, let's process batches of a fixed number of URLs, maybe 50.
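A sketch of that fixed-batch approach; the batch size and the shape of the `triggerScraperRun` call are illustrative assumptions rather than the PR's final code:

```js
const BATCH_SIZE = 50; // fixed batch size, per the suggestion above

// Chunk the URL list into batches of at most BATCH_SIZE URLs.
const batches = [];
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
  batches.push(urls.slice(i, i + BATCH_SIZE));
}

// One scrape message per batch: a 17-URL site yields a single message,
// while 200 saved pages would yield four messages of 50.
await Promise.all(
  batches.map((batch) => triggerScraperRun(jobId, batch, slackContext, context)),
);
```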


Signed-off-by: Sahil Silare <[email protected]>
@ssilare-adobe force-pushed the SITES-27245 branch 2 times, most recently from b5497f3 to 4f7a570 on December 13, 2024 at 11:14