Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: respect user configuration for work with status codes #812

Merged
merged 4 commits into from
Dec 18, 2024

Conversation

Mantisus
Copy link
Collaborator

@Mantisus Mantisus commented Dec 13, 2024

Description

Fixes HTTP status code configuration behavior:

  • additional_http_error_status_codes: Properly triggers errors and retries when specified status codes are encountered
  • ignore_http_error_status_codes: Correctly treats specified error status codes as successful responses

Issues

@Mantisus
Copy link
Collaborator Author

Mantisus commented Dec 13, 2024

I have already discussed this with @janbuchar. This PR should not introduce any breaking changes. If additional_http_error_status_codes and ignore_http_error_status_codes are not specified, the previous behavior is preserved.

Status codes were previously discussed in PR: #498

However, we have differences with the JS version of crawlee:

Python:

  • 400-499 codes cause errors without retries

JS:

  • 400-499 codes are treated as successful
  • Exceptions:
    • 401, 403, 429 - cause errors with retries
    • 406 - causes error without retries and ignores ignoreHttpErrorStatusCodes parameter

@Mantisus Mantisus self-assigned this Dec 13, 2024
@Mantisus Mantisus marked this pull request as ready for review December 13, 2024 12:43
@Mantisus Mantisus requested review from janbuchar, vdusek and Pijukatel and removed request for vdusek December 13, 2024 13:21
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions, otherwise LGTM... much better!

src/crawlee/sessions/_session.py Show resolved Hide resolved
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one last naming thing, otherwise LGTM.

src/crawlee/http_clients/_base.py Outdated Show resolved Hide resolved
@vdusek
Copy link
Collaborator

vdusek commented Dec 16, 2024

However, we have differences with the JS version of crawlee:

Python:

400-499 codes cause errors without retries
JS:

400-499 codes are treated as successful
Exceptions:
401, 403, 429 - cause errors with retries
406 - causes error without retries and ignores ignoreHttpErrorStatusCodes parameter

Any ideas how we want to align it? I believe we do not have issue for that, correct? cc @janbuchar

@Mantisus Mantisus requested a review from janbuchar December 17, 2024 02:15
@janbuchar
Copy link
Collaborator

janbuchar commented Dec 17, 2024

However, we have differences with the JS version of crawlee:
Python:
400-499 codes cause errors without retries
JS:
400-499 codes are treated as successful
Exceptions:
401, 403, 429 - cause errors with retries
406 - causes error without retries and ignores ignoreHttpErrorStatusCodes parameter

Any ideas how we want to align it? I believe we do not have issue for that, correct? cc @janbuchar

I believe apify/crawlee#812 tracks this. I also think that what we do in this PR is the correct way to handle things. The "success on 4xx" in JS is a crutch that prevented retries on such errors, and we can do that more cleanly now (in both implementations).

How does the Python version handle 406 again? Can it be ignored via ignoreHttpErrorStatusCodes?

@Mantisus Mantisus requested a review from janbuchar December 17, 2024 13:47
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing - it looks like there are no tests for this behavior - could you add some? test_http_crawler would be an appropriate place I guess.

src/crawlee/basic_crawler/_basic_crawler.py Outdated Show resolved Hide resolved
@janbuchar janbuchar self-requested a review December 18, 2024 16:41
@janbuchar janbuchar merged commit 8daf4bd into apify:master Dec 18, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants