Crawl stops on non-www URLs #338
Comments
Sorry, but I cannot understand what you want to say. What do you really want to do? If a site doesn't use www on its domain, you have no reason to crawl using www. I think that's not a bug. Please provide more details.
I do not know in advance whether a domain explicitly requires "www." or not.
Hey cosmiXs, I made an npm package to fix this. It's called 'redirect-chain'. You give it your entry-point URL and it gives you back the domain redirect chain. Then use this array as allowedDomains.
I'm having the same problem. I get a timeout when visiting a non-www URL, even though it loads fine in the browser. I tried @simlevesque's solution but I still get the same problem:

```js
await crawler.queue({
  url,
  allowedDomains: await redirectChain.domains(url),
});
```

Still no luck; I'm still getting a timeout.
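For what it's worth, here is a minimal end-to-end sketch of the approach suggested above, with the stray semicolon removed. The crawler calls follow headless-chrome-crawler's documented launch/queue/onIdle API, but the way 'redirect-chain' is constructed here is an assumption based on the comment above, so check that package's README before relying on it:

```js
const HCCrawler = require('headless-chrome-crawler');
// Constructor form is an assumption; only the domains() call appears in this thread.
const redirectChain = require('redirect-chain')();

(async () => {
  const url = 'http://domainname.com/';

  // Resolve the redirect chain first, so a www <-> non-www redirect
  // does not take the crawl outside of allowedDomains.
  const allowedDomains = await redirectChain.domains(url);

  const crawler = await HCCrawler.launch({
    onSuccess: (result) => console.log('Crawled:', result.response.url),
  });

  await crawler.queue({
    url,
    allowedDomains, // plain object property: comma, not a semicolon
  });

  await crawler.onIdle();
  await crawler.close();
})();
```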
What is the current behavior?
If I specify a domain like, e.g., "http://www.domainname.com/", but the preferred domain setting on the server is without "www.", then the crawling process stops.
The reverse is unfortunately also true: if the domain does use "www" but I do not include it, e.g. "http://domainname.com/", the crawling also stops.
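As a stopgap while this is open, one workaround is to whitelist both host variants up front. This is only a sketch: the hostVariants helper is hypothetical, and the crawler options mirror the queue() call used elsewhere in this thread rather than anything specific to this fix:

```js
const HCCrawler = require('headless-chrome-crawler');

// Hypothetical helper: derive both the bare and the www-prefixed hostname,
// so the crawl survives a www <-> non-www redirect in either direction.
function hostVariants(url) {
  const { hostname } = new URL(url);
  const bare = hostname.replace(/^www\./, '');
  return [bare, `www.${bare}`];
}

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: (result) => console.log('Crawled:', result.response.url),
  });

  await crawler.queue({
    url: 'http://domainname.com/',
    maxDepth: 2,
    allowedDomains: hostVariants('http://domainname.com/'),
  });

  await crawler.onIdle();
  await crawler.close();
})();
```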
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Normally I would expect it to recognize the domain name without "www".
What is the motivation / use case for changing the behavior?
Please tell us about your environment: