Crawl stops on non-www URLs #338
Comments
Sorry, but I cannot understand what you want to say. What do you really want to do? If a site doesn't use www on its domain, you have no reason to crawl using www. I think that's not a bug. Please provide more details.
I do not know in advance whether a domain explicitly requires "www." or not.
Hey cosmiXs, I made an npm package to fix this. It's called 'redirect-chain'. You give it your entry-point URL and it gives you back the domain redirect chain. Then use this array as allowedDomains.
I'm having the same problem. I get a timeout when visiting a non-www URL, even though it loads fine in the browser. I tried @simlevesque's solution but I still get the same problem:

```js
await crawler.queue({
  url,
  allowedDomains: await redirectChain.domains(url),
});
```

Still no luck; I'm still getting a timeout.
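For what it's worth, here is a minimal end-to-end sketch of the approach suggested above, with the stray semicolon removed. The crawler calls follow headless-chrome-crawler's documented launch/queue/onIdle API, but the way 'redirect-chain' is constructed here is an assumption based on the comment above, so check that package's README before relying on it:

```js
const HCCrawler = require('headless-chrome-crawler');
// Constructor form is an assumption; only the domains() call appears in this thread.
const redirectChain = require('redirect-chain')();

(async () => {
  const url = 'http://domainname.com/';

  // Resolve the redirect chain first, so a www <-> non-www redirect
  // does not take the crawl outside of allowedDomains.
  const allowedDomains = await redirectChain.domains(url);

  const crawler = await HCCrawler.launch({
    onSuccess: (result) => console.log('Crawled:', result.response.url),
  });

  await crawler.queue({
    url,
    allowedDomains, // plain object property: comma, not a semicolon
  });

  await crawler.onIdle();
  await crawler.close();
})();
```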
What is the current behavior?
If I specify a domain like, e.g., "http://www.domainname.com/", but the preferred domain setting on the server is without "www.", then the crawling process stops.
The reverse is unfortunately also true: if the domain does use "www" but I do not include it, e.g. "http://domainname.com/", the crawling also stops.
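As a stopgap while this is open, one workaround is to whitelist both host variants up front. This is only a sketch: the hostVariants helper is hypothetical, and the crawler options mirror the queue() call used elsewhere in this thread rather than anything specific to this fix:

```js
const HCCrawler = require('headless-chrome-crawler');

// Hypothetical helper: derive both the bare and the www-prefixed hostname,
// so the crawl survives a www <-> non-www redirect in either direction.
function hostVariants(url) {
  const { hostname } = new URL(url);
  const bare = hostname.replace(/^www\./, '');
  return [bare, `www.${bare}`];
}

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: (result) => console.log('Crawled:', result.response.url),
  });

  await crawler.queue({
    url: 'http://domainname.com/',
    maxDepth: 2,
    allowedDomains: hostVariants('http://domainname.com/'),
  });

  await crawler.onIdle();
  await crawler.close();
})();
```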
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Normally I would expect it to recognize the domain name without "www".
What is the motivation / use case for changing the behavior?
Please tell us about your environment: