
Poster: PiRSquared Date: Mar 12, 2018 10:39am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

At least partially saved here: https://archive.today/www.obakemono.com
This post was modified by PiRSquared on 2018-03-12 17:39:04

Poster: billybiscuits Date: Oct 22, 2017 2:47pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

This doesn't work for sites without a "www."

Poster: rin-q Date: Jan 27, 2015 7:06pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Interesting. Domain squatters and the like can already be a pain when doing research on the Web... That the biggest Internet « archive » blocks access to already-archived pages just because the current domain owner has put up a no-crawler policy, without doing any kind of check, isn't exactly great. Couldn't there be, at least, a check against current and past domain record data? Even though such records can be anonymized (to a certain extent, at least), they could help determine whether or not the robots.txt should be ignored. If anything, I guess this trick can serve as a way to access those already-crawled pages while the issue gets sorted out. Thanks a bunch!
This post was modified by rin-q on 2015-01-28 03:06:38
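A minimal sketch of the whois-based check suggested above, assuming the third-party python-whois package and a hypothetical crawl_time_record saved when the pages were first crawled; this is illustrative only, not how the Wayback Machine actually decides anything:

    # Compare the domain's current registration data against what was on record
    # when the pages were crawled; only honour the current robots.txt if the
    # domain does not appear to have changed hands.
    # NOTE: `crawl_time_record` is a hypothetical dict saved at crawl time.
    import whois  # pip install python-whois

    def ownership_changed(domain, crawl_time_record):
        """Return True if registration data looks different from crawl time."""
        current = whois.whois(domain)
        # Registrar and creation date are rough proxies for a change of hands;
        # a re-registered (squatted) domain usually shows a new creation date.
        return (current.registrar != crawl_time_record.get("registrar")
                or current.creation_date != crawl_time_record.get("creation_date"))

    def honour_robots_txt(domain, crawl_time_record):
        # If the domain seems to have changed owners, the new owner's robots.txt
        # arguably shouldn't govern access to pages archived from the old owner.
        return not ownership_changed(domain, crawl_time_record)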

Poster: PiRSquared Date: Jan 27, 2015 7:33pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Is whois data archived?

Poster: rin-q Date: Jan 28, 2015 10:03am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Well, I'm aware of at least one service that, while I haven't personally tried it, provides whois record history. That service is Domaintools; they claim to have archived whois records since 1995, and access is available for a monthly fee. Now, I wouldn't know whether the Internet Archive has such records (I can only hope so), but another way to at least partially decide whether or not to respect the robots.txt would be to ignore it at first and do a sort of integrity check between the last archived content and the current content. If the content is too different, then the robots.txt file should be ignored for already-archived content, but not for newer content. Obviously, this probably wouldn't work in every case, but it'd still be a better way to go, if you ask me. Or the robots.txt file could simply prevent new crawls while still allowing visitor access to already-crawled content.

The current situation feels like a library making a book consultation-only, then erasing past borrowers' memories of the book because of the new policy. I mean, the data is still there (as you've shown me earlier), so why not just allow access to it?
This post was modified by rin-q on 2015-01-28 18:00:57
This post was modified by rin-q on 2015-01-28 18:03:35
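A minimal sketch of the integrity-check idea above, assuming the public Wayback availability API and a crude text-similarity ratio; the 0.5 threshold and the helper names are made up for illustration, not part of any real Internet Archive policy:

    # Decide whether to ignore the live site's robots.txt for archived copies
    # by checking how much the live page still resembles the archived one.
    import difflib
    import requests

    def latest_snapshot_url(url):
        """Ask the Wayback availability API for the closest archived snapshot."""
        resp = requests.get("https://archive.org/wayback/available", params={"url": url})
        snap = resp.json().get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap else None

    def should_ignore_robots_for_archive(url, threshold=0.5):
        """Return True when the live page no longer resembles the archive
        (e.g. a parked or squatted domain), so the current robots.txt
        arguably shouldn't block access to the archived copies."""
        snap_url = latest_snapshot_url(url)
        if snap_url is None:
            return False  # nothing archived, nothing to decide
        live = requests.get(url).text
        archived = requests.get(snap_url).text
        similarity = difflib.SequenceMatcher(None, live, archived).ratio()
        return similarity < threshold

In practice one would probably compare extracted text rather than raw HTML, since boilerplate markup can dominate the ratio, but the idea is the same.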