Poster: PiRSquared Date: Jan 21, 2015 2:51pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I think it would make sense to show data from squatted domains, even if the current owner forbids it via robots.txt. Anyway, robots.txt was meant to prevent crawlers from visiting a site, but we're talking about displaying already-crawled data. Do you have a specific example site?


Poster: rin-q Date: Jan 24, 2015 6:23pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Stumbled upon this discussion while searching for an old encyclopedia of Japanese folklore monsters whose domain hasn't been renewed, hoping to find a way to access the older entries from before the website disappeared.

The domain has since been bought by a reseller, and because a robots.txt file has been added, none of the information that was available two years ago can be reached via the Wayback Machine.

A good example website would be obakemono dot com.

A big loss for those interested in Japanese folklore, sadly.


Poster: PiRSquared Date: Mar 12, 2018 10:39am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

At least partially saved here: https://archive.today/www.obakemono.com
This post was modified by PiRSquared on 2018-03-12 17:39:04


Poster: billybiscuits Date: Oct 22, 2017 2:47pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

This doesn't work for sites without a "www."


Poster: rin-q Date: Jan 27, 2015 7:06pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Interesting. Domain squatters and the like are already such a pain when doing research on the Web... That the biggest Internet « archive » blocks access to already-archived pages simply because the current domain owner has put up a no-crawler policy, without doing any kind of verification, isn't exactly great. Couldn't there be, at least, a check against current and past domain record data? Even though such records can be anonymized (to a certain extent, at least), they could help determine whether or not the robots.txt should be ignored. If anything, I guess this trick can serve as a way to access those already-crawled pages while the issue gets sorted out. Thanks a bunch!
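To make the idea concrete, a minimal sketch of such a check might look like this (Python, standard library only; the registrar/creation-date comparison is my own guess at a useful signal, and the "at crawl time" values would have to come from an archived domain record, which is itself hypothetical here):

    import socket

    def whois_lookup(domain: str, server: str = "whois.verisign-grs.com") -> str:
        """Raw WHOIS query over port 43 (this server handles .com/.net)."""
        with socket.create_connection((server, 43), timeout=10) as sock:
            sock.sendall(domain.encode("ascii") + b"\r\n")
            chunks = []
            while chunk := sock.recv(4096):
                chunks.append(chunk)
        return b"".join(chunks).decode("utf-8", errors="replace")

    def field(record: str, name: str) -> str:
        """Pull a single 'Name: value' field out of a WHOIS response."""
        for line in record.splitlines():
            key, _, value = line.strip().partition(":")
            if key.strip().lower() == name.lower():
                return value.strip()
        return ""

    def robots_should_apply(domain: str, registrar_at_crawl: str,
                            created_at_crawl: str) -> bool:
        """Honor the current robots.txt only if the domain still looks like
        the one that was crawled. The *_at_crawl values are assumed to come
        from archived WHOIS records (a hypothetical data source)."""
        current = whois_lookup(domain)
        return (field(current, "Registrar") == registrar_at_crawl
                and field(current, "Creation Date") == created_at_crawl)

The idea being that if the registrar or the creation date no longer matches what was on record when the pages were crawled, the domain has most likely changed hands, and the new owner's robots.txt arguably shouldn't apply retroactively.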
This post was modified by rin-q on 2015-01-28 03:06:38


Poster: PiRSquared Date: Jan 27, 2015 7:33pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Is whois data archived?


Poster: rin-q Date: Jan 28, 2015 10:03am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Well, I am aware of at least one service that, while I haven't personally tried it, provides WHOIS record history: DomainTools. They claim to have archived WHOIS records since 1995, and one can gain access to them for a monthly fee. I wouldn't know whether the Internet Archive keeps such records (I can only hope so).

Another way to at least partially decide whether or not to respect the robots.txt would be to ignore it at first and run a sort of integrity check between the last archived content and the current content. If the two differ too much, the robots.txt file should be ignored for already-archived content, but respected for newer crawls. Obviously, this probably wouldn't work in every case, but it would still be a better way to go, if you ask me. Or the robots.txt file could simply prevent new crawls while still allowing visitor access to already-crawled content.

The current situation feels like a library making a book consultation-only, then erasing past borrowers' memories of the book because of the new policy. I mean, the data is still there (as you've shown me earlier), so why not just allow access to it?
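For what it's worth, a rough sketch of that integrity check could look something like this (Python, standard library only; the 0.5 similarity threshold and the snapshot URL are assumptions on my part, not how the Wayback Machine actually decides anything):

    import difflib
    import re
    import urllib.request

    def page_text(url: str) -> str:
        """Fetch a page and crudely strip markup so we compare prose, not HTML."""
        with urllib.request.urlopen(url, timeout=15) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        return re.sub(r"<[^>]+>", " ", html)

    def honor_robots_for_old_snapshots(live_url: str, last_snapshot_url: str,
                                       threshold: float = 0.5) -> bool:
        """If the live page no longer resembles the last archived capture,
        the domain has probably changed hands, so the new robots.txt
        shouldn't hide the old captures."""
        archived = page_text(last_snapshot_url)  # e.g. a web.archive.org capture
        live = page_text(live_url)
        similarity = difflib.SequenceMatcher(None, archived, live).ratio()
        return similarity >= threshold

Crude, of course (a mere redesign would also trip it), but a squatter's parking page would score near zero against a full encyclopedia.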
This post was modified by rin-q on 2015-01-28 18:00:57
This post was modified by rin-q on 2015-01-28 18:03:35