Poster: Detective John Carter of Mars Date: Dec 27, 2011 3:01pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

@http://www.archive.org/about/faqs.php#2
"The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection."

Poster: PiRSquared Date: Sep 6, 2014 9:20pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

What if the domain is squatted/taken over by another person?

Poster: d0c5i5 Date: Jan 21, 2015 2:32pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

aarrrgggg...

Why hasn't this been fixed? I used to be able to find so many things that I can't find anymore, because these domain pirates buy up barely used/forgotten/lapsed domain names and often put up a robots.txt (along with countless USELESS ads to nowhere)...

Look, I love collecting old hardware, or resurrecting old hardware from countless places and doing stuff with it. For so many Linux/GNU projects there may be only a few scarce references to how something was done, pieces of code, or even small downloads that are completely worthy of being preserved. But as the hardware ages (or the authors literally die), this data gets erased from history, and I'm often left with links in forums to source code/downloads/whatever that point to what was free/open data (even LICENSED as distributable, if the GNU GPL applies, so I doubt the new owner trying to make a buck off all the people who could end up on the domain they snatched has any more claim to it than I do)...

Hmmm... If I were to name my kid "Disney", and Disney died or forgot to fill out a form, etc., would/could I ever wipe out all of the Disney movies from history?

Poster: PiRSquared Date: Jan 21, 2015 2:51pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I think it would make sense to show data from squatted domains, even if the current owner forbids it via robots.txt. Anyway, robots.txt was meant to prevent crawlers from visiting a site, but we're talking about displaying already-crawled data. Do you have a specific example site?
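
To make that distinction concrete, here is a minimal sketch (not how the Wayback Machine is actually implemented): a squatter's robots.txt typically blanket-blocks everything, and the standard robots.txt check is something a crawler consults at fetch time; it says nothing about replaying pages that were archived earlier.

    # Sketch only: how a crawler consults robots.txt at fetch time.
    # The blanket rule below is typical of what parked/squatted domains serve.
    from urllib import robotparser

    SQUATTER_ROBOTS_LINES = [
        "User-agent: *",
        "Disallow: /",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(SQUATTER_ROBOTS_LINES)

    # A well-behaved crawler asks this before fetching anything new...
    print(rp.can_fetch("ia_archiver", "http://example.com/old-page.html"))  # False

    # ...but nothing in the robots.txt protocol itself addresses whether pages
    # crawled *before* the rule existed may still be displayed.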

Poster: rin-q Date: Jan 24, 2015 6:23pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Stumbled upon this discussion while searching for an old encyclopedia of Japanese folklore monsters whose domain hasn't been renewed, and for a way to gain access to its older entries from before the website disappeared.

So the domain has been bought by a reseller, and since a robots.txt file has been added, none of the information that was available two years ago can be reached via the Wayback Machine.

So a good example website would be obakemono dot com.

A big loss for those interested in Japanese folklore, sadly.

Poster: PiRSquared Date: Mar 12, 2018 10:39am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

At least partially saved here: https://archive.today/www.obakemono.com
This post was modified by PiRSquared on 2018-03-12 17:39:04

Poster: billybiscuits Date: Oct 22, 2017 2:47pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

This doesn't work for sites without a "www."

Poster: rin-q Date: Jan 27, 2015 7:06pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Interesting. Domain squatters and the like can already be such a pain when doing research on the Web... That the biggest Internet « archive » prevents access to already archived pages on the basis that the current domain owner has put up a no-crawler policy, without doing any kind of check, isn't exactly great. Couldn't there be, at least, a check against current and past domain record data? While such records can be anonymized (to a certain extent, at least), a check like that could help determine whether or not the robots.txt should be ignored. If anything, I guess this trick can serve as a way to access those already crawled pages while the issue gets sorted out. Thanks a bunch!
This post was modified by rin-q on 2015-01-28 03:06:38
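
A rough sketch of the kind of check proposed above, assuming some source of historical whois data exists (the get_whois_history helper below is hypothetical, not a real Internet Archive or DomainTools API):

    # Sketch of the proposed ownership check. get_whois_history() is a
    # hypothetical helper returning [(date, registrant), ...] in chronological
    # order; no real whois-history API is assumed here.

    def ownership_changed_since(domain, snapshot_date, get_whois_history):
        """Return True if the registrant on record appears to differ between
        the time the snapshot was archived and today."""
        history = get_whois_history(domain)
        past = [registrant for d, registrant in history if d <= snapshot_date]
        if not history or not past:
            return False  # no data: fall back to honouring robots.txt
        return past[-1] != history[-1][1]

    def should_honour_current_robots_txt(domain, snapshot_date, get_whois_history):
        # If the domain has changed hands since the page was archived, the new
        # owner's robots.txt arguably shouldn't hide the old owner's pages.
        return not ownership_changed_since(domain, snapshot_date, get_whois_history)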

Poster: PiRSquared Date: Jan 27, 2015 7:33pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Is whois data archived?

Poster: rin-q Date: Jan 28, 2015 10:03am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Well, I am aware of at least one service that, while I haven't personally tried it, provides whois record history: DomainTools. They claim to have archived whois records since 1995, and one can gain access to them for a monthly fee. Now, I wouldn't know whether the Internet Archive keeps such records (I can only hope so), but another way to decide, at least partially, whether or not to respect the robots.txt would be to ignore it at first and do a sort of integrity check between the last archived content and the current one. If the content is too different, then the robots.txt file should be ignored for already archived content, but not for newer content. Obviously, this probably wouldn't work in every case, but it would still be a better way to go, if you ask me. Or the robots.txt file could simply prevent new crawls while still allowing visitors access to already crawled content.

The current situation feels like a library making a book consultation-only, then erasing past borrowers' memories of the book because of the new consultation-only policy. I mean, the data is still there (as you've shown me earlier), so why not just allow access to it?
This post was modified by rin-q on 2015-01-28 18:00:57
This post was modified by rin-q on 2015-01-28 18:03:35
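
And a rough sketch of the integrity-check idea described above (purely illustrative: the fetch helpers are hypothetical and the 0.3 similarity threshold is an arbitrary choice, not anything the Wayback Machine actually does):

    # Sketch of the content-similarity check. fetch_live() and
    # fetch_last_snapshot() are hypothetical helpers returning page text.
    from difflib import SequenceMatcher

    SIMILARITY_THRESHOLD = 0.3  # arbitrary illustrative cut-off

    def robots_txt_applies_to_archive(domain, fetch_live, fetch_last_snapshot):
        """Honour the current robots.txt for archived snapshots only if the
        live site still resembles what was archived."""
        live = fetch_live(domain)
        archived = fetch_last_snapshot(domain)
        similarity = SequenceMatcher(None, archived, live).ratio()
        # A parked/squatted page will look nothing like the original site, so
        # very low similarity suggests the domain has changed hands.
        return similarity >= SIMILARITY_THRESHOLD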

Poster: d0c5i5 Date: Feb 21, 2015 1:26pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I'm glad to see this discussion is ongoing. I'm going to write a script to replace my archive look-ups with that little trick, so I can access those records with a click if I come across them.

As for how this should be handled: imho, robots.txt should only be honored at crawl time. Period. (Especially if the site didn't have a robots.txt back on the date it was crawled.)

If someone wants to remove OLD data for a domain they own now AND owned in the past, then they should do the legwork. Archive.org could offer a service where, if you provide specific proof of ownership, possibly a legitimate claim for why the data should be removed, and perhaps a fee to pay a trusted third party to evaluate your request, then and only then should they consider removing the records.

I just think about this and fast-forward 50 years, and the amount of both unintentional and intentional censorship that will have happened by then makes me sad. I know we are moving into the future, but I think archive.org is one of the shining examples of why the past matters, and it shouldn't be wiped away without a reason.

my 2c,
d0c