===Website exclusion policy===
Historically, the Wayback Machine respected the [[robots exclusion standard]] (robots.txt) in determining whether a website would be crawled or, if already crawled, whether its archives would be publicly viewable. Website owners could opt out of the Wayback Machine through robots.txt. The rules were applied retroactively: if a site blocked the Internet Archive, any previously archived pages from the domain were immediately rendered unavailable as well. The Internet Archive also stated that "Sometimes, a website owner will contact us directly and ask us to stop crawling or archiving a site. We comply with these requests."<ref>{{cite web|url=https://web.archive.org/collections/web/faqs.html#exclusions |title=FAQs – Some sites are not available because of Robots.txt or other exclusions. What does that mean? |website=Internet Archive Wayback Machine |archive-url=https://web.archive.org/web/20110415130934/https://web.archive.org/collections/web/faqs.html#exclusions |archive-date=April 15, 2011}}</ref> Its website further says: "The Internet Archive is not interested in preserving or offering access to Web sites or other internet documents of persons who do not want their materials in the collection."<ref>{{cite web|url=https://www.archive.org/about/faqs.php#2 |title= Frequently Asked Questions |website=Internet Archive |archive-url=https://web.archive.org/web/20140417122600/https://archive.org/about/faqs.php |archive-date=April 17, 2014|url-status=dead}}</ref><ref>{{cite news |url=https://motherboard.vice.com/en_us/article/nekzzq/wayback-machine-deleting-evidence-flexispy |website=Vice |title=The Wayback Machine Is Deleting Evidence of Malware Sold to Stalkers |last=Cox |first=Joseph |date=May 22, 2018 |access-date=May 23, 2018 |archive-url=https://archive.today/20180522192132/https://motherboard.vice.com/en_us/article/nekzzq/wayback-machine-deleting-evidence-flexispy |archive-date=May 22, 2018 
|url-status=live}}{{cbignore}}</ref> On April 17, 2017, reports surfaced of sites that had gone defunct and become [[parked domain]]s and were using robots.txt to exclude themselves from search engines, which inadvertently excluded them from the Wayback Machine as well.<ref>{{cite web |title=Robots.txt meant for search engines don't work well for web archives |url=https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ |website=Internet Archive |date=April 17, 2017 |access-date=June 29, 2019}}</ref> Following this, the Internet Archive changed its policy to require an explicit exclusion request to remove sites from the Wayback Machine.<ref name="Using" />

====The Oakland Archive Policy====
Wayback's retroactive exclusion policy is based in part upon ''Recommendations for Managing Removal Requests and Preserving Archival Integrity'', known as ''The Oakland Archive Policy'', published by the School of Information Management and Systems at [[University of California, Berkeley]] in 2002, which gives a website owner the right to block access to the site's archives.<ref>{{cite web |title=Recommendations for Managing Removal Requests And Preserving Archival Integrity |date=December 14, 2002 |publisher=[[University of California]] |url=http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html |access-date=October 20, 2024 |url-status=dead |archive-url=https://web.archive.org/web/20030502165937/http://sims.berkeley.edu/research/conferences/aps/removal-policy.html |archive-date=May 2, 2003}}</ref> Wayback has complied with this policy to help avoid expensive litigation.<ref>{{cite web |title=Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy |date=July 7, 2014 |publisher=Internet Archive |url=https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy |access-date=September 14, 2017 |url-status=live 
|archive-url=https://web.archive.org/web/20171010124036/https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy |archive-date=October 10, 2017 }}</ref> The Wayback retroactive exclusion policy began to relax in 2017, when it stopped honoring robots.txt on U.S. government and military websites for both crawling and displaying web pages. As of April 2017, Wayback ignores robots.txt more broadly, not just for U.S. government websites.<ref>{{cite web |url=http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ |title=Robots.txt meant for search engines don't work well for web archives |work=Internet Archive Blogs |first=Mark |last=Graham |date=April 17, 2017 |access-date=April 16, 2017 |url-status=live |archive-url=https://web.archive.org/web/20170417131508/http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ |archive-date=April 17, 2017}}</ref><ref>{{cite web |title=Archivierung des Internets: Internet Archive ignoriert künftig robots.txt |date=April 25, 2017 |url=https://www.heise.de/newsticker/meldung/Archivierung-des-Internets-Internet-Archive-ignoriert-kuenftig-robots-txt-3693558.html |publisher=heise online |access-date=May 14, 2017 |language=de |url-status=live |archive-url=https://web.archive.org/web/20170427035659/https://www.heise.de/newsticker/meldung/Archivierung-des-Internets-Internet-Archive-ignoriert-kuenftig-robots-txt-3693558.html |archive-date=April 27, 2017}}</ref><ref>{{cite web |title=Suchmaschinen: Internet Archive will künftig Robots.txt-Einträge ignorieren – Golem.de |url=https://www.golem.de/news/suchmaschinen-internet-archive-will-kuenftig-robots-txt-eintraege-ignorieren-1704-127446.html |access-date=May 14, 2017 |language=de |url-status=live 
|archive-url=https://web.archive.org/web/20170619210648/https://www.golem.de/news/suchmaschinen-internet-archive-will-kuenftig-robots-txt-eintraege-ignorieren-1704-127446.html |archive-date=June 19, 2017}}</ref><ref>{{cite news |title=Internet Archive will ignore robots.txt files to keep historical record accurate |url=https://www.digitaltrends.com/computing/internet-archive-robots-txt/ |newspaper=Digital Trends |access-date=May 14, 2017 |date=April 24, 2017 |url-status=live |archive-url=https://web.archive.org/web/20170516130029/https://www.digitaltrends.com/computing/internet-archive-robots-txt/ |archive-date=May 16, 2017}}</ref>
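
The robots.txt opt-out discussed in this section can be illustrated with a minimal sketch of how a crawler that honors the robots exclusion standard would evaluate a site's rules. This is not the Wayback Machine's actual implementation. The user-agent token <code>ia_archiver</code>, the <code>example.com</code> URLs, the sample file contents, and the helper name <code>is_excluded</code> are all assumptions chosen for illustration; only Python's standard <code>urllib.robotparser</code> module is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration):
# the archive crawler is blocked from the whole site, while all other
# crawlers are blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_excluded(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file as an iterable of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "https://example.com/index.html"),
        ("OtherBot", "https://example.com/index.html"),
        ("OtherBot", "https://example.com/private/notes.html"),
    ]:
        print(agent, url, "excluded:", is_excluded(ROBOTS_TXT, agent, url))
```

Under a retroactive policy like the one described above, a positive result for the archive's crawler would hide previously captured pages as well as stop new crawls. Under the post-2017 policy, such a check would no longer be decisive, and an explicit exclusion request would be needed instead.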