Internet Archive Research Publication Crawls

Internet Archive Web Group

A series of open web crawls targeting journal articles, technical memos, essays, datasets, and other research publications.



21,170 RESULTS


OAI-PMH-CRAWL-2020-06
collection
2,946 ITEMS
5.4M VIEWS
by Internet Archive Web Group

MSAG-PDF-CRAWL-2017
collection
1,855 ITEMS
12.2M VIEWS
by Internet Archive Web Group

Microsoft Academic Graph public corpus (Feb 2016) PDF URLs, filtered to remove large sites (pubmed, citeseerx, arxiv) and already-crawled URLs.
Topics: papers, journals
UNPAYWALL-PDF-CRAWL-2018-07
collection
1,241 ITEMS
15M VIEWS
by Internet Archive Web Group

Web archive data from a crawl of open access PDF URLs provided by Unpaywall.
OA-JOURNAL-CRAWL-2020-07
collection
1,923 ITEMS
10.1M VIEWS
by Internet Archive Web Group

Open Access Journal Test Crawl (2018)
collection
794 ITEMS
11.1M VIEWS
by Internet Archive Web Group

UNPAYWALL-PDF-CRAWL-2019-04
collection
641 ITEMS
5.6M VIEWS
by Internet Archive Web Group

DIRECT-OA-CRAWL-2019
collection
2,566 ITEMS
5.3M VIEWS
by Internet Archive Web Group

MAG-PDF-CRAWL-2020-03
collection
489 ITEMS
3.9M VIEWS
by Internet Archive Web Group

CORE-UPSTREAM-CRAWL-2018-11
collection
741 ITEMS
1.7M VIEWS
by Internet Archive Web Group

Crawl of "upstream" URLs from CORE (core.ac.uk) metadata dump. Only a partial seedlist of files crawled.
DATACITE-DOI-CRAWL-2020-01
collection
1,417 ITEMS
3.8M VIEWS
by Internet Archive Web Group

OA-DOI-CRAWL-2020-02
collection
278 ITEMS
3.3M VIEWS
by Internet Archive Web Group

UNPAYWALL-PDF-CRAWL-2020-03
collection
344 ITEMS
1.9M VIEWS
by Internet Archive Web Group

JOURNALS-PATCH-CRAWL-2022-01
collection
104 ITEMS
723,165 VIEWS

MAG-PDF-CRAWL-2021-08
collection
189 ITEMS
762,930 VIEWS

MAG-PDF-CRAWL-2020-07
collection
196 ITEMS
1.7M VIEWS
by Internet Archive Web Group

UNPAYWALL-PDF-CRAWL-2021-07
collection
174 ITEMS
1M VIEWS

UNPAYWALL-PDF-CRAWL-2020-11
collection
199 ITEMS
1.7M VIEWS
by Internet Archive Web Group

Wide Web Targeted PDF Crawling (2017)
collection
922 ITEMS
3.1M VIEWS
by Internet Archive Web Group

DOI-LANDING-CRAWL-2018-06
collection
279 ITEMS
3.3M VIEWS
by Internet Archive Web Group

UNPAYWALL-PDF-CRAWL-2020-05
collection
282 ITEMS
1.7M VIEWS
by Internet Archive Web Group

SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
collection
1,011 ITEMS
1.4M VIEWS
by Internet Archive Web Group

OA-DOI-CRAWL-2020-12
collection
191 ITEMS
1.5M VIEWS
by Internet Archive Web Group

PLATFORM-CRAWL-2020
collection
649 ITEMS
460,547 VIEWS
by Internet Archive Web Group

OA-JOURNAL-CRAWL-2019-08
collection
201 ITEMS
2.8M VIEWS
by Internet Archive Web Group

TARGETED-ARTICLE-CRAWL-2022-04
collection
219 ITEMS
268,235 VIEWS

collection
1.9M VIEWS

IA crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
UNPAYWALL-PDF-CRAWL-2021-05
collection
123 ITEMS
906,382 VIEWS
by Internet Archive Web Group

CiteSeerX URL Crawl 2017
collection
207 ITEMS
1.2M VIEWS

A targeted crawl to fetch research publications from the public web which have been crawled by CiteSeerX but have not previously been crawled by the Internet Archive.
Topics: scholarly, papers, journal
OAI-PMH-PATCH-CRAWL-2021-12
collection
75 ITEMS
334,122 VIEWS

DOAJ-CRAWL-2020-11
collection
102 ITEMS
903,106 VIEWS
by Internet Archive Web Group

DOI-CRAWL-2022-02
collection
25 ITEMS
204,577 VIEWS

JOURNAL-HOMEPAGE-CRAWL-2022-03
collection
44 ITEMS
252,591 VIEWS

PubMed Central Crawl (2019-10)
collection
216 ITEMS
431,481 VIEWS
by Internet Archive Web Group

PUBMEDCENTRAL-CRAWL-2020-02
collection
108 ITEMS
249,540 VIEWS
by Internet Archive Web Group

arXiv Content Crawl (2019-10)
collection
37 ITEMS
72,947 VIEWS
by Internet Archive Web Group

ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
collection
60 ITEMS
108,871 VIEWS
by Internet Archive Web Group

TARGETED-ARTICLE-CRAWL-2022-03
collection
9 ITEMS
51,601 VIEWS

SCIELO-CRAWL-2020-07
collection
41 ITEMS
195,586 VIEWS
by Internet Archive Web Group

UNPAYWALL-PDF-CRAWL-2022-04
collection
38 ITEMS
17,825 VIEWS

JOURNALS-PATCH-CRAWL-2022-01
web

eye 19,665

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Wed Feb 2 04:32:21 PST 2022 to Wed Feb 2 06:24:58 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 18,075

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Wed Feb 9 06:43:52 PST 2022 to Wed Feb 9 06:06:53 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 22,245

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Fri Mar 4 08:19:11 PST 2022 to Tue Mar 8 18:29:43 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 21,121

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Sun Jan 16 16:05:54 PST 2022 to Sun Jan 16 16:33:31 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 14,005

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Wed Feb 9 12:34:39 PST 2022 to Wed Feb 9 13:13:37 PST 2022.
Topic: crawldata
DOAJ-CRAWL-2020-11
web

eye 93,549

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 17:59:21 PST 2020 to Tue Nov 24 11:43:19 PST 2020.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 23,169

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Wed Feb 23 02:01:38 PST 2022 to Wed Feb 23 15:48:40 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 13,164

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Sat Feb 26 14:02:15 PST 2022 to Sun Feb 27 05:47:42 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 11,724

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Wed Mar 2 07:41:16 PST 2022 to Thu Mar 3 05:41:51 PST 2022.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2019-04
web

eye 15,102

favorite 0

comment 0

Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:UNPAYWALL-PDF-CRAWL-2019-04 from Sun Apr 28 11:25:03 PDT 2019 to Sun Apr 28 05:46:48 PDT 2019.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web

eye 47,802

favorite 0

comment 0

Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Sun Jul 29 09:54:12 PDT 2018 to Sun Jul 29 04:01:42 PDT 2018.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 11,026

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Thu Feb 24 14:01:46 PST 2022 to Fri Feb 25 11:54:58 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 11,392

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Sun Feb 27 13:18:39 PST 2022 to Mon Feb 28 05:15:19 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 11,162

favorite 1

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Fri Feb 25 14:02:24 PST 2022 to Sat Feb 26 06:00:57 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 11,369

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Tue Mar 8 20:50:17 PST 2022 to Wed Mar 9 18:29:43 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 11,209

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Wed Feb 9 19:49:10 PST 2022 to Wed Feb 9 17:48:49 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 10,257

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Tue Mar 1 07:52:41 PST 2022 to Wed Mar 2 05:33:50 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 15,424

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Sun Jan 16 23:09:54 PST 2022 to Sun Jan 16 23:27:17 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 15,398

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Wed Feb 23 18:50:55 PST 2022 to Thu Feb 24 11:23:51 PST 2022.
Topic: crawldata
OA-DOI-CRAWL-2020-12
web

eye 35,952

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:OA-DOI-CRAWL-2020-12 from Wed Dec 9 22:59:12 PST 2020 to Wed Dec 9 15:45:33 PST 2020.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 10,532

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Sat Feb 5 14:19:42 PST 2022 to Sat Feb 5 15:31:51 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 9,992

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Thu Mar 3 07:55:41 PST 2022 to Fri Mar 4 06:00:57 PST 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 9,205

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Fri Mar 11 02:31:16 PST 2022 to Sun Mar 13 07:29:43 PDT 2022.
Topic: crawldata
TARGETED-ARTICLE-CRAWL-2022-03
web

eye 21,516

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:TARGETED-ARTICLE-CRAWL-2022-03 from Sat Mar 12 03:03:34 PST 2022 to Sat Mar 12 19:02:19 PST 2022.
Topic: crawldata
JOURNAL-HOMEPAGE-CRAWL-2022-03
web

eye 11,675

favorite 0

comment 0

Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc279.us.archive.org:JOURNAL-HOMEPAGE-CRAWL-2022-03 from Wed Mar 30 20:42:30 PDT 2022 to Thu Mar 31 18:02:39 PDT 2022.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 10,008

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Mon Feb 28 07:53:39 PST 2022 to Tue Mar 1 05:36:50 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 12,215

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Sun Jan 16 08:47:48 PST 2022 to Sun Jan 16 09:46:08 PST 2022.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web

eye 79,064

favorite 0

comment 0

Internet Archive crawldata of open access journal content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Sun Jul 29 09:53:16 PDT 2018 to Sun Jul 29 04:27:27 PDT 2018.
Topic: crawldata
DOI-CRAWL-2022-02
web

eye 9,019

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:DOI-CRAWL-2022-02 from Wed Mar 9 20:55:55 PST 2022 to Fri Mar 11 02:34:43 PST 2022.
Topic: crawldata
JOURNAL-HOMEPAGE-CRAWL-2022-03
web

eye 14,420

favorite 0

comment 0

Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc279.us.archive.org:JOURNAL-HOMEPAGE-CRAWL-2022-03 from Thu Mar 31 20:37:17 PDT 2022 to Fri Apr 1 13:06:40 PDT 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 8,306

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Fri Jan 28 18:32:02 PST 2022 to Fri Jan 28 18:24:26 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 10,426

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Fri Feb 4 02:18:39 PST 2022 to Fri Feb 4 01:48:51 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 10,280

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Tue Feb 1 19:10:45 PST 2022 to Tue Feb 1 22:31:42 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 9,802

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Thu Feb 3 04:12:12 PST 2022 to Thu Feb 3 03:46:54 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 8,962

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Sat Feb 5 06:11:43 PST 2022 to Sat Feb 5 08:08:47 PST 2022.
Topic: crawldata
JOURNALS-PATCH-CRAWL-2022-01
web

eye 9,202

favorite 0

comment 0

Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:JOURNALS-PATCH-CRAWL-2022-01 from Sat Feb 5 21:54:41 PST 2022 to Sat Feb 5 22:00:45 PST 2022.
Topic: crawldata