arXiv data snapshot – Teaching Lab

The arXiv data is an open dataset containing PDF files of academic papers that have not been peer reviewed. In principle it can be accessed directly, either as a bulk download or by requesting a specific PDF file by URL.
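As an illustration of the per-paper URL access mentioned above, here is a minimal Python sketch. It assumes the public `https://arxiv.org/pdf/<id>` address pattern, which is not stated in this document; the function names `arxiv_pdf_url` and `download_pdf` are hypothetical helpers, not part of any arXiv tooling.

```python
import shutil
import urllib.request
from urllib.parse import quote

# Assumption: the public arXiv PDF address pattern (not taken from this document).
ARXIV_PDF_BASE = "https://arxiv.org/pdf/"


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the direct URL for one paper's PDF from its arXiv identifier."""
    arxiv_id = arxiv_id.strip()
    if not arxiv_id:
        raise ValueError("arxiv_id must not be empty")
    # Old-style identifiers contain a slash (e.g. "hep-th/9901001"); keep it unescaped.
    return ARXIV_PDF_BASE + quote(arxiv_id, safe="/")


def download_pdf(arxiv_id: str, target_path: str) -> None:
    """Fetch a single PDF over HTTPS and write it to target_path."""
    with urllib.request.urlopen(arxiv_pdf_url(arxiv_id)) as response:
        with open(target_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
```

For example, `download_pdf("2301.00001", "paper.pdf")` would save one paper locally. This suits fetching a handful of files; for large volumes the snapshot described below is the better route.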

As a backup, I created a snapshot in 2023-09 containing 1 TB of PDF files (about 3/4 of the total space used by the dataset) and stored it in Azure storage.

A metadata JSON file contains information about each paper. It can be accessed from the same storage account.
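The layout of the metadata file is not described here, so the following Python sketch rests on two labeled assumptions: that the file holds one JSON object per line (JSON Lines), and that each record carries an `id` field with the arXiv identifier. The helper names `iter_metadata` and `find_paper` are hypothetical. Adjust the sketch if the real file is a single JSON array or uses different field names.

```python
import json
from typing import Iterator, Optional


def iter_metadata(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line.

    Assumption: the file is JSON Lines (one JSON object per line).
    Reading line by line avoids loading the whole file into memory.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_paper(path: str, arxiv_id: str) -> Optional[dict]:
    """Return the first record whose "id" field matches, or None.

    Assumption: records have an "id" field holding the arXiv identifier.
    """
    for record in iter_metadata(path):
        if record.get("id") == arxiv_id:
            return record
    return None
```

A call such as `find_paper("metadata.json", "2301.00001")` scans the file once and stops at the first match; the file name here is a placeholder.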

Accessing the data

The data is stored in an Azure storage account. To access it, connect to a virtual machine in the same region as the storage account and mount the storage.

Mounting Instructions

[TBD]