List of data catalogs
Let Google Search work for you!
Take a look at some more datasets here: Datasets to be included
Where media types are listed, the following legend applies:
a – audio
g – graphs
t – text (all kinds)
p – images
v – video
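The legend above can be applied mechanically when processing the tables on this page. The following is a minimal sketch, not part of the original list: the names `MEDIA_TYPES` and `decode_media_types` are hypothetical, and treating characters outside the legend (such as `?`) as "unknown" is an assumption.

```python
from typing import List

# Legend from this page: single-letter code -> media type.
MEDIA_TYPES = {
    "a": "audio",
    "g": "graphs",
    "t": "text",
    "p": "images",
    "v": "video",
}


def decode_media_types(codes: str) -> List[str]:
    """Expand a media-types cell such as "agtv" into readable names.

    Characters that are not in the legend (for example "?") are reported
    as "unknown" once; this handling is an assumption, not part of the page.
    """
    names: List[str] = []
    for ch in codes.strip():
        name = MEDIA_TYPES.get(ch, "unknown")
        if name not in names:  # keep first occurrence, preserve order
            names.append(name)
    return names
```

For example, `decode_media_types("agtv")` expands to audio, graphs, text, and video, in that order.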
Color code of catalog quality: unknown, low, medium, high
(Low means you should not blindly rely on the correctness of the data; it says nothing about the data format.) The ranking is based on the subjective assessment of Noam Cohen.
| Name | link | number of datasets | media types | comment |
| Open data | index.okfn.org | 15 topics | t | |
| Technion | library.technion.ac.il/he/libraries-worldwide/ | |||
| Semantic Scholar | api.semanticscholar.org/ | 1 | g | Downloadable archive of metadata for scientific papers. 200M scientific papers |
| UCI machine learning repository | https://archive.ics.uci.edu/ml//index.php | 600 | pt | |
| Kaggle | link | | ??pt? | Good-quality datasets, since they are verified by Kaggle. |
| GitHub | github.com/datasets/awesome-data | | | A list of interesting datasets, particularly the “awesome-data” list. The data is accessed through https://datahub.io/collections, which requires sign-up and might be broken. |
| FiveThirtyEight | https://github.com/fivethirtyeight | hundreds | t | News and sports platform; open data related to the USA |
| AWS public Datasets | https://registry.opendata.aws/ | 300 | ??pt | high quality datasets, mainly from large organizations and governments. |
| Data.world | data.world/datasets/open-data | 134K | | “The social network for data professionals.” |
| Buzzfeed News | https://github.com/BuzzFeedNews/everything | 5 | t | |
| Google archive of datasets | www.tensorflow.org/datasets/catalog/overview | hundreds | agtv | high quality datasets |
| Academic Torrents | https://academictorrents.com/browse.php | 2400 | agtpv | For sharing datasets from scientific papers. A distributed system for sharing big datasets among researchers. |
| Our World in Data | https://ourworldindata.org https://github.com/owid/owid-datasets | hundreds | t | Small datasets, mainly time series |
| NY city data | opendata.cityofnewyork.us/ https://data.cityofnewyork.us/browse | 3500 | | Anything the city thinks is worth keeping. Updated daily. |
| HuggingFace | huggingface.co/datasets | 5000 | t | |
| MIMIC-III | https://www.nature.com/articles/sdata201635 | 50000 | tp | A large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers[…] |
| Zenodo.org | https://zenodo.org/ | 8000 | | Zenodo is a general-purpose open repository. It allows researchers to deposit research papers, data sets, research software, reports, and any other research-related digital artefacts. |
| arxiv.org | local arxiv page | 1 | t | arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science[…] |
Government catalogs
| State | link | number of data sets | media types | comment |
| UK | https://en.wikipedia.org/wiki/Data.gov.uk | |||
| USA | https://en.wikipedia.org/wiki/Data.gov | |||
| France | www.data.gouv.fr/fr/datasets/ | 37K | ||
| India | data.gov.in | 10K catalogs | ||
| Israel (CBS, Central Bureau of Statistics) | https://www.cbs.gov.il/he/Statistics/Pages | |||