What exactly do you publish, how do you do it, and what can I do with it?
The part of the LOD Laundromat we use for crawling is called the LOD WashingMachine, and is freely available on GitHub. The LOD Laundromat GitHub organization contains other interesting repositories as well, ranging from (fast) example parser implementations to examples of how you might use the LOD Laundromat to analyze many sets of Linked Data.
- Beek, W. & Rietveld, L. & Bazoobandi, H.R. & Wielemaker, J. & Schlobach, S.: LOD Laundromat: A Uniform Way of Publishing Other People’s Dirty Data. Proceedings of the International Semantic Web Conference (2014).
It is widely accepted that proper data publishing is difficult. The majority of Linked Open Data (LOD) does not meet even a core set of data publishing guidelines. Moreover, datasets that are clean at creation can get stains over time. As a result, the LOD cloud now contains a high level of dirty data that is difficult for humans to clean and for machines to process.
Existing solutions for cleaning data (standards, guidelines, tools) are targeted towards human data creators, who can (and do) choose not to use them. This paper presents the LOD Laundromat, which removes stains from data without any human intervention. This fully automated approach is able to make very large amounts of LOD more easily available for further processing right now.
The LOD Laundromat is not a new dataset, but rather a uniform point of entry to a collection of cleaned siblings of existing datasets. It provides researchers and application developers with a wealth of data that is guaranteed to conform to a specified set of best practices, thereby greatly improving the chance of data actually being (re)used.
What exactly do you publish?
We publish the crawled data as dump files via the wardrobe, either as gzipped, sorted N-Triples and N-Quads, or as indexed and compressed HDT files. The wardrobe provides access to Triple Pattern Fragment APIs as well. The provenance and VoID metadata is accessible via our SPARQL endpoint.
Why do you publish this?
Finding and using Linked Data takes time and effort. We are not yet at a point where all available Linked Data is clean, standard and easy to use. Many datasets contain syntax errors or duplicates, or are difficult to find. We offer one single download location for Linked Data, and publish the Linked Data in a consistent, simple (sorted) N-Triples format, making it easy to use and compare datasets.
Why Gzipped N-Triples / N-Quads?
Compared to hosting the whole LOD cloud via a SPARQL endpoint, hosting gzipped N-Triples / N-Quads is easy and doable.
How do I use these files?
That depends on what you would like to do:
Host as SPARQL endpoint: Simply download the data and unzip it.
Host as Triple Pattern Fragment: Download the HDT file of a dataset, and use that as the TPF backend storage type.
Analyze the complete dataset: If you would like to analyze the triples using your own code, make sure you take advantage of the way we publish the data.
- The data is gzipped: you do not have to unpack everything to analyze it. Instead, you can stream it! This saves a lot of memory.
- The data is in one single N-Triples / N-Quads format: using this knowledge, you can easily write your own super-fast parser, as you don’t have to consider the complete scope of the N-Triples / N-Quads specification. And even better, we provide a couple of parser implementations for you (in Node.js, Java, and Python)!
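The two points above can be combined in a few lines of Python. The following is a minimal sketch, not one of the parser implementations mentioned above: it streams a gzipped dump line by line and parses each statement with a single regular expression. The regular expression assumes a simplified, one-statement-per-line layout (no comments, no blank lines), and the helper names `stream_statements` and `predicate_counts` are made up for illustration.

```python
import gzip
import io
import re
from collections import Counter

# Assumption: every line holds exactly one statement in a simplified
# N-Triples / N-Quads layout. This is NOT a full parser for either spec.
LINE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'                       # subject: IRI or blank node
    r'(<[^>]*>)\s+'                              # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"'          # object: IRI, blank node, or literal
    r'(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'          # optional language tag or datatype
    r'(?:\s+(<[^>]*>|_:\S+))?'                   # optional graph label (N-Quads)
    r'\s*\.\s*$'                                 # terminating dot
)


def stream_statements(source):
    """Yield (subject, predicate, object, graph) tuples from a gzipped dump.

    `source` is a file path or a binary file object. Lines are decompressed
    and read one at a time, so the whole dump is never held in memory.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            match = LINE.match(line)
            if match:  # lines that do not fit the simplified layout are skipped
                yield match.groups()


def predicate_counts(source):
    """Count how often each predicate occurs in the dump."""
    return Counter(p for _, p, _, _ in stream_statements(source))


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a downloaded dump file.
    sample = (
        b'<http://example.com/s> <http://example.com/pred> "hello"@en .\n'
        b'<http://example.com/s> <http://example.com/other> _:b1 .\n'
    )
    print(predicate_counts(io.BytesIO(gzip.compress(sample))))
```

For a real download, pass the path of the gzipped file instead of the in-memory sample, e.g. `predicate_counts("dataset.nt.gz")` (a hypothetical file name).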
Use/analyze a part of the dataset: Depending on what part you would like to use, we suggest you use the Triple Pattern Fragment API. Via this API you are able to, for example, select only those triples where the predicate is <http://example.com/pred>. If you encounter latency issues, or if you simply prefer to analyze your dataset locally on your own computer, then consider downloading the HDT file and issuing triple-pattern queries on the file directly from the command line.
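As a rough illustration of the predicate selection described above, the sketch below builds the request URL for a triple pattern. It assumes the query parameter names `subject`, `predicate` and `object`, which many Triple Pattern Fragment servers use by convention; strictly speaking a client should read the parameter names from the hypermedia form in the server's response. The base URL `http://tpf.example.org/dataset` and the helper name `tpf_url` are hypothetical placeholders.

```python
from urllib.parse import urlencode


def tpf_url(fragment_base, subject=None, predicate=None, obj=None):
    """Build a triple-pattern request URL for one dataset.

    Assumption: the server accepts the conventional `subject`, `predicate`
    and `object` query parameters. Unset positions act as wildcards and
    are simply left out of the query string.
    """
    pattern = {"subject": subject, "predicate": predicate, "object": obj}
    params = {name: value for name, value in pattern.items() if value}
    if not params:
        return fragment_base
    return f"{fragment_base}?{urlencode(params)}"


# Select only the triples whose predicate is <http://example.com/pred>.
url = tpf_url("http://tpf.example.org/dataset",
              predicate="http://example.com/pred")
print(url)
```

The resulting URL can be fetched with any HTTP client. Responses are paged, so a complete client would also follow the next-page links that the server includes in each fragment.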