Large-Scale Data Systems (1): Dataset
To explore large-scale data systems, from storage to processing and from interface to implementation, I'll be writing a series of articles.
This is the first one -- Dataset.
Logically, we treat all the data to be processed as a dataset.
Representation
A dataset can be represented in a variety of ways:
Tuples
Nodes and Edges
KV pairs
Documents
Files
Objects
Messages
Logs
Each representation corresponds to a kind of storage system, listed here in the same order:
Relational Database
Graph Database
KV Store
Document Storage
File System
Object Store
Message Queue
Log System
Each type of storage exposes its own user interface and suits particular use cases. But at a high level, all of them can be thought of as datasets made up of some fundamental data component.
Throughout this series, we refer to this data component uniformly as an object.
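To make the idea concrete, here is a minimal Python sketch of a dataset as a plain collection of objects. The names DataObject and Dataset, and the key/payload fields, are assumptions made for illustration only; they do not come from any particular storage system.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass(frozen=True)
class DataObject:
    """Hypothetical uniform wrapper for one fundamental data component."""
    key: str      # identifier within the dataset (an assumption of this sketch)
    payload: Any  # the underlying tuple, KV value, document, message, and so on


class Dataset:
    """A logical dataset: simply a collection of objects."""

    def __init__(self, objects: Iterable[DataObject]) -> None:
        self._objects = list(objects)

    def __iter__(self) -> Iterator[DataObject]:
        return iter(self._objects)

    def __len__(self) -> int:
        return len(self._objects)


# A relational tuple, a KV pair, and a log line, all viewed as objects.
dataset = Dataset([
    DataObject("user:1", ("alice", 30)),
    DataObject("config:timeout", "30s"),
    DataObject("log:0001", "service started"),
])
```

The point of the wrapper is only that code processing the dataset does not need to know which kind of storage each object came from.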
Partition
To distribute the logical dataset across different machines, we first split it into several parts, which are called shards, partitions, or splits. This yields a three-level abstraction:
Dataset - Partition - Object
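As a rough illustration of the three levels, the sketch below splits a small dataset into partitions by hashing each object's key. Hash partitioning is just one possible strategy, chosen here as an assumption; the (key, payload) object shape and the names partition_of and split are likewise hypothetical.

```python
import zlib
from typing import Any, Dict, List, Tuple

# Hypothetical object shape for this sketch: (key, payload).
Obj = Tuple[str, Any]


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split(dataset: List[Obj], num_partitions: int) -> Dict[int, List[Obj]]:
    """Split a dataset into num_partitions partitions by hashing keys."""
    partitions: Dict[int, List[Obj]] = {i: [] for i in range(num_partitions)}
    for obj in dataset:
        partitions[partition_of(obj[0], num_partitions)].append(obj)
    return partitions


# Dataset level: the full logical collection of objects.
dataset: List[Obj] = [(f"user:{i}", {"id": i}) for i in range(10)]

# Partition level: each partition could be placed on a different machine.
partitions = split(dataset, num_partitions=3)

# Object level: every object lives in exactly one partition.
for index, objects in partitions.items():
    print(index, [key for key, _ in objects])
```

CRC32 is used instead of Python's built-in hash() because string hashes from hash() are randomized per process by default, so the same key could land in different partitions on different machines.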