Large-Scale Data Systems (1): DataSet

I'll be writing a series of articles that explore large-scale data systems, from storage to processing and from interface to implementation.

This is the first one: the dataset.

Logically, we treat all the data to be processed as a dataset.

Representation

A dataset can be represented in a variety of ways:

  1. Tuples

  2. Nodes and Edges

  3. KV pairs

  4. Documents

  5. Files

  6. Objects

  7. Messages

  8. Logs

Each representation corresponds to a kind of storage system:

  1. Relational Database

  2. Graph Database

  3. KV Store

  4. Document Store

  5. File System

  6. Object Store

  7. Message Queue

  8. Log System

Each type of storage exposes its own interface and suits particular use cases. At a high level, though, all of them can be viewed as datasets made up of some fundamental data component.

Throughout this series, we refer to this component uniformly as an object.
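To make the idea concrete, here is a minimal sketch of how tuples, KV pairs, and log lines could all be wrapped as objects in one dataset type. The names `DataObject` and `DataSet` and the key formats are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List


@dataclass(frozen=True)
class DataObject:
    """Hypothetical uniform 'object': an identifier plus an opaque payload."""
    key: str
    payload: Any


class DataSet:
    """A logical dataset: simply a collection of objects."""

    def __init__(self, objects: Iterable[DataObject]) -> None:
        self._objects: List[DataObject] = list(objects)

    def __iter__(self) -> Iterator[DataObject]:
        return iter(self._objects)

    def __len__(self) -> int:
        return len(self._objects)


# Tuples, as a relational table would hold them.
rows = [("alice", 30), ("bob", 25)]
table = DataSet(DataObject(f"row:{i}", row) for i, row in enumerate(rows))

# KV pairs, as a KV store would hold them.
pairs = {"color": "red", "size": "L"}
kv = DataSet(DataObject(k, v) for k, v in pairs.items())

# Log lines, as a log system would hold them.
lines = ["service started", "request served"]
log = DataSet(DataObject(f"offset:{i}", line) for i, line in enumerate(lines))

# Downstream code can treat all three the same way.
for dataset in (table, kv, log):
    print(len(dataset), [obj.key for obj in dataset])
```

The payload stays opaque on purpose: the storage-specific shape (row, value, log line) lives inside the object, while the dataset only needs to know that it holds objects.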

Partition

To distribute a logical dataset across different machines, we first split it into several parts, which are called shards, partitions, or splits. The result is a three-level abstraction:

DataSet - Partition - Object
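As a rough sketch of the three levels, the snippet below splits a small dataset into partitions by hashing each object's key. Hash partitioning is only one possible scheme and is an assumption here (range partitioning would be another); the helper names `partition_index` and `split_dataset` and the `user:N` keys are hypothetical.

```python
import zlib
from typing import Any, Iterable, List, Tuple

Obj = Tuple[str, Any]      # an object: (key, payload)
Partition = List[Obj]      # a partition: a list of objects


def partition_index(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    zlib.crc32 is used instead of the built-in hash(), whose result for
    strings can differ between Python processes.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_dataset(objects: Iterable[Obj], num_partitions: int) -> List[Partition]:
    """Split a dataset (an iterable of objects) into num_partitions partitions."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    partitions: List[Partition] = [[] for _ in range(num_partitions)]
    for key, payload in objects:
        partitions[partition_index(key, num_partitions)].append((key, payload))
    return partitions


# DataSet: ten objects. Partition: three lists. Object: one (key, payload) pair.
dataset = [(f"user:{i}", {"id": i}) for i in range(10)]
partitions = split_dataset(dataset, 3)

for i, part in enumerate(partitions):
    print(f"partition {i}: {[key for key, _ in part]}")
```

In a real deployment each partition would be placed on a machine; here the partitions are just in-memory lists, which is enough to show that every object lands in exactly one partition.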