In this post, we will survey data storage formats and data management. The questions we will be asking are:

  • How and where the data is stored
  • Ways in which it is accessed
  • Transformations it undergoes

Assumptions

We will only be looking at open source data formats. Our assumptions are:

  • We are working with a distributed system. Data may be stored in a distributed fashion.
  • The size of data is in hundreds of terabytes.
  • Data is accessed by several consumers during the development lifecycle, for various purposes:
    • Configuration
    • Message passing
    • Data ingestion
    • Data analysis
    • Machine learning model training
    • Simulation

The use case we have in mind is data for robotics and self-driving cars. However, we won’t restrict the discussion to these domains. The data storage formats we are considering can apply to many other fields that deal with distributed computing and hundreds of terabytes of data.

Form follows function

There are many uses for data, and the storage format should be adapted to the function the data serves.

  • We make no endorsement of any single storage format!
  • … We try to understand the function first
  • … Each function has 2-3 storage format options that are appropriate

We will be talking about:

  • Config & web file formats:
    • JSON, YAML, XML
  • Serialization file formats:
    • Pickle, NPZ
    • Protobuf, Thrift, AVRO
  • Databases
    • SQL (Relational) vs NoSQL (BigData)
    • Row-based, columnar
  • Columnar file formats
    • Parquet, DeltaLake, ORC
  • Data Lakehouse architecture
    • Databricks, AWS SageMaker
  • Deep Learning data loader
    • TFRecord (file format)
    • Petastorm (data loader library)
    • DVC (data repository)

JSON

JSON stands for JavaScript Object Notation. It has a lightweight syntax and is easy for humans to read. Here is an example JSON file:

{
  "employees": [
    {
      "firstName":"John",
      "lastName":"Doe"
    },
    {
      "firstName":"Anna", 
      "lastName":"Smith"
    },
    {
      "firstName":"Peter", 
      "lastName":"Jones"
    }
  ]
}

Readers familiar with JSON will immediately recognize that this defines an employees list, with each list element an object with two fields, firstName and lastName.

Why is JSON popular?

  • The content is easy to ‘figure out’ without a schema
  • JSON is widely used in web and client/server applications, because it allows client software to be developed independently of the web server software.

What limitations does JSON have?

  • It is schemaless.
  • JSON is not size efficient.
  • … And JSON does not allow comments.

Being schemaless is both an advantage and a disadvantage of JSON. More precisely, users can define their own schema for JSON files, but they have to supply their own schema validation, because JSON has no native concept of a schema.
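For instance, here is a minimal sketch of validating the employees document against an externally defined schema, using the third-party jsonschema package (the package choice and the schema itself are ours, not part of the JSON standard):

# A minimal sketch: validating the employees document with an external schema.
# Assumes the third-party 'jsonschema' package (pip install jsonschema).
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "employees": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "firstName": {"type": "string"},
                    "lastName": {"type": "string"},
                },
                "required": ["firstName", "lastName"],
            },
        }
    },
    "required": ["employees"],
}

document = json.loads('{"employees": [{"firstName": "John", "lastName": "Doe"}]}')
try:
    validate(instance=document, schema=schema)
    print("document conforms to the schema")
except ValidationError as err:
    print("schema violation:", err.message)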

YAML

The YAML format is…

  • Used for configuration
  • … Could be used for data serialization
  • Created to solve problems of XML
  • Simple to read & write by humans

Here is the same example contents expressed in YAML:

employees:
  - firstName: John
    lastName: Doe
  - firstName: Anna 
    lastName: Smith
  - firstName: Peter
    lastName: Jones
# Some comment

As you can see, YAML allows comments. Indentation is used to express nesting. The basic format is an enumeration of key: value elements; list elements are introduced with a - character.
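As a quick illustration, here is a minimal sketch of loading such a file from Python with the third-party PyYAML package (the library choice is ours; any YAML parser would do):

# A minimal sketch: parsing the employees YAML with PyYAML (pip install pyyaml).
import yaml

text = """
employees:
  - firstName: John
    lastName: Doe
  - firstName: Anna
    lastName: Smith
# Some comment
"""

data = yaml.safe_load(text)   # safe_load avoids executing arbitrary tags
print(data["employees"][0]["firstName"])  # -> John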

Why is YAML popular?

  • It is widely used for configuration files for dev/ops

What limitations does it have?

  • It is schemaless
  • Not size efficient
  • When configs are large and repetitive, they become difficult to read.

A typical difficulty is figuring out the indentation of YAML blocks when the file spans multiple screens that need to be paged-up and paged-down.

XML

  • XML is often used when data is sent from a web server to a client.
  • It is a markup language. You can add attributes to entities.
  • HTML was the original language of the Web. XML generalizes the same SGML-style markup beyond web pages, to arbitrary data.
  • XML supports syntax enforcement through DTDs and Schemas

Here is our document expressed as XML:

<employees>
   <person firstName="John"
           lastName="Doe" />
   <person firstName="Anna"
           lastName="Smith" />
   <person firstName="Peter"
           lastName="Jones" />
</employees>
<!-- Some comment -->

<!-- DTD schema (in a real document, the DOCTYPE precedes the root element) -->
<!DOCTYPE employees [
   <!ELEMENT employees (person*)>
   <!ELEMENT person EMPTY>
   <!ATTLIST person firstName CDATA #REQUIRED>
   <!ATTLIST person lastName CDATA #REQUIRED>
]>

XML distinguishes between elements (in our case, employees and person) and attributes (in our case, firstName and lastName).

The person element is empty and carries only attributes. The employees element has no attributes, and contains multiple person child elements.
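Here is a minimal sketch of reading this document from Python with the standard library's xml.etree.ElementTree module:

# A minimal sketch: parsing the employees XML with the Python standard library.
import xml.etree.ElementTree as ET

text = """
<employees>
   <person firstName="John" lastName="Doe" />
   <person firstName="Anna" lastName="Smith" />
</employees>
"""

root = ET.fromstring(text)
for person in root.findall("person"):
    # Attributes are read with .get(); element text would be .text
    print(person.get("firstName"), person.get("lastName"))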

When is XML popular?

  • When you want different renderings depending on style
  • When you need strong schema support
  • For a long time, XML was used to store configuration (until YAML came along)

Limitations of XML:

  • Verbose; it is hard for humans to keep closing tags balanced
  • Not size efficient

In robotics, XML is used by ROS (Robot Operating System) and by the Gazebo simulator.

Object Serialization

The setup for serialization is as follows:

  • Assume we have an object with embedded pointers
  • We want to pass it to another process
    • One time
    • Or, many times, repeatedly
  • The CPU of process 1 could be big endian, while the CPU of process 2 could be little endian (see the byte-order sketch after this list)

  • … Or, you could save state to disk across process runs (This is more similar to sending it one time)

  • … Or, you could do more complicated things:
    • Save a set of instances of the same object into a database table, along with the object schema
    • Convert an entire database table into a set of instances of the same object, creating the object schema dynamically
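To make the byte-order point concrete, here is a small sketch using Python's built-in struct module; it shows the same 32-bit integer packed in both conventions:

# A small sketch: the same 32-bit integer in big- and little-endian byte order.
import struct

value = 1337
big    = struct.pack(">i", value)  # big endian:    b'\x00\x00\x059'
little = struct.pack("<i", value)  # little endian: b'9\x05\x00\x00'
print(big.hex(), little.hex())     # 00000539 39050000

# A serialization format must pick one convention so both processes agree.
assert struct.unpack(">i", big)[0] == struct.unpack("<i", little)[0] == value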

Pickle

  • Used to serialize a Python object
  • Can save to disk, restore from disk
  • The format is relatively compact
  • … And the format is self-describing

Example:

# Save a dictionary into a pickle file.
import pickle

favorite_color = {"lion": "yellow", "kitty": "red"}
with open("save.pkl", "wb") as f:
    pickle.dump(favorite_color, f)

# Load the dictionary back from the pickle file.
with open("save.pkl", "rb") as f:
    favorite_color1 = pickle.load(f)
# favorite_color1 is now a clone of favorite_color

The pickle format is

  • Only supported in Python
  • Easy to use
  • Mostly for saving/restoring objects to disk
  • Not really for inter-process communication
  • A bit inefficient
  • Be careful when writing from one version of Python, and reading from a different version!


NPZ

  • Available in the Python numpy module only
  • It is similar to pickle, but for saving multiple numpy arrays into a single archive

Example:

>>> import numpy as np
>>> from tempfile import TemporaryFile
>>> x = np.arange(10)
>>> y = np.sin(x)
>>> outfile = TemporaryFile()
>>> np.savez(outfile, x=x, y=y)
>>> _ = outfile.seek(0)  # rewind, as if closing and reopening the file
>>> npzfile = np.load(outfile)
>>> sorted(npzfile.files)
['x', 'y']
>>> npzfile['x']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


Protobuf

This is a serialization technology that has been open sourced by Google.

  • It is language independent
  • Both process 1 and process 2 have a copy of the same object schema, and exchange only data.

  • The schema is separate from the serialized data. Here is an example schema:
//polyline.proto
syntax = "proto2";

message Point {
  required int32 x = 1;
  required int32 y = 2;
  optional string label = 3;
}

message Line {
  required Point start = 1;
  required Point end = 2;
  optional string label = 3;
}

message Polyline {
  repeated Point point = 1;
  optional string label = 2;
}
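
As a sketch of how such a schema is used, assume it has been compiled with protoc --python_out=., which generates a polyline_pb2 module:

# A sketch, assuming a protoc-generated polyline_pb2 module.
import polyline_pb2

point = polyline_pb2.Point(x=3, y=270, label="start")
blob = point.SerializeToString()   # compact binary wire format
print(blob.hex())                  # tagged varints, then a length-prefixed string

restored = polyline_pb2.Point()
restored.ParseFromString(blob)
assert restored.x == 3 and restored.label == "start"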

As another example, to express the following JSON object:

{
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

you would define the following protobuf schema:

// Protobuf Schema:
message Person {
    required string userName        = 1;
    optional int64  favouriteNumber = 2;
    repeated string interests        = 3;
}

On the wire, each Protobuf field is prefixed by a small key that packs the field's tag number together with a wire type; integers are encoded as variable-length varints, and strings are length-prefixed. The keys are what allow a reader to skip fields it does not know about.


Here are the pluses and minuses of protobufs:

  • Efficient encoding
  • Supports schema evolution
  • … But version changes can be difficult if not properly designed

Thrift

Thrift is similar to Protobuf, in that it allows serialization of objects. It was open sourced by Facebook, and is now an Apache project.

More than Protobuf, however, Thrift can be used to generate cross-language services. A .thrift file can be compiled into a service implemented in C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.

To express the following JSON object:

{
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

you would define the following Thrift schema:

// Thrift Schema:
struct Person {
    1: required string       userName,
    2: optional i64          favouriteNumber,
    3: optional list<string> interests
}

Thrift has two encoding protocols: binary and compact. The binary protocol uses fixed-length packing for numeric data types.

The compact protocol uses variable-length packing, making it very similar to the Protobuf encoding scheme.
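
Here is a sketch of serializing the Person struct from Python, assuming the schema was compiled with thrift --gen py person.thrift (which generates a person.ttypes module) and the thrift pip package is installed:

# A sketch, assuming a thrift-generated person.ttypes module.
from thrift.TSerialization import serialize, deserialize
from thrift.protocol.TCompactProtocol import TCompactProtocolFactory
from person.ttypes import Person  # generated by: thrift --gen py person.thrift

martin = Person(userName="Martin", favouriteNumber=1337,
                interests=["daydreaming", "hacking"])

factory = TCompactProtocolFactory()            # variable-length packing
blob = serialize(martin, protocol_factory=factory)
restored = deserialize(Person(), blob, protocol_factory=factory)
assert restored.userName == "Martin"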


AVRO

  • A serialization format similar to Protobuf and Thrift
  • … But, unlike Thrift, it does not emphasize generating cross-language service stubs
  • AVRO from Apache (Protobuf from Google, Thrift from Facebook)
  • Started as a subproject of Hadoop, because Thrift did not solve all use cases necessary in that project.
  • Supports schema evolution
  • You can actually use separate writer and reader schemas with Avro. The parser can use resolution rules to translate data from the writer schema into the reader schema.
  • … But version changes can be difficult

Data (in JSON):

{
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

AVRO Schema:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}
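
Here is a sketch of writing and reading this record from Python with the third-party fastavro package (our library choice; the schema dict below is the JSON form of the IDL record above):

# A sketch using fastavro (pip install fastavro).
import io
from fastavro import schemaless_writer, schemaless_reader

schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, schema, {"userName": "Martin",
                                "favouriteNumber": 1337,
                                "interests": ["daydreaming", "hacking"]})
buf.seek(0)
# The reader needs the writer's schema: Avro data carries no per-field tags.
# (schemaless_reader also accepts a separate reader_schema, for resolution.)
person = schemaless_reader(buf, schema)
assert person["userName"] == "Martin"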

In the AVRO encoding, fields carry no tags at all: values are simply concatenated in the order the schema declares them, with strings and arrays length-prefixed. The data is therefore unreadable without the writer's schema.


AVRO pluses and minuses:

  • In Protobuf, every field in a record is tagged, whereas in AVRO, the entire record, file, or network connection is tagged with a schema version.

Remote Procedure Calls (RPC) with Protobuf, Thrift, AVRO

Thrift and AVRO have RPC support built in; gRPC extends Protobuf for RPC.
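
For instance, a hypothetical gRPC service definition looks like this (proto3 syntax; the service and message names are invented for illustration):

// Hypothetical gRPC service definition (proto3); names invented for illustration.
syntax = "proto3";

message PersonRequest {
  string userName = 1;
}

message PersonReply {
  string userName = 1;
  int64 favouriteNumber = 2;
}

service PersonDirectory {
  // Unary RPC: one request in, one reply out.
  rpc GetPerson (PersonRequest) returns (PersonReply);
}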

The Relational Database Architecture

  • Tables stored in multiple files
  • Indices for faster table lookup
  • Transaction Log, for atomic transactions
  • Catalog
  • Process engine (read, write, log, checkpoint)
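
All of these pieces show up even in the smallest engines. Here is a sketch using Python's built-in sqlite3 module: a table, an index for faster lookup, and an atomic transaction backed by the engine's log:

# A sketch with the built-in sqlite3 engine: table, index, atomic transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (firstName TEXT, lastName TEXT)")
conn.execute("CREATE INDEX idx_lastName ON employees (lastName)")  # faster lookup

with conn:  # the transaction commits atomically, or rolls back on error
    conn.executemany("INSERT INTO employees VALUES (?, ?)",
                     [("John", "Doe"), ("Anna", "Smith"), ("Peter", "Jones")])

rows = conn.execute(
    "SELECT firstName FROM employees WHERE lastName = ?", ("Smith",)).fetchall()
print(rows)  # [('Anna',)]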