Data Load Design

Key point: this is classic ETL, so let's reuse those patterns and tooling.


In more detail, dividing the ET (extract and transform) steps from the L (load) step:
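The ET/L split can be sketched as plain functions: extract-and-transform produces clean, typed rows plus a schema, and load only writes. A minimal sketch; the function and field names are illustrative, not from this design:

```python
import csv
import io

def extract_transform(raw_csv: str):
    """The 'ET' half: parse raw CSV text and cast cells to typed records."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = [{"id": int(r["id"]), "name": r["name"]} for r in reader]
    schema = {"fields": [{"name": "id", "type": "integer"},
                         {"name": "name", "type": "string"}]}
    return rows, schema

def load(rows, schema, sink):
    """The 'L' half: write already-clean rows to a sink (a list stands in for the DB here)."""
    sink.extend(rows)

sink = []
rows, schema = extract_transform("id,name\n1,ann\n2,bob\n")
load(rows, schema, sink)
```

The point of the split is that load becomes trivial and restartable: all validation and casting has already happened by the time rows reach it.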

Load step in even more detail


We will use Airflow.
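A sketch of what the pipeline could look like as an Airflow DAG. The DAG id, task names, and the `transform`/`load` callables are placeholders, and the Airflow import is guarded so the sketch also reads/runs where Airflow isn't installed:

```python
def transform(**context):
    # Placeholder: validate + cast the raw file into a Tabular Data Resource.
    return "transformed"

def load(**context):
    # Placeholder: push the clean table into the DataStore.
    return "loaded"

try:
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(dag_id="data_load",
             start_date=datetime(2019, 9, 1),
             schedule=None) as dag:
        t1 = PythonOperator(task_id="transform", python_callable=transform)
        t2 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2  # load runs only after transform succeeds
except Exception:
    pass  # Airflow not installed (or API differs); the callables above still show the shape
```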


What is a Tabular Data Resource?

See Frictionless Specs. For our purposes:

NB: even if you want to go the direct-loading route (a la XLoader) and ignore types, you still need encoding etc. sorted out – and it still fits in the diagram above (the Table Schema is just trivial: every field is a string).
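For reference, a minimal Tabular Data Resource descriptor in the shape the Frictionless specs describe; the name, path, and field values are illustrative. In the XLoader-style degenerate case the schema would simply declare every field as `string`:

```python
# Minimal Tabular Data Resource descriptor (Frictionless specs); values are illustrative.
resource = {
    "name": "my-table",
    "path": "data.csv",
    "profile": "tabular-data-resource",
    "format": "csv",
    "encoding": "utf-8",        # encoding must be sorted out even when types are skipped
    "schema": {                 # Table Schema; for direct loading this is all-strings
        "fields": [
            {"name": "a", "type": "string"},
            {"name": "b", "type": "integer"},
        ]
    },
}
```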

What is the DataStore and how to create a DataStore entry

Create an entry (via CKAN's datastore_create action):

    curl -X POST {CKAN-URL}/api/3/action/datastore_create \
      -H "Authorization: {YOUR-API-KEY}" \
      -d '{
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [ {"id": "a"}, {"id": "b"} ]
      }'

Options for Loading

There are three different paths we could take:

- Datastore Write API (insert records through CKAN's DataStore API)
- PG Copy (PostgreSQL COPY straight into the DataStore database)
- Dataflows

Pros and Cons of different approaches

| Criteria | Datastore Write API | PG Copy | Dataflows |
| --- | --- | --- | --- |
| Error reporting | Yes | Yes | No(?) |
| Ease of implementation | Yes | No(?) | Yes |
| Works with big data | No | Yes | Yes(?) |
| Works well in parallel | No | Yes(?) | Yes(?) |
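To make the PG Copy row concrete: the usual pattern is to stream rows as CSV into COPY ... FROM STDIN via psycopg2's copy_expert. A sketch under assumed names (table, columns, and connection are placeholders); the database call itself is defined but not executed here:

```python
import csv
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into an in-memory CSV buffer suitable for COPY ... FROM STDIN."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

COPY_SQL = "COPY my_table (a, b) FROM STDIN WITH (FORMAT csv)"  # placeholder table/columns

def copy_rows(conn, rows):
    """Run the COPY against a live psycopg2 connection (requires a database)."""
    with conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, rows_to_copy_buffer(rows))
    conn.commit()

buf = rows_to_copy_buffer([(1, "ann"), (2, "bob")])
```

This is also where the open question in the table bites: COPY is fast but all-or-nothing, so a broken pipe or a bad row aborts the whole batch rather than reporting a per-row error.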


Dataflows is a framework for loading, processing, and manipulating data.

Notes and Q&A (Sep 2019)

Benchmarks (on 1m rows):

- Raw inserts: ~15m
- Inserts wrapped in a begin/commit transaction: ~5m
- COPY: ~82s (though this may be bandwidth-limited – and what happens if the pipe breaks?)
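The begin/commit effect can be illustrated with stdlib sqlite3 (the numbers above were presumably measured against Postgres; this only demonstrates the mechanism of committing once per batch instead of once per row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")

rows = [(i,) for i in range(10_000)]

# All inserts inside a single transaction: one commit at the end,
# instead of one fsync-ing commit per row.
with conn:
    conn.executemany("INSERT INTO t VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```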

Q: Is it better to put everything in the DB as strings and cast later, or to cast first and then insert into the DB? A: Probably cast first and insert after.
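Casting before insert can be driven by the Table Schema itself. A hypothetical helper (the caster table and field list are illustrative, not a real library API):

```python
# Hypothetical helper: cast string cells to Python types per a Table Schema-style field list.
CASTERS = {"integer": int, "number": float, "string": str}

def cast_row(row, fields):
    """Return a copy of `row` with each cell cast per its field's declared type."""
    return {f["id"]: CASTERS[f["type"]](row[f["id"]]) for f in fields}

fields = [{"id": "a", "type": "integer"}, {"id": "b", "type": "string"}]
casted = cast_row({"a": "42", "b": "x"}, fields)
```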

Q: Why do we rush to insert the data into the DB? We will have to wait until it is cast anyway before use. A: It is much faster to do operations in the DB than outside it.