Difference between revisions of "Encoding Relational Tables in NetCDF"
| Line 225: | Line 225: | ||
** Field EB_STATION_CODE is the unique code refers to the observation '''station'''. | ** Field EB_STATION_CODE is the unique code refers to the observation '''station'''. | ||
| − | ==== Encoding the Essential Schema in netCDF === | + | ==== Encoding the Essential Schema in netCDF ==== |
| + | |||
| + | This shows encoding a small subset of the essential schema. Table names are omitted in attributes for clarity. | ||
| + | |||
| + | dimensions: | ||
| + | |||
| + | EB_STATION = 120 | ||
| + | EB_STATION_CODE_length = 7 | ||
| + | EB_NAME_length = 40 | ||
| + | |||
| + | EC_COMPONENTS = 14 | ||
| + | EM_MATRIX_NAME_length = 40 | ||
| + | EC_COMP_NAME_length = 100 | ||
| + | EC_UNIT_length = 8 | ||
| + | |||
| + | DS_SETKEY_length = 10 | ||
| + | PR_ACRONYM_length = 10 | ||
| + | |||
| + | A1_TIME = UNLIMITED | ||
| + | |||
| + | variables: | ||
| + | // fields for table EB_STATION | ||
| + | char EB_STATION_CODE(EB_STATION,EB_STATION_CODE_length) | ||
| + | table_name = "EB_STATION" | ||
| + | primary_key = "T" | ||
| + | char EB_NAME(EB_STATION,EB_NAME_length) | ||
| + | table_name = "EB_STATION" | ||
| + | float EB_LONGITUDE(EB_STATION) | ||
| + | table_name = "EB_STATION" | ||
| + | float EB_LATITUDE(EB_STATION) | ||
| + | table_name = "EB_STATION" | ||
| + | etc... | ||
| + | |||
| + | // fields for table EC_COMPONENTS | ||
| + | char EM_MATRIX_NAME(EC_COMPONENTS,EM_MATRIX_NAME_length) | ||
| + | table_name = "EC_COMPONENTS" | ||
| + | primary_key = "T" | ||
| + | char EC_COMP_NAME(EC_COMPONENTS,EC_COMP_NAME_length) | ||
| + | table_name = "EC_COMPONENTS" | ||
| + | primary_key = "T" | ||
| + | char EC_UNIT(EC_COMPONENTS,EC_UNIT_length) | ||
| + | table_name = "EC_COMPONENTS" | ||
| + | etc... | ||
| + | |||
| + | // Table EC_COMPONENTS is encoded as a multidimensional array | ||
| + | // fields for table EC_COMPONENTS | ||
| + | char DS_SETKEY(EB_STATION,EC_COMPONENTS,DS_SETKEY_length) | ||
| + | table_name = "DS_DATA_SETS" | ||
| + | char PR_ACRONYM(EB_STATION,EC_COMPONENTS,PR_ACRONYM_length) | ||
| + | table_name = "DS_DATA_SETS" | ||
| + | etc... | ||
| + | |||
| + | // Table A1_TIME | ||
| + | char station_code(station, station_code_length) ; | ||
| + | station_code:''table_name = "station"''; | ||
| + | station_code:''primary_key = "T"''; | ||
| + | float lon(station) ; | ||
| + | lon:''table_name = "station"''; | ||
| + | float lat(station) ; | ||
| + | lat:''table_name = "station"''; | ||
| + | int time(time) ; | ||
| + | time:table_name = "time"; | ||
| + | time:primary_key = "T"; | ||
| + | '''int data_station_code(data) ;''' | ||
| + | '''data_station_code:table_name = "data";''' | ||
| + | '''data_station_code:foreign_key = "station_code";''' | ||
| + | '''data_station_code:primary_key = "T";''' | ||
| + | '''int data_time(data) ;''' | ||
| + | '''data_time:table_name = "data";''' | ||
| + | '''data_time:foreign_key = "time";''' | ||
| + | '''data_time:primary_key = "T";''' | ||
| + | '''float temperature(data) ; ''' | ||
| + | '''temperature:table_name = "data";''' | ||
| + | '''float humidity(data) ; ''' | ||
| + | '''humidity:table_name = "data";''' | ||
| + | |||
| + | The variable data_station_code is marked ''foreign_key = "station_code"'', which means, that the variable is an index to the station_code variable. Similar for time. Both are marked ''primary_key = "T"'' which means, that the combination of (data_station_code, data_time) is unique. | ||
| + | |||
| + | To read the full table (station_code, lat, lon, time, temperature, humidity), row by row: | ||
| + | for idx in range of data | ||
| + | read station: | ||
| + | read data_station_code(idx) | ||
| + | read variables station_code, lat and lon at that index | ||
| + | read time: | ||
| + | read data_time(idx) | ||
| + | read variable time at that index | ||
| + | read temperature(idx) | ||
| + | read humidity(idx) | ||
| + | |||
| + | This is an space-efficient encoding for sparse tables. | ||
Revision as of 12:09, August 31, 2011
Back to Questions and Comments about CF-1.6 Station Data Convention
Rational Behind this Proposal
The WCS service has potential to grow beyond regular, gridded coverages. Coverage types that need to be described are for example:
- station point data
- station trajectory data
- lidar data
- moving station data, like aeroplane or ship
With a relational system, it is easy to design a schema that captures all the data and the metadata.
The netCDF data format was designed to contain mainly gridded multidimensional data. The format does not have a special support for tables, which sounds like a severe restriction, but it's not.
- It's possible to agree about a simple way to encode tables in a netCDF file.
- Tables with multiple fields.
- Proper primary keys for tables.
- Foreign keys for referential integrity.
- Common relationships: many-to-many, 1-to-many, 1-to-1.
- It's possible to be efficient, the encoding does not have to be one size fits all.
- Sparse tables can be encoded row-by-row manner.
- Dense tables can be encoded as multidimensional hypercubes.
- Mix and match the two above at will.
- It's possible to be elegant.
- No mysterious indirections.
- No complex offset calculations.
- A reader-writer library can take care of following the rules, letting a programmer to operate on higher level.
Encoding Table Columns
Table is a collection of columns. These are indicated with table_name attribute.
Every table needs a primary key. It is indicated with primary_key = "T" attribute.
Example: A station table with three columns: station_code, lat and lon. There are ten rows with indexes 0..9
dimensions:
station = 10 ; // station table dimension
station_code_length = 4 ; // artificial dimension for station code, with max four character length.
variables:
char station_code(station, station_code_length) ;
station_code:table_name = "station";
station_code:primary_key = "T";
float lon(station) ;
lon:table_name = "station";
float lat(station) ;
lat:table_name = "station";
The dimension name station can be anything. It does not have to equal the table name station, but should be for readability. Same for station_code_length, it can be called anything.
A table can have multiple fields as a primary key.
Encoding Multiple Tables
Adding the second table is similar. Since new data is usually added by time, it can be made the unlimited dimension.
dimensions:
station = 10 ; // station table dimension
station_code_length = 4 ; // artificial dimension for station code, with max four character length.
time = UNLIMITED ;
variables:
char station_code(station, station_code_length) ;
station_code:table_name = "station";
station_code:primary_key = "T";
float lon(station) ;
lon:table_name = "station";
float lat(station) ;
lat:table_name = "station";
int time(time) ;
time:table_name = "time";
time:primary_key = "T";
Encoding Relations
A typical station data instance is a relation (station_code, time, temperature, humidity). The pair (station_code, time) is the primary key; data is recorded at a certain station, at a certain time. Temperature and humidity are the recorded parameters.
Encoding this into netCDF can be done in two ways, using either variable or a dimension.
Encoding Relation as a Record Array
So far we have the two unrelated tables for our two dimensions: location and time. Now let's add humidity and temperature, which refers to both of the dimensions:
dimensions:
station = 10 ; // station table dimension
station_code_length = 4 ; // artificial dimension for station code, with max four character length.
data = UNLIMITED // each row in the data table.
time = 2000 ; we can't have two unlimited dimensions.
variables:
char station_code(station, station_code_length) ;
station_code:table_name = "station";
station_code:primary_key = "T";
float lon(station) ;
lon:table_name = "station";
float lat(station) ;
lat:table_name = "station";
int time(time) ;
time:table_name = "time";
time:primary_key = "T";
int data_station_code(data) ;
data_station_code:table_name = "data";
data_station_code:foreign_key = "station_code";
data_station_code:primary_key = "T";
int data_time(data) ;
data_time:table_name = "data";
data_time:foreign_key = "time";
data_time:primary_key = "T";
float temperature(data) ;
temperature:table_name = "data";
float humidity(data) ;
humidity:table_name = "data";
The variable data_station_code is marked foreign_key = "station_code", which means, that the variable is an index to the station_code variable. Similar for time. Both are marked primary_key = "T" which means, that the combination of (data_station_code, data_time) is unique.
To read the full table (station_code, lat, lon, time, temperature, humidity), row by row:
for idx in range of data
read station:
read data_station_code(idx)
read variables station_code, lat and lon at that index
read time:
read data_time(idx)
read variable time at that index
read temperature(idx)
read humidity(idx)
This is an space-efficient encoding for sparse tables.
Encoding Relation as a Multidimensional Array
This is identical to CF-1.6 encoding. Both variables are 2-dimensional cubes.
dimensions:
station = 10 ; // station table dimension
station_code_length = 4 ; // artificial dimension for station code, with max four character length.
time = UNLIMITED ; // we can again add one time slice at a time
variables:
char station_code(station, station_code_length) ;
station_code:table_name = "station";
station_code:primary_key = "T";
float lon(station) ;
lon:table_name = "station";
float lat(station) ;
lat:table_name = "station";
int time(time) ;
time:table_name = "time";
time:primary_key = "T";
float temperature(time,station) ;
temperature:table_name = "data";
float humidity(time,station) ;
humidity:table_name = "data";
This declares two arrays, one for temperature and one for humidity. Both are two-dimensional. The dimensions time and station refer to the time and station variables. They implicitly became part of the primary key.
To find temperature at "ACAD" in "2004-17-22":
Find index for "ACAD" from the "station_code" array. Let's say it's 2.
Find index for "2004-17-22" from the time array. Lets say it's 317.
Read temperature(317,2)
This is a very space-efficient encoding, if most of the stations have data most of the time.
Summa Summarum
- A typical schema of any relational database can be encoded.
- If the data is sparse, foreign keys can be expressed as variables.
- If the data is dense, foreign keys can be expressed as dimensions, resulting in vastly smaller files than flat CSV would be.
- There can be any combination of the two, enabling the best of both worlds whatever your data will look like.
More Real World Examples
These examples are not full real-life examples of real live databases, but rather examples how to encode a typical schema. The center is usually the data table, and the data record refers to the auxliary metadata tables. Typically cases are a star or a snowflake schema.
CIRA / VIEWS
All the data is stored at the center table AirData3. The
NILU / EBAS
Full Schema
The EBAS database schema is rather large, full schema outline looks like this:
Essential Schema
The essential schema is not that big, only four tables.
The essential four tables are:
- Table EB_STATION: This contains the essential of the station information.
- Field EB_STATION_CODE is the unique code.
- Fields for lat, lon, alt; some miscellaneous fields about the history of the station.
- Table EC_COMPONENTS: This table contains the measured parameters. Fields EM_MATRIX_NAME and EC_COMP_NAME define what is being measured.
- Field EM_MATRIX_NAME defines which medium the parameter is measured in (e.g. precipitation, air, pm10, ...)
- Field EC_COMP_NAME is the name of the parameter (e.g. sulphate_total, ...)
- Various other fields, like EC_UNIT define more metadata.
- Table DS_DATA_SET describes what is called a dataset in EBAS: One parameter measured at a certain station. The full version has some more complex dependecies on instrument, method etc.
- Fields EM_MATRIX_NAME and EC_COMP_NAME refer to the EC_COMPONENTS table.
- Fields DS_STARTDATE and DS_ENDDATE provide the timestamp of the first and last measurement in the timeseries.
- There are many other fields and references to many other tables.
You can imagine a dataset as an instrument on a fixed location measuring one parameter. This leads us to the main data table.
- Table A1_TIME Contains all the observations, all the parameters from all the stations
- Field DS_SETKEY refers to the dataset.
- Fields A1_STARTTIME and A1_ENDTIME define the measurement time.
- Field A1_VALUE contains the data value
- The rest of the fields are metadata.
Reading EBAS Data
So a single data row from A1_TIME can be interpreted as follows:
- The observation value is A1_VALUE
- The observation time is (A1_STARTTIME to A1_ENDTIME)
- The observation dataset is found with DS_SETKEY from DS_DATA_SET table.
- The DS_DATA_SET contains EM_MATRIX_NAME and EC_COMP_NAME that define the observation parameter.
- Field EB_STATION_CODE is the unique code refers to the observation station.
Encoding the Essential Schema in netCDF
This shows encoding a small subset of the essential schema. Table names are omitted in attributes for clarity.
dimensions:
EB_STATION = 120
EB_STATION_CODE_length = 7
EB_NAME_length = 40
EC_COMPONENTS = 14
EM_MATRIX_NAME_length = 40
EC_COMP_NAME_length = 100
EC_UNIT_length = 8
DS_SETKEY_length = 10
PR_ACRONYM_length = 10
A1_TIME = UNLIMITED
variables:
// fields for table EB_STATION
char EB_STATION_CODE(EB_STATION,EB_STATION_CODE_length)
table_name = "EB_STATION"
primary_key = "T"
char EB_NAME(EB_STATION,EB_NAME_length)
table_name = "EB_STATION"
float EB_LONGITUDE(EB_STATION)
table_name = "EB_STATION"
float EB_LATITUDE(EB_STATION)
table_name = "EB_STATION"
etc...
// fields for table EC_COMPONENTS
char EM_MATRIX_NAME(EC_COMPONENTS,EM_MATRIX_NAME_length)
table_name = "EC_COMPONENTS"
primary_key = "T"
char EC_COMP_NAME(EC_COMPONENTS,EC_COMP_NAME_length)
table_name = "EC_COMPONENTS"
primary_key = "T"
char EC_UNIT(EC_COMPONENTS,EC_UNIT_length)
table_name = "EC_COMPONENTS"
etc...
// Table EC_COMPONENTS is encoded as a multidimensional array
// fields for table EC_COMPONENTS
char DS_SETKEY(EB_STATION,EC_COMPONENTS,DS_SETKEY_length)
table_name = "DS_DATA_SETS"
char PR_ACRONYM(EB_STATION,EC_COMPONENTS,PR_ACRONYM_length)
table_name = "DS_DATA_SETS"
etc...
// Table A1_TIME
char station_code(station, station_code_length) ;
station_code:table_name = "station";
station_code:primary_key = "T";
float lon(station) ;
lon:table_name = "station";
float lat(station) ;
lat:table_name = "station";
int time(time) ;
time:table_name = "time";
time:primary_key = "T";
int data_station_code(data) ;
data_station_code:table_name = "data";
data_station_code:foreign_key = "station_code";
data_station_code:primary_key = "T";
int data_time(data) ;
data_time:table_name = "data";
data_time:foreign_key = "time";
data_time:primary_key = "T";
float temperature(data) ;
temperature:table_name = "data";
float humidity(data) ;
humidity:table_name = "data";
The variable data_station_code is marked foreign_key = "station_code", which means, that the variable is an index to the station_code variable. Similar for time. Both are marked primary_key = "T" which means, that the combination of (data_station_code, data_time) is unique.
To read the full table (station_code, lat, lon, time, temperature, humidity), row by row:
for idx in range of data
read station:
read data_station_code(idx)
read variables station_code, lat and lon at that index
read time:
read data_time(idx)
read variable time at that index
read temperature(idx)
read humidity(idx)
This is an space-efficient encoding for sparse tables.