Documentation of the initial SEACOOS DB schema for in-situ observations

To provide a baseline for evaluating alternatives. This schema is still in use at UNC-CH and USC. Props to Charlton Galvarino and Jeremy Cothran for development of the schema itself. While not perfect, it has continued to supply to needs of SEACOOS and now SECOORA. Their hard work is documented below.

The documentation below uses Sea Surface Temperature (SST) as an example since most elements of the database are common for each variable.


Existing Instances

USC - original (multiple instances actually, for Archive v RT and round-robin updating)

  • supporting RS map layers for SEACOOS and SECOORA and legacy web services

UNC Chapel Hill - redundant realtime DB, spun up fall 2006.

  • supporting cached map images, web services, and map layers for SEACOOS and SECOORA.


Schema Overview

The existing SEACOOS DB schema was not designed all at once. It evolved over time with the SEACOOS enterprise and so does not have a single form. Instances running at USC and UNC-CH are slightly different as local services vary. The following documentation gives the general pictures of the table structure, functions, indexes, constraints, and sequences used to make the SEACOOS data commons work.

Tables

The DB is a collection of 50 tables for storing data, metadata, and database internals.
All tables.

The schema is organized around individual 'table-sets' for each variable. The existing schema includes table-sets for: Air Pressure, Air Temperature, Bottom Temperature, Currents (RS), Currents (IS), Salinity, SST, Water Level, Wave (IS), Wind.
All variable tables.

There are also supporting tables for organizing consistent station_ids, storing observations by station_id, time handling, and logging map views parameters (unused at UNC).
All support tables.


SST table-set:

http://trac.secoora.org/datamgmt/attachment/wiki/SeacoosDB/sst_tableset.jpg?format=raw


Functions

The SEACOOS DB has some 350+ functions defined for it. Many are builtin geospatial functions from the PostGIS extension. Documented below are the functions relevant to in-situ observation aggregation, developed specifically for the SEACOOS DB. These functions do time conversions, top of hour determination, vector processing, value preformatting, and unit conversion. See the Functions page for a full list of all SEACOOS-developed functions.

The SST table-set relies on several functions:
sst_labels_and_times()
set_sst_prod_show()

Indexes

The DB has 150+ indexes defined for it. These speed up queries, updates, and prevent duplicate insertions.

sst_prod indexes:

Name Definition Constraints
pk_id_time_stamp CREATE UNIQUE INDEX pk_id_time_stamp ON sst_prod USING btree (station_id, time_stamp) Primary key
sst_prod_gist CREATE INDEX sst_prod_gist ON sst_prod USING gist (the_geom)
sst_prod_id_z_rep_time CREATE INDEX sst_prod_id_z_rep_time ON sst_prod USING btree (station_id, z, report_time_stamp)
sst_prod_report_time_stamp CREATE INDEX sst_prod_report_time_stamp ON sst_prod USING btree (report_time_stamp)
sst_prod_seq CREATE INDEX sst_prod_seq ON sst_prod USING btree (seq)
sst_prod_station_id CREATE INDEX sst_prod_station_id ON sst_prod USING btree (station_id)
sst_prod_time_stamp CREATE INDEX sst_prod_time_stamp ON sst_prod USING btree (time_stamp)
sst_prod_top_of_hour CREATE INDEX sst_prod_top_of_hour ON sst_prod USING btree (top_of_hour)

sst_map table indexes:

Name Definition Constraints
bottom_water_temp_map_report_time_stamp CREATE INDEX bottom_water_temp_map_report_time_stamp ON sst_map USING btree (report_time_stamp)
bottom_water_temp_map_report_time_stamp_z CREATE INDEX bottom_water_temp_map_report_time_stamp_z ON sst_map USING btree (report_time_stamp, label_z)
sst_map_report_time_stamp CREATE INDEX sst_map_report_time_stamp ON sst_map USING btree (report_time_stamp)
sst_map_report_time_stamp_z CREATE INDEX sst_map_report_time_stamp_z ON sst_map USING btree (report_time_stamp, label_z)
sst_map_seq_key CREATE UNIQUE INDEX sst_map_seq_key ON sst_map USING btree (seq) Unique key

sst_changes table indexes:

Name Definition Constraints
sst_changes_id CREATE INDEX sst_changes_id ON sst_changes USING btree (station_id)
sst_changes_report_time_stamp CREATE INDEX sst_changes_report_time_stamp ON sst_changes USING btree (report_time_stamp)
sst_changes_time_stamp CREATE INDEX sst_changes_time_stamp ON sst_changes USING btree (time_stamp)

sst_time_stamp_range table indexes:

Name Definition Constraints
pk_time_stamp CREATE UNIQUE INDEX pk_time_stamp ON sst_time_stamp_range USING btree (time_stamp) Primary key


Constraints

Each table in the DB has a number of constraints defined for fields within it. These do basic data verification and limit insertions of bad data or metadata.

sst_prod table constraints:

Name Definition
$1 CHECK (srid(the_geom) = -1)
$2 CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL)
pk_id_time_stamp PRIMARY KEY (station_id, time_stamp)
positive_up_down CHECK (positive::text = 'up'::character varying::text OR positive::text = 'down'::character varying::text OR positive::text = ::character varying::text)

sst_map table constraints:

Name Definition
$1 CHECK (srid(the_geom) = -1)
$2 CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL)
sst_map_seq_key UNIQUE (seq)


Sequences

The DB had 20+ sequences defined on it (2 per variable table-set). These serve as unique row identifiers.

The SST table-set has a sequence for each of its two main tables:

sst_map_seq_seq value into sst_map table, seq column
sst_prod_seq value into sst_prod table, seq column



Aggregation Overview

Essentially the aggregation process grabs data from netCDF files via http, parses them, and inserts data into variable specific *_prod tables. Additional fields are added via triggers on INSERT/UPDATE to the _prod tables. Functions are then run to identify which observations are the closest to the top of each hour. Top of the hour observations are then inserted into the _map table for direct access by web services and web mapping applications. See the SST example

  1. Bash script to coordinate pieces (UNC specific, not sure about USC)
  2. Data scout downloads data files only if their creation date has changed since last scout run
  3. Scout parses each data file into individual station+variable files and station_id SQL INSERT statements
  4. Process Station files
    • Cat station SQL files together into a single file
    • Run psql against station INSERT SQL file: INSERTs into in_situ_station_id table
    • Will not insert duplicate stations: unique index on station_id
  5. Process Variable files
    • Cat variable SQL files together into a single file
    • Run psql against variable INSERT SQL file: INSERTs into *_prod table(s)
    • will not insert duplicate observations: unique index on station_id and time_stamp
  6. Trigger(s) initiate function(s) on INSERT/UPDATE to the *_prod tables
    • SST example: Function: sst_labels_and_times as a trigger on sst_prod table
    • eliminate data that are too old (> 2 days)
    • rough range checking
    • unit conversions
    • round time_stamp (TS) to nearest hour -> report_time_stamp (RTS)
    • populates value fields - like "15 deg C at 5m" string.
    • station_ID, TS, and RTS are inserted into *_changes table
    • all new fields are inserted into *_prod table
  7. Run the set_*_prod_show function(s): determines "top of hour" observations
    • SST example: Function: set_sst_prod_show()
    • for each unique station_id and RTS in *_changes table
    • uses RTS and TS to determine a single observation to serve as hourly value for that station.
    • sets top_of_hour field to 1 (yes) or null for each row in *_prod
    • truncate (drop all data) *_changes table.
  8. Run the POPULATE script for each variable to populate *_map tables with top of hour values only
    • SST example: Script: populate_sst_map.sql
    • truncate (drop all data) *_map table.
    • inserts values from *_prod into *_map where top_of_hour is 1.
  9. Run the populate_obs_by_station_id.sql to populate obs_by_station_id table
    • obs_by_station_id table has the past 2 day data by station_id



Performance

We have been timing the Aggregation routine since its inception at UNC. One big caveat: the UNC DB instance only contains two weeks of data supporting latest visualizations and map layers.

Aggregation Runtime: since 11/16/2006 changeover to coriolis.marine.unc.edu

Parse Inserts Set prod_show functions Populate Entire RoutineHourly routines counted
3' 21'4.8'11' 39.3' n = 1182Before Cluster, ReIndex done nightly
3' 18.5'5.2'9.7' 35' n = 12300 After Cluster, ReIndex done nightly



Nightly Maintenance

Initially we only deleted excess data from the DB nightly. Then we added a vacuum routine and then cluster + reindex procedures. This has made some improvements in both update performance and service performance. The biggest gain has been in the nightly routine itself which is now more thorough and much faster.

  1. Cluster largest tables
    cluster wind_prod__time_stamp on wind_prod;
    cluster sst_prod__time_stamp on sst_prod;
    cluster water_level_prod__time_stamp on water_level_prod;
    cluster air_pressure_prod__time_stamp on air_pressure_prod;
    cluster air_temperature_prod__time_stamp on air_temperature_prod;
    
  2. Delete data older than 2 weeks (UNC specifc) in *_prod tables
  3. Reindex *_prod tables
    • psql -U UID -d DB -h HOST -c "reindex table *_prod"
  4. Vacuum the entire DB
    • vacuumdb -U UID -d DB -h HOST -f -v -z
    • -f: full
    • -v: verbose
    • -z: analyze

Maintenance Runtime: since 11/16/2006 changeover to coriolis.marine.unc.edu

Cluster, Delete, ReIndex VacuumOverall MaintenanceNightly Runs counted
18'248'266'n = 62 Before Cluster, ReIndex done nightly
38'53' 90'n = 589 After Cluster, ReIndex done nightly


Archive

The primary archive responsibilities remain with the data providers as they are always the most familiar with their own data quality and processing needs. AT USC, as in-situ data becomes older than 2 weeks old, it is moved to a similarly structured but separate database for archival records. No archiving activities are onging at UNC-CH.



Attachments