Documentation of the initial SEACOOS DB schema for in-situ observations
To provide a baseline for evaluating alternatives. This schema is still in use at UNC-CH and USC. Props to Charlton Galvarino and Jeremy Cothran for development of the schema itself. While not perfect, it has continued to supply to needs of SEACOOS and now SECOORA. Their hard work is documented below.
The documentation below uses Sea Surface Temperature (SST) as an example since most elements of the database are common for each variable.
Existing Instances
USC - original (multiple instances actually, for Archive v RT and round-robin updating)
- supporting RS map layers for SEACOOS and SECOORA and legacy web services
UNC Chapel Hill - redundant realtime DB, spun up fall 2006.
- supporting cached map images, web services, and map layers for SEACOOS and SECOORA.
Schema Overview
The existing SEACOOS DB schema was not designed all at once. It evolved over time with the SEACOOS enterprise and so does not have a single form. Instances running at USC and UNC-CH are slightly different as local services vary. The following documentation gives the general pictures of the table structure, functions, indexes, constraints, and sequences used to make the SEACOOS data commons work.
Tables
The DB is a collection of 50 tables for storing data, metadata, and database internals.
All tables.
The schema is organized around individual 'table-sets' for each variable. The existing schema includes table-sets for: Air Pressure, Air Temperature, Bottom Temperature, Currents (RS), Currents (IS), Salinity, SST, Water Level, Wave (IS), Wind.
All variable tables.
There are also supporting tables for organizing consistent station_ids, storing observations by station_id, time handling, and logging map views parameters (unused at UNC).
All support tables.
SST table-set:
Functions
The SEACOOS DB has some 350+ functions defined for it. Many are builtin geospatial functions from the PostGIS extension. Documented below are the functions relevant to in-situ observation aggregation, developed specifically for the SEACOOS DB. These functions do time conversions, top of hour determination, vector processing, value preformatting, and unit conversion. See the Functions page for a full list of all SEACOOS-developed functions.
The SST table-set relies on several functions:
sst_labels_and_times()
set_sst_prod_show()
Indexes
The DB has 150+ indexes defined for it. These speed up queries, updates, and prevent duplicate insertions.
sst_prod indexes:
| Name | Definition | Constraints |
| pk_id_time_stamp | CREATE UNIQUE INDEX pk_id_time_stamp ON sst_prod USING btree (station_id, time_stamp) | Primary key |
| sst_prod_gist | CREATE INDEX sst_prod_gist ON sst_prod USING gist (the_geom) | |
| sst_prod_id_z_rep_time | CREATE INDEX sst_prod_id_z_rep_time ON sst_prod USING btree (station_id, z, report_time_stamp) | |
| sst_prod_report_time_stamp | CREATE INDEX sst_prod_report_time_stamp ON sst_prod USING btree (report_time_stamp) | |
| sst_prod_seq | CREATE INDEX sst_prod_seq ON sst_prod USING btree (seq) | |
| sst_prod_station_id | CREATE INDEX sst_prod_station_id ON sst_prod USING btree (station_id) | |
| sst_prod_time_stamp | CREATE INDEX sst_prod_time_stamp ON sst_prod USING btree (time_stamp) | |
| sst_prod_top_of_hour | CREATE INDEX sst_prod_top_of_hour ON sst_prod USING btree (top_of_hour) |
sst_map table indexes:
| Name | Definition | Constraints |
| bottom_water_temp_map_report_time_stamp | CREATE INDEX bottom_water_temp_map_report_time_stamp ON sst_map USING btree (report_time_stamp) | |
| bottom_water_temp_map_report_time_stamp_z | CREATE INDEX bottom_water_temp_map_report_time_stamp_z ON sst_map USING btree (report_time_stamp, label_z) | |
| sst_map_report_time_stamp | CREATE INDEX sst_map_report_time_stamp ON sst_map USING btree (report_time_stamp) | |
| sst_map_report_time_stamp_z | CREATE INDEX sst_map_report_time_stamp_z ON sst_map USING btree (report_time_stamp, label_z) | |
| sst_map_seq_key | CREATE UNIQUE INDEX sst_map_seq_key ON sst_map USING btree (seq) | Unique key |
sst_changes table indexes:
| Name | Definition | Constraints |
| sst_changes_id | CREATE INDEX sst_changes_id ON sst_changes USING btree (station_id) | |
| sst_changes_report_time_stamp | CREATE INDEX sst_changes_report_time_stamp ON sst_changes USING btree (report_time_stamp) | |
| sst_changes_time_stamp | CREATE INDEX sst_changes_time_stamp ON sst_changes USING btree (time_stamp) |
sst_time_stamp_range table indexes:
| Name | Definition | Constraints |
| pk_time_stamp | CREATE UNIQUE INDEX pk_time_stamp ON sst_time_stamp_range USING btree (time_stamp) | Primary key |
Constraints
Each table in the DB has a number of constraints defined for fields within it. These do basic data verification and limit insertions of bad data or metadata.
sst_prod table constraints:
| Name | Definition |
| $1 | CHECK (srid(the_geom) = -1) |
| $2 | CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL) |
| pk_id_time_stamp | PRIMARY KEY (station_id, time_stamp) |
| positive_up_down | CHECK (positive::text = 'up'::character varying::text OR positive::text = 'down'::character varying::text OR positive::text = ::character varying::text) |
sst_map table constraints:
| Name | Definition |
| $1 | CHECK (srid(the_geom) = -1) |
| $2 | CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL) |
| sst_map_seq_key | UNIQUE (seq) |
Sequences
The DB had 20+ sequences defined on it (2 per variable table-set). These serve as unique row identifiers.
The SST table-set has a sequence for each of its two main tables:
| sst_map_seq_seq | value into sst_map table, seq column |
| sst_prod_seq | value into sst_prod table, seq column |
Aggregation Overview
Essentially the aggregation process grabs data from netCDF files via http, parses them, and inserts data into variable specific *_prod tables. Additional fields are added via triggers on INSERT/UPDATE to the _prod tables. Functions are then run to identify which observations are the closest to the top of each hour. Top of the hour observations are then inserted into the _map table for direct access by web services and web mapping applications. See the SST example
- Bash script to coordinate pieces (UNC specific, not sure about USC)
- Data scout downloads data files only if their creation date has changed since last scout run
- Scout parses each data file into individual station+variable files and station_id SQL INSERT statements
- Process Station files
- Cat station SQL files together into a single file
- Run psql against station INSERT SQL file: INSERTs into in_situ_station_id table
- Will not insert duplicate stations: unique index on station_id
- Process Variable files
- Cat variable SQL files together into a single file
- Run psql against variable INSERT SQL file: INSERTs into *_prod table(s)
- will not insert duplicate observations: unique index on station_id and time_stamp
- Trigger(s) initiate function(s) on INSERT/UPDATE to the *_prod tables
- SST example: Function: sst_labels_and_times as a trigger on sst_prod table
- eliminate data that are too old (> 2 days)
- rough range checking
- unit conversions
- round time_stamp (TS) to nearest hour -> report_time_stamp (RTS)
- populates value fields - like "15 deg C at 5m" string.
- station_ID, TS, and RTS are inserted into *_changes table
- all new fields are inserted into *_prod table
- Run the set_*_prod_show function(s): determines "top of hour" observations
- SST example: Function: set_sst_prod_show()
- for each unique station_id and RTS in *_changes table
- uses RTS and TS to determine a single observation to serve as hourly value for that station.
- sets top_of_hour field to 1 (yes) or null for each row in *_prod
- truncate (drop all data) *_changes table.
- Run the POPULATE script for each variable to populate *_map tables with top of hour values only
- SST example: Script: populate_sst_map.sql
- truncate (drop all data) *_map table.
- inserts values from *_prod into *_map where top_of_hour is 1.
- Run the populate_obs_by_station_id.sql to populate obs_by_station_id table
- obs_by_station_id table has the past 2 day data by station_id
Performance
We have been timing the Aggregation routine since its inception at UNC. One big caveat: the UNC DB instance only contains two weeks of data supporting latest visualizations and map layers.
Aggregation Runtime: since 11/16/2006 changeover to coriolis.marine.unc.edu
| Parse | Inserts | Set prod_show functions | Populate | Entire Routine | Hourly routines counted | |
| 3' | 21' | 4.8' | 11' | 39.3' | n = 1182 | Before Cluster, ReIndex done nightly |
| 3' | 18.5' | 5.2' | 9.7' | 35' | n = 12300 | After Cluster, ReIndex done nightly |
Nightly Maintenance
Initially we only deleted excess data from the DB nightly. Then we added a vacuum routine and then cluster + reindex procedures. This has made some improvements in both update performance and service performance. The biggest gain has been in the nightly routine itself which is now more thorough and much faster.
- Cluster largest tables
cluster wind_prod__time_stamp on wind_prod; cluster sst_prod__time_stamp on sst_prod; cluster water_level_prod__time_stamp on water_level_prod; cluster air_pressure_prod__time_stamp on air_pressure_prod; cluster air_temperature_prod__time_stamp on air_temperature_prod;
- Delete data older than 2 weeks (UNC specifc) in *_prod tables
- Reindex *_prod tables
- psql -U UID -d DB -h HOST -c "reindex table *_prod"
- Vacuum the entire DB
- vacuumdb -U UID -d DB -h HOST -f -v -z
- -f: full
- -v: verbose
- -z: analyze
Maintenance Runtime: since 11/16/2006 changeover to coriolis.marine.unc.edu
| Cluster, Delete, ReIndex | Vacuum | Overall Maintenance | Nightly Runs counted | |
| 18' | 248' | 266' | n = 62 | Before Cluster, ReIndex done nightly |
| 38' | 53' | 90' | n = 589 | After Cluster, ReIndex done nightly |
Archive
The primary archive responsibilities remain with the data providers as they are always the most familiar with their own data quality and processing needs. AT USC, as in-situ data becomes older than 2 weeks old, it is moved to a similarly structured but separate database for archival records. No archiving activities are onging at UNC-CH.
Attachments
- OBS_tables1.jpg (391.2 kB) -
Variable tables (p1)
, added by jcleary on 09/18/08 10:44:22. - OBS_tables2.jpg (477.6 kB) -
Variable tables (p2)
, added by jcleary on 09/18/08 10:44:34. - SEACOOS_tables.jpg (1.6 MB) -
All tables in the UNC SEACOOS schema
, added by jcleary on 09/18/08 11:04:01. - sst_tableset.jpg (47.9 kB) -
Typical table-set (SST)
, added by jcleary on 09/18/08 11:48:41. - SEACOOS_support.jpg (308.3 kB) -
Support tables
, added by jcleary on 09/18/08 14:55:22.
