Data Model

This vignette explores the ws_monitor data model used throughout the PWFSLSmoke package to store and work with monitoring data.

The PWFSLSmoke package is designed to provide a compact, full-featured suite of utilities for working with PM 2.5 data used to monitor wildfire smoke. A uniform data model provides consistent data access across monitoring data available from different agencies. The core data model in this package is defined by the ws_monitor object used to store data associated with groups of individual monitors.

To work efficiently with the package it is important to understand the structure of this data object and which functions operate on it. Package functions that begin with monitor_ expect objects of class ws_monitor as their first argument. (‘ws_’ stands for ‘wildfire smoke’.)

Monitoring Data

Monitoring data will typically be obtained from an agency charged with archiving data acquired at monitoring sites. For wildfire smoke, the primary pollutant is PM 2.5, and the agencies and data providers archiving these data include EPA, AirNow, AIRSIS and WRCC.

The data model for monitoring data consists of an R list with two dataframes: data and meta.

The data dataframe contains all hourly measurements organized with rows (the ‘unlimited’ dimension) as unique timesteps and columns as unique monitor deployments. The very first column is always named datetime and contains the POSIXct datetime in Coordinated Universal Time (UTC).
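Because datetime is stored in UTC, timestamps usually need to be rendered in a site's local timezone before they are shown to people. The following is a minimal base R sketch, not package code; the timestamps and the choice of "America/Los_Angeles" are made up for illustration.

```r
# Hourly timestamps stored in UTC, as in the 'datetime' column
datetime <- seq(
  from = as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
  by = "hour",
  length.out = 4
)

# Render the same instants in a site's local Olson timezone.
# The underlying POSIXct values are unchanged; only the labels differ.
local_labels <- format(
  datetime,
  format = "%Y-%m-%d %H:%M",
  tz = "America/Los_Angeles",
  usetz = TRUE
)
print(local_labels)
```

Midnight UTC on August 1 falls on the evening of July 31 in Pacific time, which is why time limits and daily statistics need an explicit timezone.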

The meta dataframe contains all metadata associated with monitor deployment sites and is organized with rows as unique monitor deployments and columns as site attributes. The following columns are guaranteed to exist in the meta dataframe:

monitorID – unique ID for each site-instrument combination
longitude – decimal degrees East
latitude – decimal degrees North
elevation – meters above sea level
timezone – Olson timezone
countryCode – ISO 3166-1 alpha-2 code
stateCode – ISO 3166-2 alpha-2 code

(The MazamaSpatialUtils package is used to assign timezones and state and country codes.)
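To make the layout concrete, here is a minimal sketch that builds a toy object with the same shape by hand in base R. The monitor IDs, coordinates and PM 2.5 values are invented, and only the guaranteed meta columns are included; real ws_monitor objects should come from the package's own data loading functions.

```r
# Metadata: one row per monitor deployment, one column per site attribute
meta <- data.frame(
  monitorID   = c("siteA_01", "siteB_01"),  # hypothetical IDs
  longitude   = c(-120.12, -120.18),
  latitude    = c(48.36, 48.48),
  elevation   = c(500, 540),
  timezone    = "America/Los_Angeles",
  countryCode = "US",
  stateCode   = "WA",
  stringsAsFactors = FALSE
)
rownames(meta) <- meta$monitorID

# Data: one row per hourly timestep (UTC), one column per deployment
data <- data.frame(
  datetime = seq(as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 3),
  siteA_01 = c(5.1, 6.3, NA),
  siteB_01 = c(4.8, 7.0, 9.2)
)

# The two dataframes are bundled in a list carrying the 'ws_monitor' class
toy <- structure(list(meta = meta, data = data),
                 class = c("ws_monitor", "list"))
str(toy, max.level = 1)
```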

Starting with version 1.0 of the package, the following additional columns (mostly for internal use) will always exist:

siteName – familiar name for a monitoring site
countyName – county/province name
msaName – US Census Bureau ‘Metropolitan/Micropolitan Statistical Area’
agencyName – agency responsible for collecting the data
monitorType – broad instrument categories for E-Sampler, EBAM or BAM-1020
siteID – unique identifier for each site
instrumentID – sequential identifier for each instrument at a single site
aqsID – AQS site identifier (often used as the siteID)
pwfslID – PWFSL site identifier (used as the siteID for temporary monitors)
pwfslDataIngestSource – identifier for the source of monitoring data (e.g. AIRNOW, AIRSIS, WRCC_DUMPFILE, etc.)
telemetryAggregator – data provider for temporary monitors (e.g. ‘wrcc’ or ‘usfs.airsis’)
telemetryUnitID – unique ID for each monitoring site used within the telemetryAggregator

These additional columns of information are much more variable and, depending on the source of data, may include many missing values.

It is important to note that the monitorID acts as a unique key that connects data with metadata. The monitorID is used for column names in the data dataframe and for row names in the meta dataframe. So the following will always be true:

rownames(ws_monitor$meta) == ws_monitor$meta$monitorID
colnames(ws_monitor$data) == c('datetime', ws_monitor$meta$monitorID)
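When manipulating objects by hand, it can help to check this key relationship explicitly. The helper below is a hypothetical sketch, not a function from the package; check_monitor_keys() and the toy object are assumptions made for illustration.

```r
# Hypothetical helper: stops with an error if the monitorID key is inconsistent
check_monitor_keys <- function(ws_monitor) {
  ids <- ws_monitor$meta$monitorID
  stopifnot(
    anyDuplicated(ids) == 0,
    identical(rownames(ws_monitor$meta), ids),
    identical(colnames(ws_monitor$data), c("datetime", ids))
  )
  invisible(TRUE)
}

# A toy object with made-up IDs and values
meta <- data.frame(monitorID = c("siteA_01", "siteB_01"),
                   stateCode = "WA",
                   stringsAsFactors = FALSE)
rownames(meta) <- meta$monitorID
data <- data.frame(
  datetime = seq(as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 3),
  siteA_01 = c(5.1, 6.3, NA),
  siteB_01 = c(4.8, 7.0, 9.2)
)
toy <- structure(list(meta = meta, data = data),
                 class = c("ws_monitor", "list"))

check_monitor_keys(toy)  # passes silently

# Reordering data columns without reordering meta breaks the relationship
broken <- toy
broken$data <- broken$data[, c("datetime", "siteB_01", "siteA_01")]
result <- tryCatch(check_monitor_keys(broken), error = function(e) FALSE)
```

Using identical() rather than == gives a single TRUE or FALSE and also catches a mismatch in length.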

Example 1: Exploring ws_monitor Objects

We will use the built-in “Northwest_Megafires” dataset and the monitor_subset() function to subset a ws_monitor object which we can then explore.

suppressPackageStartupMessages(library(PWFSLSmoke))

# Get some AirNow data for Washington state in August 2015
# NOTE:  'tlim' is interpreted as UTC unless we specify 'timezone'
N_M <- monitor_subset(Northwest_Megafires,
                      tlim = c(20150801, 20150831),
                      timezone = "America/Los_Angeles")
WA <- monitor_subset(N_M, stateCodes = 'WA')

# 'ws_monitor' objects can be identified by their class
class(WA)

## [1] "ws_monitor" "list"

# Examine the 'meta' dataframe
dim(WA$meta)

## [1] 55 19

rownames(WA$meta)

##  [1] "530330017_01" "530330080_01" "530050002_01" "530330024_01" "530330057_01"
##  [6] "530332004_01" "530530029_01" "530530031_01" "530610005_01" "530611007_01"
## [11] "530630047_01" "530670013_01" "530531018_01" "530272002_01" "530310003_01"
## [16] "530730015_01" "530251002_01" "530650004_01" "530010003_01" "530750006_01"
## [21] "530750003_01" "530331011_01" "530210002_01" "530330037_01" "530710005_01"
## [26] "530750005_01" "530150015_01" "530470009_01" "530370002_01" "530090013_01"
## [31] "530610020_01" "530070010_01" "530770015_01" "530650002_01" "530470010_01"
## [36] "530770009_01" "530570015_01" "530130002_01" "530030004_01" "530110022_01"
## [41] "530579999_01" "530639997_01" "530299999_01" "530639996_01" "530410004_01"
## [46] "530770016_01" "530090015_01" "530450007_01" "530470013_01" "530570011_01"
## [51] "530350007_01" "530070011_01" "530330030_01" "530110024_01" "530090017_01"

colnames(WA$meta)

##  [1] "monitorID"             "longitude"             "latitude"             
##  [4] "elevation"             "timezone"              "countryCode"          
##  [7] "stateCode"             "siteName"              "agencyName"           
## [10] "countyName"            "msaName"               "monitorType"          
## [13] "siteID"                "instrumentID"          "aqsID"                
## [16] "pwfslID"               "pwfslDataIngestSource" "telemetryAggregator"  
## [19] "telemetryUnitID"

# Examine the 'data' dataframe
dim(WA$data)

## [1] 721  56

colnames(WA$data)

##  [1] "datetime"     "530330017_01" "530330080_01" "530050002_01" "530330024_01"
##  [6] "530330057_01" "530332004_01" "530530029_01" "530530031_01" "530610005_01"
## [11] "530611007_01" "530630047_01" "530670013_01" "530531018_01" "530272002_01"
## [16] "530310003_01" "530730015_01" "530251002_01" "530650004_01" "530010003_01"
## [21] "530750006_01" "530750003_01" "530331011_01" "530210002_01" "530330037_01"
## [26] "530710005_01" "530750005_01" "530150015_01" "530470009_01" "530370002_01"
## [31] "530090013_01" "530610020_01" "530070010_01" "530770015_01" "530650002_01"
## [36] "530470010_01" "530770009_01" "530570015_01" "530130002_01" "530030004_01"
## [41] "530110022_01" "530579999_01" "530639997_01" "530299999_01" "530639996_01"
## [46] "530410004_01" "530770016_01" "530090015_01" "530450007_01" "530470013_01"
## [51] "530570011_01" "530350007_01" "530070011_01" "530330030_01" "530110024_01"
## [56] "530090017_01"

# This should always be true
all(rownames(WA$meta) == colnames(WA$data[,-1]))

## [1] TRUE

Example 2: Manipulating ws_monitor Objects

The PWFSLSmoke package has numerous functions that can work with ws_monitor objects, all of which begin with monitor_. If you need to do something that the package functions do not provide, you can manipulate ws_monitor objects directly as long as you retain the structure of the data model.
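As an illustration of what retaining the structure means, the sketch below subsets a toy object by hand while keeping meta rows and data columns in step. In practice monitor_subset() already does this; keep_monitors() and the toy values here are hypothetical.

```r
# Hypothetical helper: keep only the requested monitors, in the requested order
keep_monitors <- function(ws_monitor, ids) {
  # Rows of 'meta' and columns of 'data' must be selected together
  ws_monitor$meta <- ws_monitor$meta[ids, , drop = FALSE]
  ws_monitor$data <- ws_monitor$data[, c("datetime", ids), drop = FALSE]
  ws_monitor
}

# A toy object with made-up IDs and values
meta <- data.frame(monitorID = c("siteA_01", "siteB_01"),
                   stateCode = "WA",
                   stringsAsFactors = FALSE)
rownames(meta) <- meta$monitorID
data <- data.frame(
  datetime = seq(as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 3),
  siteA_01 = c(5.1, 6.3, NA),
  siteB_01 = c(4.8, 7.0, 9.2)
)
toy <- structure(list(meta = meta, data = data),
                 class = c("ws_monitor", "list"))

sub <- keep_monitors(toy, "siteB_01")
colnames(sub$data)
rownames(sub$meta)
```

The drop = FALSE arguments matter: without them, selecting a single monitor could turn a dataframe into a plain vector and break the data model.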

Functions that accept and return ws_monitor objects include:

monitor_aqi()
monitor_collapse()
monitor_dailyStatistic()
monitor_dailyThreshold()
monitor_join()
monitor_nowcast()
monitor_reorder()
monitor_replaceData()
monitor_rollingMean()
monitor_scaleData()
monitor_subset()
monitor_subsetBy()
monitor_subsetByDistance()
monitor_trim()

Because each of these functions takes a ws_monitor object as its first argument and returns one, they can be chained with the %>% pipe from the magrittr package, as in the following example:

# Calculate daily means for the Methow Valley from monitors in Twisp and Winthrop
TwispID <- '530470009_01'
WinthropID <- '530470010_01'
Methow_Valley_JulyMeans <- 
  Northwest_Megafires %>%
  monitor_subset(monitorIDs = c(TwispID, WinthropID)) %>%
  monitor_collapse(monitorID = 'MethowValley') %>%
  monitor_subset(tlim = c(20150701, 20150731), timezone = 'America/Los_Angeles') %>%
  monitor_dailyStatistic(minHours = 18)
# Look at the first week
Methow_Valley_JulyMeans$data[1:7,]

##     datetime MethowValley
## 1 2015-07-01      5.06875
## 2 2015-07-02      5.45625
## 3 2015-07-03      6.64375
## 4 2015-07-04     10.60625
## 5 2015-07-05     10.33750
## 6 2015-07-06     13.93750
## 7 2015-07-07     30.68542
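For intuition about the collapse and daily-statistic steps, here is a rough base R sketch of the idea on made-up numbers. It assumes that collapsing averages across monitors at each hour, that days are defined in local time, and that minHours is the minimum number of valid hours a day needs; it is not the package's implementation.

```r
# 48 hourly timestamps in UTC, starting at local midnight (PDT) on 2015-07-01
datetime <- seq(as.POSIXct("2015-07-01 07:00:00", tz = "UTC"),
                by = "hour", length.out = 48)

# Two hypothetical monitors with constant values on each local day
a <- rep(c(4, 8), each = 24)
b <- rep(c(6, 12), each = 24)

# Knock out 10 hours on the second day at both monitors
a[25:34] <- NA
b[25:34] <- NA

# "Collapse": hourly mean across monitors, ignoring missing values
collapsed <- rowMeans(cbind(a, b), na.rm = TRUE)

# Assign each hour to its local calendar day
local_day <- format(datetime, format = "%Y-%m-%d", tz = "America/Los_Angeles")

# Daily mean, set to NA where fewer than minHours valid hours exist
minHours <- 18
valid_hours <- tapply(!is.na(collapsed), local_day, sum)
daily_mean <- tapply(collapsed, local_day, mean, na.rm = TRUE)
daily_mean[valid_hours < minHours] <- NA
print(daily_mean)
```

In this toy case the first day has 24 valid hours and a mean of 5, while the second day has only 14 valid hours and is therefore reported as NA.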

The following code mixes use of package functions with direct manipulation of the ws_monitor object.

# Use special knowledge of AirNow IDs to subset AirNow data for Spokane County monitors
# (these monitorIDs begin with the state + county FIPS code; 53063 is Spokane County, WA)
SpokaneCountyIDs <- N_M$meta$monitorID[stringr::str_detect(N_M$meta$monitorID, "^53063")]
Spokane <- monitor_subset(N_M, monitorIDs = SpokaneCountyIDs)

# Apply 3-hr rolling mean
Spokane_3hr <- monitor_rollingMean(Spokane, 3, align = "center")

# 1) Replace data columns with their squares (exponentiation is not supplied by the package)
Spokane_3hr_squared <- Spokane_3hr
Spokane_3hr_squared$data[,-1] <- (Spokane_3hr$data[,-1])^2 # exclude the 'datetime' column

# NOTE:  Exponentiation is only used as an example. It does not generate a meaningful result.

# Create a daily averaged 'ws_monitor' object
Spokane_daily_3hr <- monitor_dailyStatistic(Spokane_3hr)

# 2) Check out the correlation between monitors (correlation is not supplied by the package)
data <- Spokane_daily_3hr$data[,-1] # exclude the 'datetime' column
cor(data, use = 'complete.obs')

##              530630047_01 530639997_01 530639996_01
## 530630047_01    1.0000000    0.9148673    0.9159997
## 530639997_01    0.9148673    1.0000000    0.9284175
## 530639996_01    0.9159997    0.9284175    1.0000000

This introduction to the ws_monitor data model should be enough to get you started. Further details and examples are available in the package documentation.

Best of luck exploring and understanding PM 2.5 values associated with wildfire smoke!


Mazama Science

2020-02-12
