Data Details

A summary of data files provided
(code) import libraries & functions
if (!requireNamespace("librarian", quietly = TRUE)) {
  # If not installed, install the package
  install.packages("librarian")
}

librarian::shelf(
  readr,
  skimr,
  glue,
  ggplot2,
  quarto
)
source("R/getData.R")

There are three relevant types of data files provided:

(code) preview the raw data
# Print the file path, read the CSV, and show its first rows
inspectData <- function(fpath){
  print(fpath)
  df <- readr::read_csv(fpath)
  print(head(df))
}

inspectData("data/station_sampling_periods_for_all_programs.csv")
[1] "data/station_sampling_periods_for_all_programs.csv"
Rows: 998 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Source, Site, Currently Sampling?, Notes
dbl (4): Lat, Long, Sample Start Year, Sample End Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 8
  Source Site    Lat  Long `Sample Start Year` `Sample End Year`
  <chr>  <chr> <dbl> <dbl>               <dbl>             <dbl>
1 DERM   AC01   25.9 -80.1                2015              2021
2 DERM   AC03   25.9 -80.2                2015              2021
3 DERM   AC06   25.9 -80.2                2015              2021
4 DERM   AR03   25.3 -80.4                2015              2021
5 DERM   BB02   25.9 -80.1                2015              2021
6 DERM   BB04   25.9 -80.1                2015              2021
# ℹ 2 more variables: `Currently Sampling?` <chr>, Notes <chr>
(code) preview the raw data
inspectData("data/Unified_WQ_Database(2023 updated).csv")
[1] "data/Unified_WQ_Database(2023 updated).csv"
New names:
• `` -> `...1`
Rows: 715257 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Source, Site, Longitude, Parameter, Value, Units
dbl (7): ...1, Latitude, Month, Day, Year, Sample Depth, Total Depth

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 13
   ...1 Source Site  Latitude Longitude  Month   Day  Year Parameter Value Units
  <dbl> <chr>  <chr>    <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <chr> <chr>
1     1 AOML   1         25.6 -80.12666…     1    29  1998 Chloroph… 1.00… ug/L 
2     2 AOML   2         25.6 -80.105        1    29  1998 Chloroph… 0.55… ug/L 
3     3 AOML   3         25.6 -80.08333…     1    29  1998 Chloroph… 0.69… ug/L 
4     4 AOML   4         25.1 -80.38         1    29  1998 Chloroph… 0.42… ug/L 
5     5 AOML   5         25.1 -80.35333…     1    29  1998 Chloroph… 0.66… ug/L 
6     6 AOML   6         25.1 -80.315        1    29  1998 Chloroph… 0.26… ug/L 
# ℹ 2 more variables: `Sample Depth` <dbl>, `Total Depth` <dbl>

Here is a summary table of what I see in each of the files we were provided:

| filename | geospatial info | dt info |
|---|---|---|
| Merged_*.csv | depth?+source+site | MM-YYYY |
| station_sampling_periods_for_all_programs.csv | lat+lon+source+site | start+end year + still sampling |
| Unified_WQ_Database_*.csv | lat+lon+source+site | MM-YYYY start+end + still sampling |

So… in theory Unified_WQ_Database_*.csv contains everything we could need.
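
One way to test that claim is to confirm that every Source and Site pair in the station file also occurs in the unified database. The sketch below uses small made-up data frames in place of the two files; only the Source and Site column names come from the previews above, and everything else is an assumption for illustration.

```r
# Made-up stand-ins for the two files (the real ones would come from read_csv)
stations <- data.frame(
  Source = c("DERM", "DERM", "AOML"),
  Site   = c("AC01", "AC03", "1"),
  stringsAsFactors = FALSE
)
unified <- data.frame(
  Source = c("DERM", "AOML", "AOML"),
  Site   = c("AC01", "1", "2"),
  stringsAsFactors = FALSE
)

# Build a combined key so that site names are only compared within a source
station_keys <- paste(stations$Source, stations$Site, sep = "|")
unified_keys <- unique(paste(unified$Source, unified$Site, sep = "|"))

# Source/site pairs listed in the station file but absent from the database
missing_keys <- setdiff(station_keys, unified_keys)
print(missing_keys)
```

An empty result on the real files would support relying on the unified database alone; any pairs that do show up would need to be looked at by hand.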

Exploratory analyses

A skim view of the data as it loads in:

(code) skim the data
data <- readr::read_csv("data/Unified_WQ_Database(2023 updated).csv")

skimr::skim(data)
Data summary
Name data
Number of rows 715257
Number of columns 13
_______________________
Column type frequency:
character 6
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Source 94 1.00 3 21 0 12 0
Site 0 1.00 1 28 0 2510 0
Longitude 19248 0.97 3 12 0 10580 0
Parameter 0 1.00 6 24 0 12 0
Value 138417 0.81 1 13 0 96308 0
Units 968 1.00 2 9 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
...1 0 1.00 357629.00 206477.06 1.00 178815.00 357629.00 536443.00 715257.00 ▇▇▇▇▇
Latitude 19208 0.97 25.42 0.83 23.67 24.71 25.17 25.90 30.79 ▇▇▂▁▁
Month 114 1.00 6.56 3.42 1.00 4.00 7.00 10.00 12.00 ▇▅▆▅▇
Day 1614 1.00 14.12 8.23 1.00 7.00 13.00 20.00 31.00 ▇▇▆▅▃
Year 114 1.00 2012.47 8.90 1995.00 2004.00 2015.00 2021.00 2024.00 ▃▃▃▃▇
Sample Depth 22214 0.97 3.57 12.20 0.00 0.50 0.50 3.50 2494.00 ▇▁▁▁▁
Total Depth 343781 0.52 9.01 9.49 0.00 3.00 6.00 11.18 120.57 ▇▁▁▁▁

The Value column is being read in as character when it should be numeric. We can convert it and view the rows that are lost because their values are non-numeric:

(code) view bad rows
data <- getData()  # the getData in getData.R reads and converts the Value col 
New names:
• `` -> `...1`
Warning in getData(): NAs introduced by coercion
(code) view bad rows
na_rows <- is.na(data$Value) & !is.na(data$Value_orig)
na_introduced_rows <- data[na_rows, ]

print(na_introduced_rows)
# A tibble: 4,989 × 17
    ...1 Source Site    Latitude Longitude Month  Year Parameter     Value Units
   <dbl> <chr>  <chr>   <chr>    <chr>     <dbl> <dbl> <chr>         <dbl> <chr>
 1 10172 AOML   AMI4    27.38060 -83.03420     8  2022 Chlorophyll a    NA ug/L 
 2 10178 AOML   V3      27.03630 -82.61340     8  2022 Chlorophyll a    NA ug/L 
 3 60251 FIU    Sta 351 24.69170 -81.79170     1  2011 Chlorophyll a    NA ug/L 
 4 60252 FIU    Sta 351 24.69170 -81.79170     4  2011 Chlorophyll a    NA ug/L 
 5 60253 FIU    Sta 351 24.69170 -81.79170     7  2011 Chlorophyll a    NA ug/L 
 6 60309 FIU    Sta 352 24.77580 -81.78300     1  2011 Chlorophyll a    NA ug/L 
 7 60310 FIU    Sta 352 24.77580 -81.78300     4  2011 Chlorophyll a    NA ug/L 
 8 60311 FIU    Sta 352 24.77580 -81.78300     7  2011 Chlorophyll a    NA ug/L 
 9 60367 FIU    Sta 353 24.85830 -81.77670     1  2011 Chlorophyll a    NA ug/L 
10 60368 FIU    Sta 353 24.85830 -81.77670     4  2011 Chlorophyll a    NA ug/L 
# ℹ 4,979 more rows
# ℹ 7 more variables: `Total Depth` <dbl>, `Sample Depth` <dbl>,
#   `Trend Analysis` <chr>, Continuous <chr>, `Start Date` <dbl>,
#   `End Date` <chr>, Value_orig <chr>
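
R/getData.R itself is not reproduced on this page. As a rough sketch of the kind of conversion described above, and assuming the original strings are kept in a Value_orig column (the name that appears in the output), a base-R version could look like the following. The helper name convertValue and the toy data frame are hypothetical and are not taken from getData.R.

```r
# Hypothetical helper: convert a character Value column to numeric while
# keeping the original strings so that lost entries can be inspected later.
convertValue <- function(df) {
  df$Value_orig <- df$Value
  # as.numeric() turns non-numeric strings into NA (with a warning, which is
  # suppressed here because the NAs are examined explicitly below)
  df$Value <- suppressWarnings(as.numeric(df$Value_orig))
  df
}

# Toy data standing in for the real file
toy <- data.frame(
  Site  = c("A", "B", "C", "D"),
  Value = c("1.25", "ND", NA, "Spilled"),
  stringsAsFactors = FALSE
)

converted <- convertValue(toy)

# Rows that held non-missing text but became NA after conversion
lost <- is.na(converted$Value) & !is.na(converted$Value_orig)
print(converted[lost, ])
```

Keeping the original column alongside the converted one is what makes it possible to separate values that were already missing from values that were lost in the conversion.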

The unique problem values in the “Value” column:

(code) view bad Value values
print(unique(na_introduced_rows$Value_orig))
[1] "Spilled"       "TRUE"          "FALSE"         "*Non-detect"  
[5] "*Not Reported" "ND"
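
To judge how much each placeholder matters, one option is to count the lost rows per placeholder and per source. This is a minimal sketch using base R and made-up rows; the counts are illustrative only and are not taken from the real file.

```r
# Made-up stand-in for na_introduced_rows; only the columns needed here
toy_lost <- data.frame(
  Source     = c("AOML", "AOML", "FIU", "FIU", "FIU"),
  Value_orig = c("Spilled", "ND", "*Non-detect", "*Non-detect", "*Not Reported"),
  stringsAsFactors = FALSE
)

# How many rows each placeholder accounts for, most frequent first
value_counts <- sort(table(toy_lost$Value_orig), decreasing = TRUE)
print(value_counts)

# Cross-tabulate data source against placeholder
by_source <- table(toy_lost$Source, toy_lost$Value_orig)
print(by_source)
```

Run against the real na_introduced_rows, the same two tables would show whether the non-numeric entries are concentrated in a few sources or spread across the database.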