Data Details

A summary of data files provided

(code) import libraries & functions

if (!requireNamespace("librarian", quietly = TRUE)) {
  # If not installed, install the package
  install.packages("librarian")
}

librarian::shelf(
  readr,
  skimr,
  glue,
  ggplot2,
  quarto
)
source("R/getData.R")

There are three relevant types of data file provided:

(code) preview the raw data

inspectData <- function(fpath){
  print(fpath)
  df <- readr::read_csv(fpath)
  print(head(df))
}

inspectData("data/station_sampling_periods_for_all_programs.csv")

[1] "data/station_sampling_periods_for_all_programs.csv"

Rows: 998 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Source, Site, Currently Sampling?, Notes
dbl (4): Lat, Long, Sample Start Year, Sample End Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 6 × 8
  Source Site    Lat  Long `Sample Start Year` `Sample End Year`
  <chr>  <chr> <dbl> <dbl>               <dbl>             <dbl>
1 DERM   AC01   25.9 -80.1                2015              2021
2 DERM   AC03   25.9 -80.2                2015              2021
3 DERM   AC06   25.9 -80.2                2015              2021
4 DERM   AR03   25.3 -80.4                2015              2021
5 DERM   BB02   25.9 -80.1                2015              2021
6 DERM   BB04   25.9 -80.1                2015              2021
# ℹ 2 more variables: `Currently Sampling?` <chr>, Notes <chr>

(code) preview the raw data

inspectData("data/Unified_WQ_Database(2023 updated).csv")

[1] "data/Unified_WQ_Database(2023 updated).csv"

New names:
• `` -> `...1`
Rows: 715257 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Source, Site, Longitude, Parameter, Value, Units
dbl (7): ...1, Latitude, Month, Day, Year, Sample Depth, Total Depth

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 6 × 13
   ...1 Source Site  Latitude Longitude  Month   Day  Year Parameter Value Units
  <dbl> <chr>  <chr>    <dbl> <chr>      <dbl> <dbl> <dbl> <chr>     <chr> <chr>
1     1 AOML   1         25.6 -80.12666…     1    29  1998 Chloroph… 1.00… ug/L 
2     2 AOML   2         25.6 -80.105        1    29  1998 Chloroph… 0.55… ug/L 
3     3 AOML   3         25.6 -80.08333…     1    29  1998 Chloroph… 0.69… ug/L 
4     4 AOML   4         25.1 -80.38         1    29  1998 Chloroph… 0.42… ug/L 
5     5 AOML   5         25.1 -80.35333…     1    29  1998 Chloroph… 0.66… ug/L 
6     6 AOML   6         25.1 -80.315        1    29  1998 Chloroph… 0.26… ug/L 
# ℹ 2 more variables: `Sample Depth` <dbl>, `Total Depth` <dbl>

Here is a table summarizing what I see in each file we were provided:

filename	geospatial info	dt info	sampling period info
`Merged_*.csv`	depth?+source+site	MM-YYYY	
`station_sampling_periods_for_all_programs.csv`	lat+lon+source+site		start+end year + still sampling
`Unified_WQ_Database_*.csv`	lat+lon+source+site	MM-YYYY	start+end + still sampling

So… in theory Unified_WQ_Database_*.csv contains everything we could need.
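As a rough illustration of that claim, the start year, end year, and a "still sampling" flag can in principle be derived from per-sample rows alone. The sketch below uses a small invented data frame in place of the real file, and the `still_sampling` rule (sampled in the latest year on record) is an assumption for illustration, not the definition used in `station_sampling_periods_for_all_programs.csv`.

```r
# Toy stand-in for the unified table; values are invented for illustration.
wq <- data.frame(
  Source = c("AOML", "AOML", "AOML", "DERM", "DERM"),
  Site   = c("1", "1", "2", "AC01", "AC01"),
  Year   = c(1998, 2003, 1998, 2015, 2021)
)

# Per Source + Site, take the first and last year with any sample.
periods <- aggregate(
  Year ~ Source + Site,
  data = wq,
  FUN = function(y) c(start = min(y), end = max(y))
)

# aggregate() returns the two values as one matrix column; flatten it.
periods <- do.call(data.frame, periods)
names(periods) <- c("Source", "Site", "start_year", "end_year")

# Hypothetical "still sampling" rule: sampled in the latest year on record.
periods$still_sampling <- periods$end_year == max(wq$Year)

print(periods)
```

Grouping on both `Source` and `Site` matters here because site names such as `1` or `2` are only meaningful within a program.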

Exploratory analyses

A skim view of the data as it loads in:

(code) skim the data

data <- readr::read_csv("data/Unified_WQ_Database(2023 updated).csv")

skimr::skim(data)

Data summary
Name	data
Number of rows	715257
Number of columns	13
_______________________
Column type frequency:
character	6
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Source	94	1.00	3	21	12
Site	0	1.00	1	28	2510
Longitude	19248	0.97	3	12	10580
Parameter	0	1.00	6	24	12
Value	138417	0.81	1	13	96308
Units	968	1.00	2	9	12

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
...1	0	1.00	357629.00	206477.06	1.00	178815.00	357629.00	536443.00	715257.00	▇▇▇▇▇
Latitude	19208	0.97	25.42	0.83	23.67	24.71	25.17	25.90	30.79	▇▇▂▁▁
Month	114	1.00	6.56	3.42	1.00	4.00	7.00	10.00	12.00	▇▅▆▅▇
Day	1614	1.00	14.12	8.23	1.00	7.00	13.00	20.00	31.00	▇▇▆▅▃
Year	114	1.00	2012.47	8.90	1995.00	2004.00	2015.00	2021.00	2024.00	▃▃▃▃▇
Sample Depth	22214	0.97	3.57	12.20	0.00	0.50	0.50	3.50	2494.00	▇▁▁▁▁
Total Depth	343781	0.52	9.01	9.49	0.00	3.00	6.00	11.18	120.57	▇▁▁▁▁

The Value column is read in as character when it should be numeric. We can convert it and view the rows that are lost because their values do not parse as numbers:

(code) view bad rows

data <- getData()  # getData() from R/getData.R reads the data and converts the Value column

na_rows <- is.na(data$Value) & !is.na(data$Value_orig)
na_introduced_rows <- data[na_rows, ]

print(na_introduced_rows)

# A tibble: 0 × 17
# ℹ 17 variables: ...1 <dbl>, Source <chr>, Site <chr>, Latitude <dbl>,
#   Longitude <dbl>, Month <dbl>, Day <dbl>, Year <dbl>, Parameter <chr>,
#   Value <dbl>, Units <chr>, Sample.Depth <dbl>, Total.Depth <dbl>,
#   verbatimValue <dbl>, VerbatimLatitude <dbl>, verbatimLongitude <dbl>,
#   Value_orig <dbl>

The unique problem values in the Value column:

(code) view bad Value values

print(unique(na_introduced_rows$Value_orig))

numeric(0)
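
For reference, here is a minimal sketch of how this kind of check behaves when the original text is kept as character. It does not reproduce `getData()`, whose contents are not shown here; the toy values are invented. Note that in the output above `Value_orig` is already typed `<dbl>`, which may be why no rows are flagged.

```r
# Toy stand-in for the raw Value column; entries are invented for illustration.
raw <- data.frame(
  Value = c("1.002", "0.55", "<0.01", "ND", NA, "3.5"),
  stringsAsFactors = FALSE
)

# Keep the text as read, then coerce. suppressWarnings() hides the
# "NAs introduced by coercion" warning that as.numeric() emits.
raw$Value_orig <- raw$Value
raw$Value <- suppressWarnings(as.numeric(raw$Value_orig))

# A row is "lost" if it had text before coercion but is NA afterwards.
# Rows that were already missing are not counted.
lost <- is.na(raw$Value) & !is.na(raw$Value_orig)

print(raw[lost, ])
print(unique(raw$Value_orig[lost]))
```

With the toy input, the flagged entries are the censored value `<0.01` and the code `ND`; the row that was missing from the start is left out.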