if (!requireNamespace("librarian", quietly = TRUE)) {
  # If not installed, install the package
  install.packages("librarian")
}
librarian::shelf(readr, skimr, glue, ggplot2, quarto)
source("R/getData.R")
There are three relevant types of data file provided:
Rows: 998 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Source, Site, Currently Sampling?, Notes
dbl (4): Lat, Long, Sample Start Year, Sample End Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 715257 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Source, Site, Longitude, Parameter, Value, Units
dbl (7): ...1, Latitude, Month, Day, Year, Sample Depth, Total Depth
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here is a summary table of what I see in each file we were provided:
| filename | geospatial info | dt info |
|---|---|---|
| Merged_*.csv | depth?+source+site | MM-YYYY |
| station_sampling_periods_for_all_programs.csv | lat+lon+source+site | start+end year + still sampling |
| Unified_WQ_Database_*.csv | lat+lon+source+site | MM-YYYY; start+end + still sampling |
So… in theory Unified_WQ_Database_*.csv contains everything we could need.
Exploratory analyses
A skim view of the data as it loads in:
(code) skim the data
data <- readr::read_csv("data/Unified_WQ_Database(2023 updated).csv")
skimr::skim(data)
Data summary
Name: data
Number of rows: 715257
Number of columns: 13
Column type frequency:
  character: 6
  numeric: 7
Group variables: None
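Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Source | 94 | 1.00 | 3 | 21 | 0 | 12 | 0 |
| Site | 0 | 1.00 | 1 | 28 | 0 | 2510 | 0 |
| Longitude | 19248 | 0.97 | 3 | 12 | 0 | 10580 | 0 |
| Parameter | 0 | 1.00 | 6 | 24 | 0 | 12 | 0 |
| Value | 138417 | 0.81 | 1 | 13 | 0 | 96308 | 0 |
| Units | 968 | 1.00 | 2 | 9 | 0 | 12 | 0 |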
Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ...1 | 0 | 1.00 | 357629.00 | 206477.06 | 1.00 | 178815.00 | 357629.00 | 536443.00 | 715257.00 | ▇▇▇▇▇ |
| Latitude | 19208 | 0.97 | 25.42 | 0.83 | 23.67 | 24.71 | 25.17 | 25.90 | 30.79 | ▇▇▂▁▁ |
| Month | 114 | 1.00 | 6.56 | 3.42 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 | ▇▅▆▅▇ |
| Day | 1614 | 1.00 | 14.12 | 8.23 | 1.00 | 7.00 | 13.00 | 20.00 | 31.00 | ▇▇▆▅▃ |
| Year | 114 | 1.00 | 2012.47 | 8.90 | 1995.00 | 2004.00 | 2015.00 | 2021.00 | 2024.00 | ▃▃▃▃▇ |
| Sample Depth | 22214 | 0.97 | 3.57 | 12.20 | 0.00 | 0.50 | 0.50 | 3.50 | 2494.00 | ▇▁▁▁▁ |
| Total Depth | 343781 | 0.52 | 9.01 | 9.49 | 0.00 | 3.00 | 6.00 | 11.18 | 120.57 | ▇▁▁▁▁ |
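For readers unfamiliar with the skim columns, here is a minimal base-R sketch of how n_missing and complete_rate relate to each other. The toy data frame and its values are invented for illustration; this is not how skimr computes them internally, only an equivalent calculation.

```r
# Toy data frame; the column names and values are invented for illustration.
toy <- data.frame(
  Site = c("A", "B", NA, "D"),
  Depth = c(0.5, NA, NA, 3.5),
  stringsAsFactors = FALSE
)

# n_missing: count of NA values in each column.
n_missing <- vapply(toy, function(col) sum(is.na(col)), integer(1))

# complete_rate: share of non-missing values in each column.
complete_rate <- 1 - n_missing / nrow(toy)

print(data.frame(n_missing, complete_rate))
```

Read this way, the complete_rate of 0.52 for Total Depth means roughly half of the rows have no total depth recorded.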
The Value column is being read as character when it should be numeric. We can convert it and view the rows that are lost because their values do not parse as numbers:
(code) view bad rows
data <- getData() # the getData in getData.R reads and converts the Value col
na_rows <- is.na(data$Value) & !is.na(data$Value_orig)
na_introduced_rows <- data[na_rows, ]
print(na_introduced_rows)
# A tibble: 0 × 17
# ℹ 17 variables: ...1 <dbl>, Source <chr>, Site <chr>, Latitude <dbl>,
# Longitude <dbl>, Month <dbl>, Day <dbl>, Year <dbl>, Parameter <chr>,
# Value <dbl>, Units <chr>, Sample.Depth <dbl>, Total.Depth <dbl>,
# verbatimValue <dbl>, VerbatimLatitude <dbl>, verbatimLongitude <dbl>,
# Value_orig <dbl>
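The convert-and-compare check can be tried on a small example. This is a minimal sketch in base R, assuming the original strings are kept in a character column before conversion; the toy values are invented, and the internals of getData() are not shown here, so this may differ from what it actually does.

```r
# Toy stand-in for the Value column; these strings are invented for illustration.
toy <- data.frame(
  Value_orig = c("1.5", "<0.01", "ND", NA, "7"),
  stringsAsFactors = FALSE
)

# as.numeric() yields NA (with a warning) for strings that do not parse as numbers.
toy$Value <- suppressWarnings(as.numeric(toy$Value_orig))

# Rows where conversion introduced an NA: missing after, but present before.
na_rows <- is.na(toy$Value) & !is.na(toy$Value_orig)
na_introduced_rows <- toy[na_rows, , drop = FALSE]
print(na_introduced_rows)
```

The comparison only works if the "before" column still holds the unconverted text. In the output above, Value_orig is listed as <dbl>, so it may be worth checking whether it was stored before or after conversion; if after, zero flagged rows would be expected regardless of how many values failed to parse.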