Tidy Wide Format

The allData.csv file is in “long” format, with “analyte name” as a column. This document transforms the data to “wide” “tidy” format, with “analyte name” on the columns. This is done by pivoting on the analyte name column.

load libraries
librarian::shelf(
  dplyr,
  tidyr,
  here,
  DT
)
load allData.csv
library("here")
df_long <- read.csv(here("data/exports/allData.csv"))

# Show dimensions of long format data
cat("Long format dimensions:\n")
Long format dimensions:
load allData.csv
cat("  Rows:", nrow(df_long), "\n")
  Rows: 1265123 
load allData.csv
cat("  Columns:", ncol(df_long), "\n")
  Columns: 10 

Transform to Wide Format

Pre-Pivot Collision Check

Before pivoting we assess how many collisions will occur.

check for ID collisions
df_long %>%
  group_by(source, site, datetime, sample_depth, analyte) %>%
  summarise(n = n(), .groups = "drop") %>%
  filter(n > 1)
# A tibble: 249,180 × 6
   source    site  datetime   sample_depth analyte                      n
   <chr>     <chr> <chr>             <dbl> <chr>                    <int>
 1 AOML_FBBB 1     1998-01-20           NA Salinity                     2
 2 AOML_FBBB 1     1998-01-20           NA Water Temperature            2
 3 AOML_FBBB 1     1998-03-10           NA Salinity                     2
 4 AOML_FBBB 1     1998-03-10           NA Water Temperature            2
 5 AOML_FBBB 1     1998-06-30           NA Salinity                     2
 6 AOML_FBBB 1     1998-06-30           NA Water Temperature            2
 7 AOML_FBBB 14    1998-01-21           NA Salinity                     2
 8 AOML_FBBB 14    1998-01-21           NA Water Temperature            2
 9 AOML_FBBB 16    1998-12-21           NA Ammonium, Filtered (NH4)     2
10 AOML_FBBB 16    1998-12-21           NA NO2+3, Filtered              2
# ℹ 249,170 more rows

These rows will collide with n other rows (see column on far right). An example cause of this is samples that were taken at the same time on the same day. Collisions in this pivot are averaged with a mean.

Pivot

The transformation uses tidyr::pivot_wider() to convert the dataset from long format (where each row is one measurement of one analyte) to wide format (where each row represents all measurements at a specific location and time).

pivot from long to wide format
library(dplyr)
library(tidyr)

# Pivot to wide format using the actual column names from allData.csv
# Columns: source, site, datetime, analyte, value, units, latitude, longitude, sample_depth
df_wide <- df_long %>%
  pivot_wider(
    id_cols = c(source, site, datetime, units, latitude, longitude, sample_depth),
    names_from = analyte,
    values_from = value,
    values_fn = mean  # aggregate duplicates by taking mean
  )

# Show dimensions of wide format data
cat("Wide format dimensions:\n")
Wide format dimensions:
pivot from long to wide format
cat("  Rows:", nrow(df_wide), "\n")
  Rows: 354239 
pivot from long to wide format
cat("  Columns:", ncol(df_wide), "\n")
  Columns: 27 
pivot from long to wide format
cat("\nNumber of analyte columns created:", ncol(df_wide) - 7, "\n")

Number of analyte columns created: 20 

Preview Wide Format Data

preview the first few rows of wide format data
# Display first 100 rows with scrolling
DT::datatable(
  head(df_wide, 100),
  options = list(
    scrollX = TRUE,
    scrollY = "400px",
    pageLength = 10
  ),
  caption = "First 100 rows of wide format data (scroll to see all columns)"
)
summary statistics of wide format data
library(skimr)

# Skim the numeric columns (analyte measurements)
numeric_cols <- sapply(df_wide, is.numeric)
if (sum(numeric_cols) > 0) {
  cat("Summary of numeric analyte columns:\n\n")
  skim(df_wide[, numeric_cols])
} else {
  cat("No numeric columns found in wide format data.\n")
}
Summary of numeric analyte columns:
Data summary
Name df_wide[, numeric_cols]
Number of rows 354239
Number of columns 23
_______________________
Column type frequency:
numeric 23
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latitude 157882 0.55 25.63 0.51 24.40 25.29 25.74 25.88 28.78 ▃▇▁▁▁
longitude 157882 0.55 -80.49 0.52 -85.02 -80.62 -80.28 -80.16 -80.02 ▁▁▁▁▇
sample_depth 198581 0.44 0.44 0.20 0.00 0.50 0.50 0.50 1.00 ▃▁▇▂▁
Water Temperature 263170 0.26 19.55 11.63 0.00 2.10 24.60 28.45 74.65 ▃▇▂▁▁
Salinity 266376 0.25 18.52 16.10 0.00 1.16 22.72 34.57 933.00 ▇▁▁▁▁
Dissolved Oxygen 276439 0.22 4.45 11.33 0.00 1.51 5.05 6.45 2697.44 ▇▁▁▁▁
Ammonium, Filtered (NH4) 319411 0.10 0.53 1.86 0.00 0.01 0.02 0.25 76.39 ▇▁▁▁▁
Nitrite (NO2) 307591 0.13 0.03 0.14 0.00 0.00 0.00 0.01 14.10 ▇▁▁▁▁
Nitrate (NO3) 322088 0.09 0.35 3.51 -0.01 0.00 0.01 0.04 216.06 ▇▁▁▁▁
NO2+3, Filtered 292945 0.17 0.27 2.63 0.00 0.00 0.01 0.06 218.00 ▇▁▁▁▁
Phosphate, Filtered (PO4) 311180 0.12 0.15 0.28 0.00 0.00 0.01 0.10 2.60 ▇▁▁▁▁
pH 290791 0.18 4.05 4.67 0.00 1.00 1.78 7.68 820.00 ▇▁▁▁▁
Specific Conductivity 351520 0.01 2600.53 9322.27 0.00 45.30 49.56 54.89 53465.00 ▇▁▁▁▁
Turbidity 285975 0.19 2.54 5.47 0.00 0.60 1.00 2.69 254.00 ▇▁▁▁▁
Total Kjeldahl Nitrogen 300334 0.15 0.46 0.40 0.00 0.18 0.36 0.62 9.20 ▇▁▁▁▁
Total Phosphorus 303465 0.14 0.03 0.08 0.00 0.01 0.01 0.03 3.38 ▇▁▁▁▁
Total Nitrogen 322127 0.09 0.69 3.90 0.00 0.20 0.34 0.55 296.21 ▇▁▁▁▁
Chlorophyll a, Uncorrected for Pheophytin 350110 0.01 0.62 2.09 0.04 0.22 0.32 0.49 54.60 ▇▁▁▁▁
Dissolved Oxygen Saturation 353751 0.00 93.37 16.77 36.23 84.46 94.53 100.72 147.89 ▁▂▇▂▁
Total Suspended Solids 349939 0.01 4.46 6.88 0.02 1.10 2.03 4.40 83.00 ▇▁▁▁▁
Nitrogen, organic 353999 0.00 32.90 26.56 0.00 17.32 27.23 45.02 292.07 ▇▁▁▁▁
Chlorophyll a, Corrected for Pheophytin 334548 0.06 3.70 9.42 0.00 0.39 0.97 3.22 747.00 ▇▁▁▁▁
Ammonia, Un-ionized (NH3) 341599 0.04 0.81 0.30 0.00 0.55 1.00 1.00 1.00 ▁▁▂▁▇

Additional Cleanup

ID column

To make analysis easier, an unique ID is given to each row to identify the spatial location. This identifier can be built from {source}-{site}-{depth}.

ID column Creation

create identifier column
df_wide$site_id <- paste(
  df_wide$source, 
  df_wide$site, 
  ifelse(is.na(df_wide$sample_depth), "NA", df_wide$sample_depth), 
  sep = "-"
)

# Display first 100 rows with scrolling
DT::datatable(
  head(df_wide, 100),
  options = list(
    scrollX = TRUE,
    scrollY = "400px",
    pageLength = 10
  ),
  caption = "Wide format data with site_id column (scroll to see all columns)"
)
reduce to only needed columns
# desired remaining columns: analytes, datetime, site_id, depth
# Get analyte column names (all columns except the id columns)
id_columns <- c("source", "site", "units", "latitude", "longitude", "sample_depth", "site_id", "datetime")
analyte_columns <- setdiff(names(df_wide), id_columns)

# Keep only datetime, site_id, sample_depth, and all analyte columns
df_wide <- df_wide %>%
  select(site_id, datetime, sample_depth, all_of(analyte_columns))

# Display first 100 rows with scrolling
DT::datatable(
  head(df_wide, 100),
  options = list(
    scrollX = TRUE,
    scrollY = "400px",
    pageLength = 10
  ),
  caption = "Cleaned wide format data with selected columns (scroll to see all columns)"
)

Final Wide Table Stats/Visualizations

print number of non-na values for each analyte column
id_columns <- c("source", "site", "units", "latitude", "longitude", "sample_depth", "site_id", "datetime")
analyte_columns <- setdiff(names(df_wide), id_columns)

library(dplyr)
library(tidyr)
library(knitr)

df_wide %>%
  summarise(across(all_of(analyte_columns), ~ sum(!is.na(.)))) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Analyte",
    values_to = "Valid_Count"
  ) %>%
  kable(caption = "Number of Valid Values per Analyte")
Number of Valid Values per Analyte
Analyte Valid_Count
Water Temperature 91069
Salinity 87863
Dissolved Oxygen 77800
Ammonium, Filtered (NH4) 34828
Nitrite (NO2) 46648
Nitrate (NO3) 32151
NO2+3, Filtered 61294
Phosphate, Filtered (PO4) 43059
pH 63448
Specific Conductivity 2719
Turbidity 68264
Total Kjeldahl Nitrogen 53905
Total Phosphorus 50774
Total Nitrogen 32112
Chlorophyll a, Uncorrected for Pheophytin 4129
Dissolved Oxygen Saturation 488
Total Suspended Solids 4300
Nitrogen, organic 240
Chlorophyll a, Corrected for Pheophytin 19691
Ammonia, Un-ionized (NH3) 12640
summary statistics of cleaned wide format data
library(skimr)

# Skim the numeric columns (analyte measurements)
numeric_cols <- sapply(df_wide, is.numeric)
if (sum(numeric_cols) > 0) {
  cat("Summary of numeric analyte columns:\n\n")
  skim(df_wide[, numeric_cols])
} else {
  cat("No numeric columns found in wide format data.\n")
}
Summary of numeric analyte columns:
Data summary
Name df_wide[, numeric_cols]
Number of rows 354239
Number of columns 21
_______________________
Column type frequency:
numeric 21
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sample_depth 198581 0.44 0.44 0.20 0.00 0.50 0.50 0.50 1.00 ▃▁▇▂▁
Water Temperature 263170 0.26 19.55 11.63 0.00 2.10 24.60 28.45 74.65 ▃▇▂▁▁
Salinity 266376 0.25 18.52 16.10 0.00 1.16 22.72 34.57 933.00 ▇▁▁▁▁
Dissolved Oxygen 276439 0.22 4.45 11.33 0.00 1.51 5.05 6.45 2697.44 ▇▁▁▁▁
Ammonium, Filtered (NH4) 319411 0.10 0.53 1.86 0.00 0.01 0.02 0.25 76.39 ▇▁▁▁▁
Nitrite (NO2) 307591 0.13 0.03 0.14 0.00 0.00 0.00 0.01 14.10 ▇▁▁▁▁
Nitrate (NO3) 322088 0.09 0.35 3.51 -0.01 0.00 0.01 0.04 216.06 ▇▁▁▁▁
NO2+3, Filtered 292945 0.17 0.27 2.63 0.00 0.00 0.01 0.06 218.00 ▇▁▁▁▁
Phosphate, Filtered (PO4) 311180 0.12 0.15 0.28 0.00 0.00 0.01 0.10 2.60 ▇▁▁▁▁
pH 290791 0.18 4.05 4.67 0.00 1.00 1.78 7.68 820.00 ▇▁▁▁▁
Specific Conductivity 351520 0.01 2600.53 9322.27 0.00 45.30 49.56 54.89 53465.00 ▇▁▁▁▁
Turbidity 285975 0.19 2.54 5.47 0.00 0.60 1.00 2.69 254.00 ▇▁▁▁▁
Total Kjeldahl Nitrogen 300334 0.15 0.46 0.40 0.00 0.18 0.36 0.62 9.20 ▇▁▁▁▁
Total Phosphorus 303465 0.14 0.03 0.08 0.00 0.01 0.01 0.03 3.38 ▇▁▁▁▁
Total Nitrogen 322127 0.09 0.69 3.90 0.00 0.20 0.34 0.55 296.21 ▇▁▁▁▁
Chlorophyll a, Uncorrected for Pheophytin 350110 0.01 0.62 2.09 0.04 0.22 0.32 0.49 54.60 ▇▁▁▁▁
Dissolved Oxygen Saturation 353751 0.00 93.37 16.77 36.23 84.46 94.53 100.72 147.89 ▁▂▇▂▁
Total Suspended Solids 349939 0.01 4.46 6.88 0.02 1.10 2.03 4.40 83.00 ▇▁▁▁▁
Nitrogen, organic 353999 0.00 32.90 26.56 0.00 17.32 27.23 45.02 292.07 ▇▁▁▁▁
Chlorophyll a, Corrected for Pheophytin 334548 0.06 3.70 9.42 0.00 0.39 0.97 3.22 747.00 ▇▁▁▁▁
Ammonia, Un-ionized (NH3) 341599 0.04 0.81 0.30 0.00 0.55 1.00 1.00 1.00 ▁▁▂▁▇

Analyte Co-Occurrence

Code
library(dplyr)
library(tidyr)
library(ggplot2)

# 1. compute analyte order (most -> least)
analyte_order <- df_wide %>%
  summarise(across(all_of(analyte_columns), ~ sum(!is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Analyte", values_to = "Count") %>%
  arrange(desc(Count)) %>%
  pull(Analyte)

# 2. presence matrix using sorted analytes
presence <- df_wide %>%
  mutate(across(all_of(analyte_order),
                ~ !is.na(suppressWarnings(as.numeric(.))))) %>%
  select(all_of(analyte_order))

# 3. co-occurrence matrix (rows/cols are in analyte_order)
cooccurrence_matrix <- t(presence) %*% as.matrix(presence)

# 4. tidy it and set factor levels explicitly
cooccurrence_df <- as.data.frame(cooccurrence_matrix) %>%
  mutate(Analyte1 = row.names(.)) %>%
  pivot_longer(-Analyte1,
               names_to = "Analyte2",
               values_to = "Cooccurrence") %>%
  mutate(
    Analyte1 = factor(Analyte1, levels = analyte_order),
    Analyte2 = factor(Analyte2, levels = analyte_order)
  ) %>%
  # Keep only the lower triangle (row >= column)
  filter(as.numeric(Analyte1) <= as.numeric(Analyte2))

# 5. plot with y-axis reversed
ggplot(cooccurrence_df, aes(x = Analyte1, y = Analyte2, fill = Cooccurrence)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(colors = rainbow(7)) +
  scale_y_discrete(limits = rev(levels(cooccurrence_df$Analyte2))) +  # <-- reverse y-axis
  labs(
    title = "Analyte Co-occurrence Matrix",
    x = NULL,
    y = NULL,
    fill = "Co-occurrence"
  ) +
  coord_fixed() +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, size = 6, vjust = 0.5, hjust = 1),
    axis.text.y = element_text(angle = 45, size = 6, hjust = 1),
    panel.grid = element_blank()
  )

Export Wide Format Data

save wide format data to CSV
# Export the wide format data
write.csv(df_wide, here("data", "exports", "allData_wide.csv"), row.names = FALSE)

cat("Wide format data exported to: data/exports/allData_wide.csv\n")

Note: Ideas for next steps: - Multivariate analysis where you need multiple parameters as separate variables - Creating correlation matrices between different analytes - Machine learning tasks that expect features as columns - Quick comparison of multiple parameters at the same observation point