Skip to contents

This vignette compares fish condition data produced with the NEesp2::species_condition function to the condition data in the ecodata package. The NEesp2::species_condition function was built based on Laurel’s condition code. This analysis validates whether the function can be used to automate the workflow for the 2026 SOE.

Create data with NEesp2::species_condition

I’m creating condition data with two different survdat pulls, one from 2024 and one from mid-2025. I think the 2024 pull is the same one used to create the current ecodata condition data; it is on github here. The 2025 pull is a more recent pull that includes data through 2024.

# using a recent survdat pull:
all_condition <- readRDS(here::here("data-raw/condition.rds"))$survdat |>
  NEesp2::species_condition(
    LWparams = NEesp2::LWparams,
    species.codes = NEesp2::species.codes,
    output = "soe"
  )
## Joining with `by = join_by(STRATUM)`
## Joining with `by = join_by(count)`
## Removing 16811 outliers from the data set.
# using an older survdat pull:
load(here::here("data-raw/Survdat_2024.Rdata"))
all_condition_2024 <- survdat |>
  NEesp2::species_condition(
    LWparams = NEesp2::LWparams,
    species.codes = NEesp2::species.codes,
    output = "soe"
  )
## Joining with `by = join_by(STRATUM)`
## Joining with `by = join_by(count)`
## Removing 16239 outliers from the data set.

Different numbers of outliers were removed from each dataset. That could be because the more recent survdat pull has some additional data points that are being treated as outliers, or something else is going on.

Compare data

A series of plots to compare both condition calculations to what’s in ecodata.

Acadian redfish condition comparison

American plaice condition comparison

Atlantic cod condition comparison

Atlantic croaker condition comparison

Atlantic halibut condition comparison

Atlantic herring condition comparison

Atlantic mackerel condition comparison

Atlantic menhaden condition comparison

Atlantic wolffish condition comparison

Barndoor skate condition comparison

Black sea bass condition comparison

Blackbelly rosefish condition comparison

Bluefish condition comparison

Bluntnose stingray condition comparison

Bullnose ray condition comparison

Butterfish condition comparison

Clearnose skate condition comparison

Cownose ray condition comparison

Cusk condition comparison

Fourspot condition comparison

Goosefish condition comparison

Haddock condition comparison

Little skate condition comparison

Longhorn sculpin condition comparison

Ocean pout condition comparison

Offshore hake condition comparison

Pollock condition comparison

Red hake condition comparison

Rosette skate condition comparison

Roughtail stingray condition comparison

Scup condition comparison

Sea raven condition comparison

Silver hake condition comparison

Smooth dogfish condition comparison

Smooth skate condition comparison

Spiny butterfly ray condition comparison

Spiny dogfish condition comparison

Spot condition comparison

Spotted hake condition comparison

Summer flounder condition comparison

Thorny skate condition comparison

Weakfish condition comparison

White hake condition comparison

Windowpane condition comparison

Winter flounder condition comparison

Winter skate condition comparison

Witch flounder condition comparison

Yellowtail flounder condition comparison

Assess deviations

There are some small differences between the newly-calculated data and ecodata. These next analyses look at squared deviations from the ecodata values, to see if there is a systematic bias in any of the data.

full_data <- ecodata::condition |>
  dplyr::rename(ecodata_value = Value) |>
  dplyr::full_join(
    all_condition |>
      dplyr::rename(new_value = Value)
  ) |>
  dplyr::full_join(
    all_condition_2024 |>
      dplyr::rename(new_value_2024 = Value))
## Joining with `by = join_by(Time, Var, EPU, Units)`
## Joining with `by = join_by(Time, Var, EPU, Units)`
dev_data <- full_data |>
  dplyr::mutate(dev_squared = (ecodata_value - new_value)^2,
                dev_squared_2024 = (ecodata_value - new_value_2024)^2)

By EPU

EPUs across all years

EPUs summed by year

By species

Species across all years

Species summed by year

By year

What’s going on with mackerel?

Mackerel has a notable difference in the Scotian Shelf value in 2015. First, recalculate the data:

survdat_2024 <- survdat |>
  dplyr::filter(SVSPP == 121)

survdat_2025 <- readRDS(here::here("data-raw/condition.rds"))$survdat  |>
  dplyr::filter(SVSPP == 121)

cond_comparison <- survdat_2024 |>
  NEesp2::species_condition(
    LWparams = NEesp2::LWparams,
    species.codes = NEesp2::species.codes,
    output = "soe"
  ) |>
  dplyr::mutate(source = "survdat_2024") |>
  dplyr::bind_rows(
    survdat_2025 |>
      NEesp2::species_condition(
        LWparams = NEesp2::LWparams,
        species.codes = NEesp2::species.codes,
        output = "soe"
      ) |>
      dplyr::mutate(source = "survdat_2025")
  ) |>
  dplyr::bind_rows(ecodata::condition |>
                     dplyr::filter(Var == "Atlantic mackerel") |>
                     dplyr::mutate(source = "ecodata"))
## Joining with `by = join_by(STRATUM)`
## Joining with `by = join_by(count)`
## Removing 359 outliers from the data set.
## Joining with `by = join_by(STRATUM)`
## Joining with `by = join_by(count)`
## Removing 400 outliers from the data set.

Then plot:

cond_comparison |> 
  ggplot2::ggplot(ggplot2::aes(x = Time,
                               y = Value,
                               color = source)) +
  ggplot2::geom_line()+
  ggplot2::geom_point() +
  ggplot2::theme_bw() +
  ggplot2::facet_wrap(~EPU)

Scotian Shelf in 2015

2024 survdat pull

Take a look at what’s in the 2024 dataset.

mackerel_2024 <- survdat_2024 |>
      dplyr::left_join(NEesp2::strata_epu_key) |>
  dplyr::filter(EPU == "SS",
                YEAR == 2015,
                SEASON == "FALL") |>
  dplyr::select(CRUISE6, STRATUM, LENGTH, LAT, LON, EST_TOWDATE, INDID, INDWT, SEX)

knitr::kable(mackerel_2024)
CRUISE6 STRATUM LENGTH LAT LON EST_TOWDATE INDID INDWT SEX
201504 1340 28 43.33608 -66.67322 2015-10-17 15:35:26 152374 0.195 2
201504 1340 30 43.33608 -66.67322 2015-10-17 15:35:26 152338 0.270 1
201504 1340 31 43.33608 -66.67322 2015-10-17 15:35:26 152357 0.315 2
201504 1340 32 43.33608 -66.67322 2015-10-17 15:35:26 152247 0.293 1
201504 1340 32 43.33608 -66.67322 2015-10-17 15:35:26 152344 0.320 1
201504 1340 33 43.83454 -66.73564 2015-10-18 01:38:50 153828 0.336 2
201504 1351 17 44.27302 -66.70087 2015-10-18 08:31:51 154973 0.051 2
201504 1351 18 44.27302 -66.70087 2015-10-18 08:31:51 154967 0.060 2
201504 1340 34 44.03344 -66.94544 2015-10-18 14:47:12 156565 0.455 2
201504 1340 31 43.88735 -67.20750 2015-10-18 17:30:51 157004 0.335 2
201504 1340 33 43.88735 -67.20750 2015-10-18 17:30:51 157001 0.410 1
201504 1340 33 43.88735 -67.20750 2015-10-18 17:30:51 156998 0.365 1

2025 survdat pull

Take a look at what’s in the 2025 dataset.

mackerel_2025 <- survdat_2025 |>
      dplyr::left_join(NEesp2::strata_epu_key) |>
  dplyr::filter(EPU == "SS",
                YEAR == 2015,
                SEASON == "FALL") |>
  dplyr::select(CRUISE6, STRATUM, LENGTH, LAT, LON, EST_TOWDATE, INDID, INDWT, SEX)

knitr::kable(mackerel_2025)
CRUISE6 STRATUM LENGTH LAT LON EST_TOWDATE INDID INDWT SEX
201504 1340 33 43.83454 -66.73564 2015-10-18 02:38:50 153828 0.336 2
201504 1351 17 44.27302 -66.70087 2015-10-18 09:31:51 154973 0.051 2
201504 1351 18 44.27302 -66.70087 2015-10-18 09:31:51 154967 0.060 2
201504 1340 34 44.03344 -66.94544 2015-10-18 15:47:12 156565 0.455 2
201504 1340 31 43.88735 -67.20750 2015-10-18 18:30:51 157004 0.335 2
201504 1340 33 43.88735 -67.20750 2015-10-18 18:30:51 157001 0.410 1
201504 1340 33 43.88735 -67.20750 2015-10-18 18:30:51 156998 0.365 1

Looks like the October 17, 2015 tow is missing from the 2025 pull.

Double check the input data:

test_survdat <- readRDS(here::here("data-raw/condition.rds"))$survdat  |>
  dplyr::filter(stringr::str_detect(EST_TOWDATE, "2015-10-17"))

test_survdat$SVSPP |> unique() |> sort()
##  [1]  15  22  27  28  32  35  72  73  74  75  76  77  84 101 102 106 107 131 155
## [20] 156 163 193 197 301 312 502

Mackerel isn’t in there (SVSPP 121).

What’s going on with witch flounder?

Witch flounder has a notable difference in the Mid-Atlantic Bight value in 2015. First, recalculate the data:

survdat_2024 <- survdat |>
  dplyr::filter(SVSPP == 107)

survdat_2025 <- readRDS(here::here("data-raw/condition.rds"))$survdat  |>
  dplyr::filter(SVSPP == 107)

cond_2024 <- survdat_2024 |>
  NEesp2::species_condition(
    LWparams = NEesp2::LWparams,
    species.codes = NEesp2::species.codes,
    record_outliers = TRUE,
    output = "soe"
  )  
## Joining with `by = join_by(STRATUM)`
## Joining with `by = join_by(count)`
## Removing 390 outliers from the data set.
cond_2025 <- survdat_2025 |>
      NEesp2::species_condition(
        LWparams = NEesp2::LWparams,
        species.codes = NEesp2::species.codes,
    record_outliers = TRUE,
        output = "soe"
      ) 
## Joining with `by = join_by(STRATUM)`
## Joining with `by = join_by(count)`
## Removing 424 outliers from the data set.
cond_comparison <- cond_2024$condition |>
  dplyr::mutate(source = "survdat_2024") |> 
  dplyr::bind_rows(cond_2025$condition|>
      dplyr::mutate(source = "survdat_2025")) |>
  dplyr::bind_rows(ecodata::condition |>
                     dplyr::filter(Var == "Witch flounder") |>
                     dplyr::mutate(source = "ecodata"))

Plot the data:

cond_comparison |>
  ggplot2::ggplot(ggplot2::aes(x = Time,
                               y = Value,
                               color = source)) +
  ggplot2::geom_line()+
  ggplot2::geom_point() +
  ggplot2::theme_bw() +
  ggplot2::facet_wrap(~EPU)

Mid-Atlantic Bight in 2015

2024 survdat pull

witch_2024 <- survdat_2024 |>
      dplyr::left_join(NEesp2::strata_epu_key) |>
  dplyr::filter(EPU == "MAB",
                YEAR == 2015,
                SEASON == "FALL") |>
  dplyr::select(CRUISE6, STRATUM, LENGTH, LAT, LON, EST_TOWDATE, INDID, INDWT, SEX)

knitr::kable(witch_2024)
CRUISE6 STRATUM LENGTH LAT LON EST_TOWDATE INDID INDWT SEX
201504 1720 22 38.47240 -73.50522 2015-09-02 18:17:17 2537 0.046 1
201504 1020 30 39.91528 -73.08380 2015-09-13 01:45:02 59094 0.152 1
201504 1070 29 40.21977 -71.00322 2015-09-16 22:04:58 76958 0.162 1
201504 1060 34 40.49986 -70.79864 2015-09-17 02:44:47 78183 0.237 1

2025 survdat pull

witch_2025 <- survdat_2025 |>
      dplyr::left_join(NEesp2::strata_epu_key) |>
  dplyr::filter(EPU == "MAB",
                YEAR == 2015,
                SEASON == "FALL") |>
  dplyr::select(CRUISE6, STRATUM, LENGTH, LAT, LON, EST_TOWDATE, INDID, INDWT, SEX)

knitr::kable(witch_2025)
CRUISE6 STRATUM LENGTH LAT LON EST_TOWDATE INDID INDWT SEX
201504 1720 22 38.47240 -73.50522 2015-09-02 19:17:17 2537 0.046 1
201504 1020 30 39.91528 -73.08380 2015-09-13 02:45:02 59094 0.152 1
201504 1070 29 40.21977 -71.00322 2015-09-16 23:04:58 76958 0.162 1
201504 1060 34 40.49986 -70.79864 2015-09-17 03:44:47 78183 0.237 1

The two datasets are the same. Is the discrepancy caused by outliers being removed?

Outlier analysis

Outliers in the 2024 data:

# cond_comparison |>
#   dplyr::filter(Time == 2015) |>
#   dplyr::arrange(EPU)

cond_2024$outliers |>
  dplyr::filter(YEAR == 2015)  |>
  dplyr::select(CRUISE6, STRATUM, LENGTH, LAT, LON, EST_TOWDATE, INDID, INDWT, SEX) |>
      dplyr::left_join(NEesp2::strata_epu_key) |>
  dplyr::filter(EPU == "MAB") |>
  dplyr::arrange(INDWT) |>
  nrow()
## [1] 0

There are no outliers in the 2024 data set.

Outliers in the 2025 data:

cond_2025$outliers |>
  dplyr::filter(YEAR == 2015)  |>
  dplyr::select(CRUISE6, STRATUM, LENGTH, LAT, LON, EST_TOWDATE, INDID, INDWT, SEX) |>
        dplyr::left_join(NEesp2::strata_epu_key) |>
  dplyr::filter(EPU == "MAB") |>
   dplyr::arrange(INDWT) |>
  knitr::kable()
CRUISE6 STRATUM LENGTH LAT LON EST_TOWDATE INDID INDWT SEX EPU
201504 1720 22 38.4724 -73.50522 2015-09-02 19:17:17 2537 0.046 1 MAB

The 2025 data set has one outlier. This data point is considered an outlier in the 2025 data, but not in the 2024 data, which causes the difference in the 2015 condition value in the MAB.

Take a look at the outlier:

plt <- survdat_2024 |>
  dplyr::mutate(source = "2024") |>
  dplyr::bind_rows(
    survdat_2025 |>
      dplyr::mutate(source = "2025")
  ) |>
  dplyr::filter(SEASON == "FALL",
                !is.na(LENGTH),
                !is.na(INDWT)) |>
  ggplot2::ggplot(ggplot2::aes(x = LENGTH,
                               y = INDWT,
                               color = source,
                               shape = source)) +
  ggplot2::geom_point(cex = 2,
                      alpha = 0.5) +
  ggplot2::theme_bw()

plt +
  ggplot2::geom_point(ggplot2::aes(x = LENGTH,
                               y = INDWT),
                      data = witch_2025[1,],
                      inherit.aes = FALSE,
                      color = "black")

The black point is the outlier, as defined by having a relative condition < 2SD below the species’ mean relative condition over all years. I don’t think it should be treated as an outlier.

Summary

Some discrepancies are caused by differences in the underlying survdat data. These should be validated against the current data in the NEFSC database. Other discrepancies are caused by differences in outlier removal. Since outliers are defined based on standard deviations, the addition of more data changes the outlier cutoff. We would need to update the outlier removal method to be more consistent over time if we want to avoid this issue in the future.