The latest release of the baseballr package includes a function for acquiring player statistics
from the NCAA’s website for baseball teams across the three major divisions (I, II, III).
The function, ncaa_scrape, requires the user to pass values for three parameters:
school_id: the numerical code used by the NCAA for each school
year: a four-digit year
type: whether to pull data for batters or pitchers
If you want to pull batting statistics for Vanderbilt for the 2021 season, you would use the following:
library(baseballr)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.1
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ncaa_scrape(736, 2021, "batting") %>%
select(year:OBPct)
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ──────── baseballr 1.3.0 ──
#> ℹ Data updated: 2022-09-09 03:29:54 EDT
#> # A tibble: 41 × 12
#> year school confer…¹ divis…² Jersey Player Yr Pos GP GS BA
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2021 Vanderbilt SEC 1 51 Bradf… Fr OF 67 67 0.336
#> 2 2021 Vanderbilt SEC 1 25 Nolan… So INF 66 66 0.26
#> 3 2021 Vanderbilt SEC 1 99 Gonza… So INF 61 58 0.28
#> 4 2021 Vanderbilt SEC 1 9 Young… So INF 61 61 0.252
#> 5 2021 Vanderbilt SEC 1 12 Keega… Jr UT 60 60 0.345
#> 6 2021 Vanderbilt SEC 1 8 Thoma… Jr OF 59 57 0.305
#> 7 2021 Vanderbilt SEC 1 5 Rodri… So C 58 52 0.249
#> 8 2021 Vanderbilt SEC 1 16 Bulge… Fr UT 50 41 0.274
#> 9 2021 Vanderbilt SEC 1 6 Kolwy… Jr INF 43 39 0.29
#> 10 2021 Vanderbilt SEC 1 19 LaNev… So OF 37 19 0.286
#> # … with 31 more rows, 1 more variable: OBPct <dbl>, and abbreviated variable
#> # names ¹conference, ²division
The same can be done for pitching, just by changing the type parameter:
ncaa_scrape(736, 2021, "pitching") %>%
select(year:ERA)
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ──────── baseballr 1.3.0 ──
#> ℹ Data updated: 2022-09-09 03:29:55 EDT
#> # A tibble: 41 × 12
#> year school confer…¹ divis…² Jersey Player Yr Pos GP App GS
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2021 Vanderbilt SEC 1 51 Bradf… Fr OF 67 67 NA
#> 2 2021 Vanderbilt SEC 1 25 Nolan… So INF 66 66 NA
#> 3 2021 Vanderbilt SEC 1 99 Gonza… So INF 61 61 NA
#> 4 2021 Vanderbilt SEC 1 9 Young… So INF 61 61 NA
#> 5 2021 Vanderbilt SEC 1 12 Keega… Jr UT 60 60 NA
#> 6 2021 Vanderbilt SEC 1 8 Thoma… Jr OF 59 59 NA
#> 7 2021 Vanderbilt SEC 1 5 Rodri… So C 58 58 NA
#> 8 2021 Vanderbilt SEC 1 16 Bulge… Fr UT 50 50 NA
#> 9 2021 Vanderbilt SEC 1 6 Kolwy… Jr INF 43 43 NA
#> 10 2021 Vanderbilt SEC 1 19 LaNev… So OF 37 37 NA
#> # … with 31 more rows, 1 more variable: ERA <dbl>, and abbreviated variable
#> # names ¹conference, ²division
The function depends on the user knowing the school_id used by the NCAA website. For that reason,
I’ve included an ncaa_school_id_lu function so that users can find the school_id they need.
Just pass a string to the function and it will return possible matches based on the school’s name:
ncaa_school_id_lu("Vand")
#> # A tibble: 10 × 6
#> school conference school_id year division conference_id
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Vanderbilt SEC 736 2013 1 911
#> 2 Vanderbilt SEC 736 2014 1 911
#> 3 Vanderbilt SEC 736 2015 1 911
#> 4 Vanderbilt SEC 736 2016 1 911
#> 5 Vanderbilt SEC 736 2017 1 911
#> 6 Vanderbilt SEC 736 2018 1 911
#> 7 Vanderbilt SEC 736 2019 1 911
#> 8 Vanderbilt SEC 736 2020 1 911
#> 9 Vanderbilt SEC 736 2021 1 911
#> 10 Vanderbilt SEC 736 2022 1 911
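Because the lookup returns one row per school and season, it can help to narrow the result to a single school_id before scraping. The following is a minimal sketch of that step, not part of baseballr: pick_school_id is a hypothetical helper, and the small lookup table is copied by hand from two rows of the output above so the example runs without a network connection. It assumes the lookup has the school, year, and school_id columns shown above.

```r
library(dplyr)

# Hypothetical helper (not part of baseballr): given a table shaped like the
# ncaa_school_id_lu() output, return the single school_id that matches an
# exact school name and season.
pick_school_id <- function(lookup, school_name, season) {
  ids <- lookup %>%
    filter(school == school_name, year == season) %>%
    distinct(school_id) %>%
    pull(school_id)
  # Fail loudly if the name/season pair is missing or ambiguous.
  if (length(ids) != 1) {
    stop("Expected exactly one school_id, found ", length(ids))
  }
  ids
}

# Two rows copied from the lookup output above, so the sketch runs offline.
lookup <- tibble(
  school        = c("Vanderbilt", "Vanderbilt"),
  conference    = c("SEC", "SEC"),
  school_id     = c(736, 736),
  year          = c(2020, 2021),
  division      = c(1, 1),
  conference_id = c(911, 911)
)

vandy_id <- pick_school_id(lookup, "Vanderbilt", 2021)
```

With a live session, the hand-built table could be swapped for ncaa_school_id_lu("Vand"), and the resulting vandy_id passed as the first argument to ncaa_scrape in place of the hard-coded 736.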