Hey, everyone, I’m Andrew Weatherman, creator of toRvik and lover of
college basketball analytics. The goal of toRvik is to expand access to
reliable, high-quality CBB statistics. While analogous packages exist to
pull data, like Saiem Gilani’s brilliant hoopR, toRvik requires no paid
subscription or set-up and can be used immediately by anyone with just a
few lines of code.
toRvik
# You can install using {pacman} with the following code:
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("andreweatherman/toRvik", dependencies = TRUE, update = TRUE)
toRvik
toRvik is a package of scrapers that pull data from Barttorvik, a popular
college basketball analytics website, and return it in tidy format.
Barttorvik splits its data on a number of variables and hosts detailed
player and game statistics, while serving as a reputable,
industry-recognized metric rating system. Generally speaking, all data is
available back to the 2007-08 season. More information about Barttorvik,
its data, and its metric rating system can be found here.
Package functions are syntactically structured to point to their data
source (e.g. by ‘player,’ ‘game,’ etc.) and should be considered get
functions by nature. As of toRvik version 1.0.3, the package exports
nearly 30 functions covering the website and its data. Some highlights
include:
toRvik requires no set-up and can be instantly executed in any session.
To understand the package, the T-Rank functions, which pull and split
Barttorvik’s metric rating system, are an excellent place to start. Let’s
take a glance at the top teams in T-Rank using toRvik:
tictoc::tic()
toRvik::bart_ratings(year=2022) %>%
  utils::head(10)
#> ── Team Ratings: 2022 ────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:24 EDT
#> # A tibble: 10 × 19
#> team conf barthag barth…¹ adj_o adj_o…² adj_d adj_d…³ adj_t adj_t…⁴ wab
#> <chr> <chr> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl>
#> 1 Gonzaga WCC 0.966 1 120. 4 89.9 9 72.6 5 6.71
#> 2 Houston Amer 0.959 2 117. 10 88.5 6 63.7 336 6.15
#> 3 Kansas B12 0.958 3 120. 5 91.3 13 69.1 71 10.4
#> 4 Texas … B12 0.951 4 111. 41 85.4 1 66.3 223 6.57
#> 5 Baylor B12 0.949 5 118. 8 91.3 14 67.6 149 8.91
#> 6 Duke ACC 0.944 6 123. 1 96.0 53 67.4 161 7.19
#> 7 Tennes… SEC 0.944 7 111. 34 87.1 3 67.4 164 7.96
#> 8 Villan… BE 0.935 8 117. 9 93.0 26 62.2 350 7.29
#> 9 Arizona P12 0.934 9 118. 7 93.7 35 72.3 9 8.76
#> 10 UCLA P12 0.932 10 116. 12 92.2 20 65.4 274 5.06
#> # … with 8 more variables: nc_elite_sos <int>, nc_fut_sos <dbl>,
#> # nc_cur_sos <dbl>, ov_elite_sos <int>, ov_fut_sos <dbl>, ov_cur_sos <dbl>,
#> # seed <dbl>, year <int>, and abbreviated variable names ¹barthag_rk,
#> # ²adj_o_rk, ³adj_d_rk, ⁴adj_t_rk
tictoc::toc()
#> 0.37 sec elapsed
Here, the bart_ratings function returned the top ten teams in T-Rank for
the 2021-22 season. We are also presented with each team’s adjusted
efficiencies, their adjusted tempo, and two forms of strength of schedule
(documented in bart_ratings). But what if we want these same measures in
home games only? We would use bart_factors and input ‘H’ as location:
tictoc::tic()
toRvik::bart_factors(location='H') %>%
  utils::head(10)
#> ── Team Factors ──────────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:25 EDT
#> # A tibble: 10 × 22
#> team conf rating rank adj_o adj_o…¹ adj_d adj_d…² tempo off_ppp off_efg
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Houston Amer 32.7 1 116. 15 83.0 1 66.1 117. 54.2
#> 2 Gonzaga WCC 29.7 2 120. 5 89.9 18 72.7 123. 60.1
#> 3 Baylor B12 28.8 3 116. 9 87.6 9 69.3 116. 55.0
#> 4 Villano… BE 28.8 4 123. 2 94.0 50 63.2 122. 57.6
#> 5 Purdue B10 28.4 5 124. 1 96.0 81 67.9 125. 58.2
#> 6 Auburn SEC 27.6 6 115. 17 87.5 8 72.9 113. 53.1
#> 7 Texas T… B12 27.5 7 116. 14 88.2 11 68.6 116. 57.3
#> 8 Tenness… SEC 27.5 8 113. 28 86.0 6 69.5 112. 53.1
#> 9 UCLA P12 26.5 9 116. 11 89.7 16 69.6 115. 54.3
#> 10 Texas B12 25.3 10 109. 65 83.4 2 63.8 109. 51.6
#> # … with 11 more variables: off_to <dbl>, off_or <dbl>, off_ftr <dbl>,
#> # def_ppp <dbl>, def_efg <dbl>, def_to <dbl>, def_or <dbl>, def_ftr <dbl>,
#> # wins <int>, losses <int>, games <int>, and abbreviated variable names
#> # ¹adj_o_rank, ²adj_d_rank
tictoc::toc()
#> 0.43 sec elapsed
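One way to read a split like this is to set it beside the full-season numbers. The sketch below is only an illustration: rather than calling the package, it hand-copies the rounded adj_d values printed in the two outputs above for four teams whose names were not truncated, and it assumes both outputs describe the same season. The names `overall`, `home`, `both`, and `home_edge` are my own, not part of toRvik.

```r
# Adjusted defensive efficiency (adj_d), copied from the printed output above.
# Values are rounded as displayed, so treat the differences as approximate.
overall <- data.frame(
  team  = c("Gonzaga", "Houston", "Baylor", "UCLA"),
  adj_d = c(89.9, 88.5, 91.3, 92.2)
)
home <- data.frame(
  team  = c("Houston", "Gonzaga", "Baylor", "UCLA"),
  adj_d = c(83.0, 89.9, 87.6, 89.7)
)

# Join on team; the suffixes keep the two adj_d columns apart.
both <- merge(overall, home, by = "team", suffixes = c("_all", "_home"))

# Negative values mean a stingier adjusted defense in home games.
both$home_edge <- both$adj_d_home - both$adj_d_all

# Sort so the largest home improvement comes first.
both[order(both$home_edge), c("team", "adj_d_all", "adj_d_home", "home_edge")]
```

With live data, the same merge should work on the full tibbles returned by bart_ratings and bart_factors, since both carry a team column.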
And now, we have four-factor data and metric ratings for home games only.
Both bart_factors and the analogous bart_conf_factors take venue, game
type, date range, and opponent strength as additional splits. Great, but
what if we want to explore rating trends over time? toRvik gives us that
ability with bart_archive, a function that pulls adjusted ratings and
projected records from the morning of a desired date:
tictoc::tic()
toRvik::bart_archive('20220113') %>%
  utils::head(10)
#> ── Archive Ratings ───────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:25 EDT
#> # A tibble: 10 × 16
#> rank team conf record barthag adj_o adj_o…¹ adj_d adj_d…² adj_t…³ adj_t…⁴
#> <int> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <int> <dbl> <int>
#> 1 125 Abile… WAC 11-5 0.612 100. 239 96.2 46 71.3 58
#> 2 260 Air F… MWC 8-5 0.321 93.7 331 100. 114 63.9 342
#> 3 151 Akron MAC 9-4 0.535 105. 130 104. 188 64.9 327
#> 4 21 Alaba… SEC 11-5 0.880 117. 10 98.7 89 72.3 28
#> 5 330 Alaba… SWAC 4-10 0.155 88.5 356 103. 156 67.6 234
#> 6 305 Alaba… SWAC 4-13 0.199 95.8 309 108. 285 72.0 37
#> 7 303 Albany AE 5-10 0.205 91.3 351 103. 158 66.2 305
#> 8 262 Alcor… SWAC 4-11 0.316 98.5 266 105. 224 68.0 207
#> 9 347 Ameri… Pat 4-10 0.0910 93.9 329 115. 347 67.2 262
#> 10 188 Appal… SB 9-9 0.451 100. 235 102. 144 64.3 338
#> # … with 5 more variables: proj_record <chr>, proj_conf_record <chr>,
#> # wab <dbl>, wab_rk <int>, date <chr>, and abbreviated variable names
#> # ¹adj_o_rk, ²adj_d_rk, ³adj_tempo, ⁴adj_tempo_rk
tictoc::toc()
#> 0.33 sec elapsed
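To follow a trend rather than a single morning, you would need one archive pull per date. Here is a minimal sketch of how that might look. The helpers `archive_dates` and `pull_archive_range` are hypothetical names of mine, not toRvik functions. The only things taken from the example above are the 'YYYYMMDD' date string and the fact that the result carries a date column. The pause between requests is my own precaution, not a documented requirement.

```r
# Build 'YYYYMMDD' strings, the format passed to bart_archive() above.
archive_dates <- function(from, to, by = "week") {
  format(seq(as.Date(from), as.Date(to), by = by), "%Y%m%d")
}

# Pull one snapshot per date and stack them into a single data frame.
# `fetch` defaults to toRvik::bart_archive; it is a parameter so the
# helper can be tried with a stand-in function and no network access.
pull_archive_range <- function(dates, fetch = toRvik::bart_archive, pause = 1) {
  snapshots <- lapply(dates, function(d) {
    Sys.sleep(pause)  # wait briefly between requests
    fetch(d)
  })
  do.call(rbind, snapshots)
}

# Four weekly snapshots starting from the date used above.
dates <- archive_dates("2022-01-13", "2022-02-03")
dates

# trend <- pull_archive_range(dates)  # requires toRvik and a connection
```

Because each snapshot already includes its date, the stacked result can be filtered to one team and plotted against date without further bookkeeping.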
Perhaps the most valuable functions in toRvik concern granular analysis.
The package gives us the ability to explore advanced statistics at a
game-by-game level for every Division I player since the 2007-08 season
using bart_player_game.
tictoc::tic()
toRvik::bart_player_game(year=2022, stat='advanced', team='Duke') %>%
  dplyr::arrange(desc(net)) %>%
  utils::head(10)
#> ── Player Game Stats ─────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:26 EDT
#> # A tibble: 10 × 25
#> date year player exp team conf opp result min pts usg ortg
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2021-12-… 2022 AJ Gr… Fr Duke ACC Sout… W 22 19 16.7 214.
#> 2 2021-11-… 2022 Wende… Jr Duke ACC Army W 35 19 22.9 142.
#> 3 2021-11-… 2022 Wende… Jr Duke ACC Lafa… W 29 23 25.2 159.
#> 4 2022-01-… 2022 Mark … So Duke ACC Nort… W 27 19 25 144.
#> 5 2022-03-… 2022 Mark … So Duke ACC Cal … W 32 15 19.5 156.
#> 6 2021-11-… 2022 Paolo… Fr Duke ACC The … W 31 28 29.3 157.
#> 7 2022-03-… 2022 Paolo… Fr Duke ACC Texa… W 37 22 23.6 146.
#> 8 2022-01-… 2022 AJ Gr… Fr Duke ACC Loui… W 34 22 17.2 163.
#> 9 2022-03-… 2022 Trevo… Fr Duke ACC Pitt… W 34 27 25.9 175.
#> 10 2021-11-… 2022 AJ Gr… Fr Duke ACC Lafa… W 21 18 16.4 188.
#> # … with 13 more variables: or_pct <dbl>, dr_pct <dbl>, ast_pct <dbl>,
#> # to_pct <dbl>, stl_pct <dbl>, blk_pct <dbl>, bpm <dbl>, obpm <dbl>,
#> # dbpm <dbl>, net <dbl>, poss <dbl>, id <dbl>, game_id <chr>
tictoc::toc()
#> 0.59 sec elapsed
Here, bart_player_game returned the 10 highest individual net BPMs by
Duke players in the 2021-22 season. The function takes ‘box,’
‘shooting,’ and ‘advanced’ as stat inputs, and I welcome you to explore
each one in your own session. But what if we want to investigate similar
performance at a season level? Well, bart_player_season gives us that
option, also taking ‘box,’ ‘shooting,’ and ‘advanced’ as stat inputs.
tictoc::tic()
toRvik::bart_player_season(year=2022, stat='shooting', team='Duke') %>%
  dplyr::arrange(desc(mid_a)) %>%
  utils::head(5)
#> ── Player Season Stats ───────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:26 EDT
#> # A tibble: 5 × 33
#> player pos exp team conf g mpg ppg p_per usg ortg efg ts
#> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Paolo… Wing… Fr Duke ACC 39 33.0 17.2 20.9 27.2 111. 52 55.7
#> 2 Wende… Comb… Jr Duke ACC 39 33.9 13.4 15.8 20.3 121. 56.9 60.5
#> 3 AJ Gr… Wing… Fr Duke ACC 39 24.3 10.4 17.1 16.9 127. 61.3 63.0
#> 4 Jerem… Comb… So Duke ACC 39 29 8.62 11.9 17.7 105. 47.7 51.5
#> 5 Trevo… Comb… Fr Duke ACC 36 30.2 11.5 15.2 20.1 110. 49.6 52.0
#> # … with 20 more variables: ftm <int>, fta <int>, ft_pct <dbl>, two_m <int>,
#> # two_a <int>, two_pct <dbl>, three_m <int>, three_a <int>, three_pct <dbl>,
#> # dunk_m <dbl>, dunk_a <dbl>, dunk_pct <dbl>, rim_m <dbl>, rim_a <dbl>,
#> # rim_pct <dbl>, mid_m <dbl>, mid_a <dbl>, mid_pct <dbl>, year <int>,
#> # id <int>
tictoc::toc()
#> 0.38 sec elapsed
And now, we have a tibble of season-long shooting data for Duke players,
sorted by number of mid-range attempts. Advanced metric data can be
pulled by team on a per-game basis using bart_team_schedule, and total
team shooting splits can be accessed using bart_team_shooting. Game box
data can be pulled with bart_game_total.
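Since both player functions accept the same three stat inputs, it can be handy to grab all three tables in one go. The following is a rough sketch under that assumption. `pull_all_stats` is a hypothetical wrapper of my own, and the `fetch` parameter exists only so the idea can be tried with a stand-in function before pointing it at the real package.

```r
# The three stat inputs named in the text.
stat_types <- c("box", "shooting", "advanced")

# Call a player-stat function once per stat type and return a named list.
# Extra arguments (year, team, ...) are passed through unchanged.
pull_all_stats <- function(fetch = toRvik::bart_player_season, ...) {
  tables <- lapply(stat_types, function(s) fetch(stat = s, ...))
  stats::setNames(tables, stat_types)
}

# duke <- pull_all_stats(year = 2022, team = "Duke")  # requires toRvik
# names(duke)
```

Passing `fetch = toRvik::bart_player_game` should give the game-level equivalents, given that the text describes the same stat inputs for that function.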
Lastly for this introductory vignette, we will explore toRvik functions
for scraping tournament data. Spend any time on social media in college
basketball circles in March, and you will undoubtedly hear about ‘team
sheets,’ detailed repositories of strength and quality metrics used by
the seeding and selection committee. With bart_tourney_sheets, you can
pull ‘quick-hit’ team sheets in tidy format with just a single line of
code:
tictoc::tic()
toRvik::bart_tourney_sheets(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 × 16
#> team seed net kpi sor res_avg bpi kp sag qual_…¹ q1a q1
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 Gonzaga 1 1 5 7 6 1 1 1 1 5-2 10-3
#> 2 Arizona 1 2 3 2 2.5 3 2 2 2.3 4-2 6-3
#> 3 Houston 5 3 13 14 13.5 2 4 5 3.7 0-3 1-4
#> 4 Baylor 1 4 2 4 3 6 5 4 5 4-4 10-5
#> 5 Kentuc… 2 5 9 5 7 4 3 6 4.3 3-6 9-7
#> 6 Kansas 1 6 1 1 1 8 6 3 5.7 4-4 12-5
#> 7 Tennes… 3 7 4 3 3.5 5 7 7 6.3 4-7 11-7
#> 8 Villan… 2 8 7 8 7.5 7 11 9 9 5-4 7-6
#> 9 Texas … 3 9 17 12 14.5 13 9 14 12 5-5 8-9
#> 10 UCLA 4 10 11 15 13 9 8 10 9 2-4 5-4
#> # … with 4 more variables: q2 <chr>, q1_2 <chr>, q3 <chr>, q4 <chr>, and
#> # abbreviated variable name ¹qual_avg
tictoc::toc()
#> 0.97 sec elapsed
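A quick aside on the output above: the quadrant columns come back as character records such as "10-3", so they need splitting before you can do arithmetic on them. The sketch below shows one base-R way to do that, using q1 values copied from the printed rows. The new column names (`q1_wins`, `q1_losses`, `q1_win_pct`) are my own.

```r
# Quadrant 1 records copied from the printed output above.
sheets <- data.frame(
  team = c("Gonzaga", "Arizona", "Houston", "Baylor", "Kansas"),
  q1   = c("10-3", "6-3", "1-4", "10-5", "12-5"),
  stringsAsFactors = FALSE
)

# Split each "wins-losses" string and convert the pieces to integers.
parts <- strsplit(sheets$q1, "-", fixed = TRUE)
sheets$q1_wins   <- as.integer(vapply(parts, `[`, character(1), 1))
sheets$q1_losses <- as.integer(vapply(parts, `[`, character(1), 2))

# Share of Quadrant 1 games won.
sheets$q1_win_pct <- sheets$q1_wins / (sheets$q1_wins + sheets$q1_losses)

# Best Quadrant 1 win rate first.
sheets[order(-sheets$q1_win_pct), ]
```

The same split should apply to the other record columns (q1a, q2, and so on), assuming they follow the same "wins-losses" pattern.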
Returned are sheets of top teams sorted by their NCAA NET ranking.
Because this function relies on NET data, it is only available back to
the 2018-19 season. In-season performance is valuable, but what if you
want to investigate just tournament data? Well, toRvik gives you two
options to do so: bart_tourney_odds and bart_tourney_results. The former
returns metric-adjusted round probabilities by split. Let’s explore round
odds for the 2022 NCAA Tournament:
tictoc::tic()
toRvik::bart_tourney_odds(year=2022, odds='pre') %>%
  dplyr::arrange(desc(s16)) %>%
  utils::head(10)
#> # A tibble: 10 × 11
#> seed region team conf r64 r32 s16 e8 f4 f2 champ
#> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 West Gonzaga WCC 100 96.6 81.9 69.6 52 38.5 27.5
#> 2 1 Midwest Kansas B12 100 96.3 73.7 48.7 32.5 17.7 8.5
#> 3 1 South Arizona P12 100 94.8 72.7 37.3 21.2 12 5.4
#> 4 1 East Baylor B12 100 94.9 72.5 42.9 25.2 11.1 5.8
#> 5 2 Midwest Auburn SEC 100 91.5 70 48.4 24.8 11.7 4.8
#> 6 2 West Duke ACC 100 94.1 69.8 38.9 15.5 8.2 4
#> 7 3 West Texas Tech B12 100 92.6 68.4 40.9 17.1 9.5 5
#> 8 3 South Tennessee SEC 100 92.3 67.5 41 20.8 11.6 5.2
#> 9 5 Midwest Iowa B10 100 84.3 64.5 32.2 19.3 9.2 3.7
#> 10 2 South Villanova BE 100 90.8 63.6 34.6 16.1 8.4 3.5
tictoc::toc()
#> 0.24 sec elapsed
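The round columns above also lend themselves to conditional odds. The sketch below assumes each column is the percent chance of reaching that round, which the 100s in r64 suggest but which I have not confirmed against the documentation. The values are copied from the printed rows, and `e8_given_s16` is a name of my own.

```r
# Pre-tournament round odds (percent), copied from the printed output above.
odds <- data.frame(
  team = c("Gonzaga", "Kansas", "Arizona", "Baylor"),
  s16  = c(81.9, 73.7, 72.7, 72.5),
  e8   = c(69.6, 48.7, 37.3, 42.9)
)

# Chance of winning the Sweet 16 game, given that the team gets there:
# P(E8 | S16) = P(E8) / P(S16), since reaching the Elite 8 implies
# having reached the Sweet 16.
odds$e8_given_s16 <- odds$e8 / odds$s16

# Strongest conditional odds first.
odds[order(-odds$e8_given_s16), ]
```

Read this way, two teams with near-identical Sweet 16 odds can look quite different one round later, which mostly reflects who they are projected to face.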
With the ‘odds’ argument set to ‘pre,’ we returned pre-tournament odds
and sorted by likelihood to reach the second weekend (Sweet 16).
bart_tourney_odds also takes current odds (‘current’), odds based on
recent performance (‘recent’), and odds based on games against strong
opponents (‘t100’). This data is similarly available starting with the
2019 tournament. Now, what if we want to explore tournament results?
tictoc::tic()
toRvik::bart_tourney_results(min_year=2011, max_year=2021, type='conf') %>%
  utils::head(5)
#> # A tibble: 5 × 18
#> conf pake pase wins loss w_percent r64 r32 s16 e8 f4 f2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P12 11.2 11.4 55 38 0.591 38 27 18 8 2 0
#> 2 SEC 10.9 15.5 78 48 0.619 49 33 21 14 7 2
#> 3 MVC 4.1 6.1 19 15 0.559 15 11 4 2 2 0
#> 4 ACC 3.6 -0.3 102 61 0.626 64 44 31 15 5 4
#> 5 Horz 2.6 3 5 10 0.333 10 1 1 1 1 1
#> # … with 6 more variables: champ <dbl>, top2 <dbl>, f4_percent <dbl>,
#> # champ_percent <dbl>, from <dbl>, to <dbl>
tictoc::toc()
#> 0.5 sec elapsed
With bart_tourney_results, we can return raw and adjusted outcomes by
split. Here, we returned aggregate conference results from 2011 to 2021,
sorted by PAKE, the number of wins attained above or below KenPom
expectation. The function splits by team (‘team’), conference (‘conf’),
coach (‘coach’), and seed (‘seed’) and includes data starting in 2000.
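As a sanity check on the raw columns, the printed w_percent can be rebuilt from wins and loss, and PAKE can be put on a per-game footing so conferences with very different game counts are easier to compare. The values below are copied from the output above. `games`, `w_percent_check`, and `pake_per_game` are my own names, and the per-game scaling is just one illustrative choice.

```r
# Conference tournament results, copied from the printed output above.
results <- data.frame(
  conf = c("P12", "SEC", "MVC", "ACC", "Horz"),
  pake = c(11.2, 10.9, 4.1, 3.6, 2.6),
  wins = c(55, 78, 19, 102, 5),
  loss = c(38, 48, 15, 61, 10)
)

# Rebuild win percentage from the raw counts.
results$games <- results$wins + results$loss
results$w_percent_check <- round(results$wins / results$games, 3)

# Wins above expectation per tournament game played.
results$pake_per_game <- results$pake / results$games

# Highest per-game overperformance first.
results[order(-results$pake_per_game), ]
```

The rebuilt percentages match the w_percent column printed above to three decimals.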
toRvik includes several additional functions and capabilities that I did
not describe here; take time to explore them and those detailed in this
introduction. If you have any questions, feel free to message me on
Twitter. If you run into any bugs, please open an issue on GitHub.
Happy exploring!