Last time I scraped the odds and winner information from each game on Odds Portal. But I’ll need a lot more data to discover any long-term upward trends. In this article I show how I used Beautiful Soup 4 in addition to Selenium to scrape team and date information from each game. Then I validate my data by visualizing very simple patterns.
All the code used to write this article can be found on my GitHub.
First I used Selenium to find the element with class="min-h-[28px]". This class marks where all the game info is on the page. It is a brittle scraping strategy because that minimum height might change. The div nesting structure I’m about to describe is also brittle for the same reason: we can’t control when Odds Portal will decide to roll out an update to these pages.
Ignoring that, I got all the HTML inside that Selenium element, then switched over to Beautiful Soup 4 and read the current div via its tag name. This div’s children were the game rows with the information I needed.
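Here is a minimal sketch of the Beautiful Soup half of that hand-off. It assumes the HTML string has already been pulled out of the Selenium element (for instance through its innerHTML attribute). The sample markup and the game_rows helper are made up for illustration; the real Odds Portal nesting is deeper and will differ.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML taken from the Selenium element.
# This markup is invented; the real page structure is more deeply nested.
SAMPLE_HTML = """
<div>
  <div>Houston Texans | Baltimore Ravens</div>
  <div>Pittsburgh Steelers | Kansas City Chiefs</div>
</div>
"""


def game_rows(inner_html):
    """Return the direct child divs of the first div in the HTML."""
    soup = BeautifulSoup(inner_html, "html.parser")
    container = soup.find("div")  # read the current div by tag name
    # recursive=False keeps only direct children, i.e. one tag per game row
    return container.find_all("div", recursive=False)


rows = game_rows(SAMPLE_HTML)
texts = [row.get_text(strip=True) for row in rows]
```

Restricting the search to direct children avoids picking up divs nested inside each row, which matters once the rows themselves contain more markup.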
After an entire loop of BS4 and Selenium running one after the other, I got this raw data:
Away Team,Home Team,Away Odds,Home Odds,Winner,Date
Houston Texans,Baltimore Ravens,+252,-323,1,Yesterday, 25 Dec
Pittsburgh Steelers,Kansas City Chiefs,+108,-128,1,_
Green Bay Packers,New Orleans Saints,-1111,+716,0,23 Dec 2024
Dallas Cowboys,Tampa Bay Buccaneers,+188,-227,0,22 Dec 2024
Buffalo Bills,New England Patriots,-1111,+699,0,_
Las Vegas Raiders,Jacksonville Jaguars,-133,+114,0,_
Miami Dolphins,San Francisco 49ers,+106,-125,0,_
Seattle Seahawks,Minnesota Vikings,+116,-137,1,_
Atlanta Falcons,New York Giants,-476,+363,0,_
Carolina Panthers,Arizona Cardinals,+206,-250,0,_
Chicago Bears,Detroit Lions,+256,-323,1,_
Cincinnati Bengals,Cleveland Browns,-526,+400,0,_
Indianapolis Colts,Tennessee Titans,-204,+171,0,_
New York Jets,Los Angeles Rams,+136,-159,1,_
Washington Commanders,Philadelphia Eagles,+181,-222,0,_
Baltimore Ravens,Pittsburgh Steelers,-357,+281,0,21 Dec 2024
...
The script isn’t perfect. Sometimes Odds Portal will time out or return an empty div, and occasionally the page will load but the games on it will have been cancelled, which interrupts the loop. If anything goes wrong on a season’s run it is much better to fail there and have a log to look back on.
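The fail-fast idea can be sketched roughly like this. The scrape_all function and the scrape_season callable it receives are hypothetical names, not the ones in my script; the point is only that the first bad season stops the run and leaves a log entry behind.

```python
import logging

logger = logging.getLogger("scrape")


def scrape_all(seasons, scrape_season):
    """Scrape seasons in order, stopping at the first failure.

    scrape_season is a hypothetical callable that returns a list of rows.
    """
    summary = []
    for season in seasons:
        logger.info("starting season %s", season)
        try:
            rows = scrape_season(season)
        except Exception:
            # Record the traceback, then let the error end the run.
            logger.exception("season %s failed", season)
            raise
        if not rows:
            # An empty div ends up here: nothing was scraped.
            logger.error("season %s returned no rows", season)
            raise RuntimeError(f"no rows for season {season}")
        summary.append((season, len(rows)))
    return summary
```

Re-raising instead of skipping means a half-scraped season never gets merged into the data by accident.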
I scraped each season one at a time and it took around three hours, I think, all in all. Once the data was available locally, I wrote some shell scripts to produce a single input file for the experiment script. They merged the CSVs together and carried each date down the Date column.
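I did this step with shell scripts, but the same idea can be sketched in pandas. The sketch assumes the _ placeholder means "same date as the row above", and it uses in-memory strings (rows copied from the sample above) in place of the real per-season files. The load_file and merge_files names are made up for illustration.

```python
import io

import pandas as pd

# Stand-ins for two scraped CSV files; rows copied from the raw sample.
FILE_A = """Away Team,Home Team,Away Odds,Home Odds,Winner,Date
Green Bay Packers,New Orleans Saints,-1111,+716,0,23 Dec 2024
Dallas Cowboys,Tampa Bay Buccaneers,+188,-227,0,22 Dec 2024
Buffalo Bills,New England Patriots,-1111,+699,0,_
"""
FILE_B = """Away Team,Home Team,Away Odds,Home Odds,Winner,Date
Baltimore Ravens,Pittsburgh Steelers,-357,+281,0,21 Dec 2024
"""


def load_file(csv_text):
    """Read one CSV and fill each '_' date with the date above it."""
    df = pd.read_csv(io.StringIO(csv_text), na_values={"Date": ["_"]})
    df["Date"] = df["Date"].ffill()  # assumption: '_' repeats the previous date
    return df


def merge_files(csv_texts):
    """Fill dates per file first, then stack everything into one frame."""
    return pd.concat([load_file(t) for t in csv_texts], ignore_index=True)


merged = merge_files([FILE_A, FILE_B])
```

Filling within each file before concatenating keeps a date from one file from leaking into the top of the next. Rows like "Yesterday, 25 Dec" contain an extra comma and would need separate cleanup before this works on the full data.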
Some teams go to the playoffs and some don’t, but the distribution of the number of bets per team should even out, especially over eighteen seasons. 1
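A quick way to check that distribution, sketched under the assumption that the merged data has the Away Team and Home Team columns from the raw sample. The games_per_team helper is a made-up name, and the three rows are taken from the sample above.

```python
import pandas as pd

games = pd.DataFrame(
    {
        "Away Team": ["Houston Texans", "Pittsburgh Steelers", "Baltimore Ravens"],
        "Home Team": ["Baltimore Ravens", "Kansas City Chiefs", "Pittsburgh Steelers"],
    }
)


def games_per_team(df):
    """Count how many games each team appears in, home or away."""
    both_sides = pd.concat([df["Away Team"], df["Home Team"]])
    return both_sides.value_counts()


counts = games_per_team(games)
```

If one team’s count is far from the rest over many seasons, that usually points to a scraping gap rather than a scheduling quirk.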
At first, the dates were all squished together when I graphed over time. 1
I can explicitly read in the Date column as a Date type 1 2, but even when I do that, the graph is still messed up. Why?
Then I took a look at the screenshot of Odds Portal from above… and noticed each page displays the games in descending order, from most recent to least. Sorting the pandas DataFrame by date got me the view I expected.
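The two fixes together look roughly like this. It is a sketch that assumes dates in the "23 Dec 2024" form from the raw sample and converts them with pd.to_datetime after loading, which may not be exactly how my script reads them in.

```python
import io

import pandas as pd

# Rows in scraped order: most recent game first.
CSV = """Away Team,Home Team,Date
Green Bay Packers,New Orleans Saints,23 Dec 2024
Dallas Cowboys,Tampa Bay Buccaneers,22 Dec 2024
Baltimore Ravens,Pittsburgh Steelers,21 Dec 2024
"""

df = pd.read_csv(io.StringIO(CSV))

# Parse the strings into real datetimes so the axis is spaced by time.
df["Date"] = pd.to_datetime(df["Date"], format="%d %b %Y")

# Oldest game first; a stable sort keeps same-day games in their scraped order.
df = df.sort_values("Date", kind="stable").reset_index(drop=True)
```

Without the sort, a running bankroll computed row by row would accumulate backwards in time even though every date parses correctly.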
The plateaus must be the off-seasons, when the bankroll stays static. Because the column is typed as a date, Matplotlib spaces the axis by calendar time and fills in the dates that aren’t actually in the data.
Here is what all the teams graphed together over all time looks like.
The only team to ever be up against the books even for a small amount of time was the Los Angeles Rams.
What happens if we specify date ranges to zoom in on some of the rare seasons when the books lost money to a team? 1. For example, in 2014 the Panthers bettor had some fleeting wins.
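Zooming in on a season can be sketched as a boolean mask over the Date column. The season_window helper, the date bounds, and the bankroll numbers below are all placeholders for illustration, not values from my data.

```python
import pandas as pd

# Placeholder history for one team; the bankroll values are invented.
history = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-01-05", "2014-09-07", "2014-12-21", "2015-09-13"]
        ),
        "Bankroll": [100.0, 110.0, 95.0, 90.0],
    }
)


def season_window(df, start, end):
    """Keep rows with start <= Date < end."""
    mask = (df["Date"] >= pd.Timestamp(start)) & (df["Date"] < pd.Timestamp(end))
    return df.loc[mask]


# Assumed bounds for a 2014-2015 season; adjust to the real schedule.
season_2014 = season_window(history, "2014-09-01", "2015-03-01")
```

A half-open range (start included, end excluded) lets consecutive seasons share a boundary date without counting any game twice.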
And, notably, Texans fans had an all-time season in 2021-2022.
There are many ways of defining “local maxima” on data like this, but for now we can search for back-to-back winners. These are games where the bettor, provided they bet solely on this team all season, wins and makes money.
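One possible reading of "back-to-back" is a win that immediately follows another win by the same team. The sketch below takes that reading as an assumption. The back_to_back helper is a made-up name, and its input is a hypothetical list of per-game results for one team in date order.

```python
import pandas as pd


def back_to_back(won):
    """Flag each win that directly follows another win.

    won: per-game booleans for a single team, oldest game first.
    """
    won = pd.Series(won, dtype=bool)
    # shift(1) lines each game up with the previous one; the first game
    # has no predecessor, so it is filled with False.
    return won & won.shift(1, fill_value=False)


flags = back_to_back([True, True, False, True, True, True])
```

In a three-game winning streak, the second and third games are both flagged, so the count is of streak games after the first rather than of streaks.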
Intuitively one senses there are not very many games like this. So we need to ditch the lines in our plot. Then, via numpy arrays and masks, we can color our bankroll green or red depending on whether we are making or losing money 1 2 3.
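A rough sketch of the masking idea. It assumes "making money" means the bankroll rose compared with the previous game, which is only one way to define it. The function names are made up, and games where the bankroll did not change are simply left unplotted.

```python
import numpy as np


def direction_masks(bankroll):
    """Return (up, down) boolean masks from game-to-game bankroll changes."""
    bankroll = np.asarray(bankroll, dtype=float)
    # prepend the first value so the result has one entry per game
    change = np.diff(bankroll, prepend=bankroll[0])
    return change > 0, change < 0


def plot_points(dates, bankroll):
    """Scatter the bankroll with no connecting lines, green up and red down."""
    # Imported here so the mask logic above works without a plotting backend.
    import matplotlib.pyplot as plt

    dates = np.asarray(dates)
    bankroll = np.asarray(bankroll, dtype=float)
    up, down = direction_masks(bankroll)

    fig, ax = plt.subplots()
    ax.scatter(dates[up], bankroll[up], color="green", s=12)
    ax.scatter(dates[down], bankroll[down], color="red", s=12)
    return fig
```

Indexing both arrays with the same mask keeps each point’s date and bankroll paired, which is why the inputs are converted to numpy arrays first.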
Applying this to all teams over all seasons is too much for matplotlib, but for a single season we can show all back-to-back wins for any given team.
Thinking that you can beat the house is a designed mistake. It would be hard to find a bettor who has an upwardly trending bankroll history. But clearly it is possible to beat the books, as many bettors publicly state that advanced computer models are part of their successful careers in sports gambling 1 2. I am learning what these methods are through my own research.