
This is not new. Sports betting has been legal in most parts of the world for some time. Football data in Europe is so expensive that very good businesses like Opta can make a very nice living from it. Go get a quote from them for Premier League games alone: thousands a year for only some of the data.

For years, people getting into sports analytics have done so via baseball (and the Sabermetrics community) and the NBA, because the data there has not been seen as having commercial utility. It's been collected by fans.

That will change dramatically, but it should be resisted. Leagues and players should embrace open data because in the long term it will lead to analysis that helps them and, more importantly, foster a deeper interest in their game, which makes their own careers more valuable.



> Football data in Europe is so expensive

Depending on which data you need, there are already some good sources of free football data.[1][2][3]

Someone has also conveniently wrapped much of this in an R library.[4]

Football is actually one of the better sports in terms of easily obtainable data at no cost. Extensive datasets are much more difficult to find for rugby, although there are some interesting attempts.[5]

Decent cricket data also exists in a few places[6], but generally requires faster and more regular updating. However, there are R libraries for cricket data too.[7] This one scrapes from the ESPN Cricinfo site.

It is possible to obtain horse racing data for the UK and Ireland at a reasonable price for personal use.[8] Hong Kong does a great job of making a huge volume of horse racing data available at no cost, but not in a particularly machine-usable format (extensive scraping required). Sadly, other large racing jurisdictions such as Australia and the US don't have anything free, or even reasonably priced, as far as I'm aware. Ray Paulick has covered this as a general problem for the sport for a few years now.[9]

[1] http://www.football-data.co.uk/data.php

[2] https://github.com/openfootball

[3] https://github.com/jokecamp/FootballData

[4] https://github.com/dashee87/footballR

[5] http://api.drop22.net/

[6] https://cricsheet.org/

[7] https://github.com/tvganesh/cricketr

[8] https://www.betwise.co.uk/

[9] https://www.paulickreport.com/news/the-biz/gardner-horse-rac...


I would argue that almost all of the information in your post is stats, not data.

The type of data that people in this thread are talking about would be more in line with detailed positional information about each of the players on a football pitch over 90 minutes. In a cricket context, it would be more along the lines of the exact release angle and speed for each of the bowlers.

This type of information is clearly available, as Michael Caley is able to quickly generate xG maps for an entire game[1], but I do not believe it's public.

Your [9] link points out that much more information is available to baseball bettors, but even baseball has a significant walled garden in terms of data. For example, the raw data used to generate the stats in [2] is not open to the public.

[1] https://twitter.com/caley_graphics

[2] https://www.youtube.com/watch?v=tzPKlQXo6hk


You make a good point and my post requires clarification.

My links were all to post-event data, not live in-play data sources. I still wouldn't call, for example in a cricket match, the number of wickets taken by a bowler a stat. It's just data. A stat is derived from the data, for example bowling strike rate or economy. Or take the fact that a trainer had a winner at a certain race track: that's just the post-event data. If you want to derive further statistics, you have to calculate them yourself.[1]

The links above just have, for the most part, raw event data.

[1] https://blog.betwise.net/2018/06/19/loops-with-r-creating-a-...


The number of wickets taken is a stat. The raw data that informs it is the collective set of all balls bowled by a bowler.
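To make the distinction concrete, here is a minimal sketch of deriving wickets, strike rate and economy from a ball-by-ball record. The `Ball` fields and the `bowling_stats` helper are hypothetical names made up for illustration, not the schema of any dataset linked in this thread, and the sketch assumes every recorded wicket is credited to the bowler and that all runs in the record count against the bowler.

```python
from dataclasses import dataclass

@dataclass
class Ball:
    # Hypothetical ball-by-ball record; field names are assumptions.
    bowler: str
    runs_conceded: int      # runs charged to the bowler on this ball
    is_wicket: bool         # wicket credited to the bowler (no run-outs)
    is_legal: bool = True   # wides and no-balls are not legal deliveries

def bowling_stats(balls, bowler):
    """Derive summary stats for one bowler from raw ball-by-ball data."""
    own = [b for b in balls if b.bowler == bowler]
    legal = sum(1 for b in own if b.is_legal)
    wickets = sum(1 for b in own if b.is_wicket)
    runs = sum(b.runs_conceded for b in own)
    return {
        "wickets": wickets,
        # Strike rate: legal balls bowled per wicket taken.
        "strike_rate": legal / wickets if wickets else None,
        # Economy: runs conceded per six-ball over.
        "economy": runs / (legal / 6) if legal else None,
    }

# One made-up over: six legal balls, six runs, one wicket.
balls = [
    Ball("A", 0, False), Ball("A", 4, False), Ball("A", 0, True),
    Ball("A", 1, False), Ball("A", 0, False), Ball("A", 1, False),
]
stats = bowling_stats(balls, "A")
print(stats)
```

The summary numbers can always be rebuilt from the balls, but the balls cannot be rebuilt from the summary numbers, which is the asymmetry being argued about here.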

I'm not being needlessly pedantic; it's an important distinction when considering the level of analysis that one is able to perform. If you are doing major cricket analytics, you need ball-by-ball information, including as much information as possible about the bowler's position, movement and arm motion, the batter's position, movement and stroke, how the field is set up, the condition of the pitch, the situation in the match, etc.

For example, consider a situation where we're attempting to compare two bowlers. Bowler A may have got a wicket off a shot that 95% of batters would not play, whereas Bowler B did not get a wicket despite bowling a ball that achieves a wicket 10% of the time. The stats suggest that Bowler A is in better form, but a data-driven view of the game suggests that Bowler B is actually in better form.

As it stands, stats are available in abundance for every major sport, but detailed data is not. If a bettor had access to the latter, and they were able to parse it with an in-depth understanding of the sport, they'd be at a huge advantage over bettors who did not, and they would reap the benefits.
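A rough sketch of that comparison, in the spirit of expected goals: sum a per-ball wicket probability to get "expected wickets" and set it against what actually happened. The probabilities are assumptions for illustration only. The 10% figure comes from the example above, while reading "a shot that 95% of batters would not play" as a 5% wicket chance is my own simplification. A real model would have to estimate these values from ball-tracking data.

```python
def expected_wickets(wicket_probs):
    """Sum per-ball wicket probabilities (hypothetical model outputs)."""
    return sum(wicket_probs)

# One ball each, mirroring the example above; all values are illustrative.
bowler_a = {"probs": [0.05], "actual_wickets": 1}  # assumed 5% chance, got the wicket
bowler_b = {"probs": [0.10], "actual_wickets": 0}  # 10% chance, no wicket

for name, bowler in (("A", bowler_a), ("B", bowler_b)):
    xw = expected_wickets(bowler["probs"])
    print(f"Bowler {name}: actual={bowler['actual_wickets']}, expected={xw:.2f}")
```

On the wickets column Bowler A leads, but on expected wickets Bowler B does, and only the second view is computable if you have data on each individual ball.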


This is a good list but, as others have said, it is not the level of detail I'm talking about.

Take the NBA for example. Let's look at this: http://toddwschneider.com/posts/ballr-interactive-nba-shot-c... - this is able to give you super detailed analysis thanks to the NBA's stats API.

The equivalent from Opta is thousands a year per competition. I was fortunate enough to get to play with detailed Opta data and ChyronHego data as part of a Man City hack day a couple of years ago. The latter data simply isn't commercially available.

For cricket, you can do something interesting with ball-by-ball data, but ideally you want ball-tracking data. You want to know speed of release, length, speed and movement after the ball has pitched, and speed after interaction with the batsman, along with angles, etc. And that's just to get started. Ideally you want positional data on fielders, etc. too.

Don't get me wrong, this is a great starting set to get people interested, but there's a way to go before high-quality data is accessible to the hobbyist or academic researcher (although I believe Opta gives academics discounts to help make them "the" standard for clubs, etc.).


The cost of Opta data is nothing compared to RunningBall (also part of the Perform group).

RunningBall is all the real time data - you pretty much can't run an in-game book without it. It practically runs the in-game betting world.


I think it will also discourage the type of cheating that is almost certainly coming with our society’s new, open embrace of gambling.


This ... doesn't make a lot of sense. American athletes who fix matches will not be using American gambling sites that work heavily in tandem with the FBI.

It would be an extremely dumb cheater that only began to cheat because they could conveniently bet on a site in a state that has jurisdiction over them.


Insider trading happens all of the time, despite it being pretty easy for authorities to track down the participants when they care to do so.

You don’t even have to match fix to alter behavior in a way that has a financial return for things like fantasy sports, which are stat-based.


I'm not suggesting that match fixing (or point shaving or spot fixing) does not happen. There's a mountain of evidence that it does.

I just don't think there will be a massive increase because American sites allow gambling. Asian sites move billions of dollars a year in largely unregulated betting markets. Athletes who wanted to cheat could already do so in relative safety. They would be foolish to start cheating in a situation where they are both more likely to be caught and face worse consequences if they are.


Most people generally don't have great opsec and foresight when committing crimes. I guarantee you or I would get caught in an attempted scheme because of a single mistake.



