I miss it already.
I’ve had a bunch of conversations with people since writing a proposal for a new stat site to fill the void left by ExtraSkater.com. There are a lot of exciting developments, so I’ll just use this space to fill you in and solicit more input from you.
First, to all the folks who have left comments and sent emails: thanks! We’ve got a big pool of talented and passionate geeks who are eager to help out. I’ve not responded to anyone yet as I’ve been taking your notes under advisement and speaking to some folks who are working on similar (but not identical) projects. To everyone who is waiting on a reply from me: I should be getting in touch soon.
Second, there’s some serious competition out there. That’s a very good thing, and I’m using “competition” facetiously. I’ve been on a few email threads with stat geeks and developers who are well on their way to publishing new sites. These sites look primed to match the core features of Extra Skater and even provide a surprising amount of novel stuff. It’s early, but I’d be surprised if at least one of these sites isn’t up by October. It’s exciting, and the people behind these projects should and will be congratulated.
One result of that good news is that the appetite (and pressing need) for the project I defined, a free, open-source solution for extensible hockey stats, will be much lower. But I don’t think it’s gone entirely.
I’m gonna reiterate my plan and explain why it’s a) still valid, b) novel, and c) kind of easy.
I can’t speak to the specifics of the sites that are in development. Not my place. They may yet be open source. If they’re not, that’s totally fine too.
The project I’m proposing would be fully free and open source. It is intended to empower as many analysts and geeks as possible to build their own extensions of the technology, and to be immune to abandonment or shutdown should a creator move on or get hired or whatever. I think there’s a lot of value in that model even if a new bloom of stat sites arrives this fall. If you disagree, please let me know; I don’t want to waste anyone’s time with this.
Assuming you’re still with me, here’s an absurdly simplified explanation of how this site would work. The NHL’s data gets scraped by the site and imported into a new database. That database is then made available to a friendly, simple, and extensible frontend so that users can see it.
A little more detail on those steps. Feel free to skip.
1. Scrape. Enterprising geeks have created and shared (under a GNU public license) a couple of options for getting the NHL’s data. These options are written in different languages: R and Java. They will have to be evaluated; one will be selected, and it may need to be adapted and automated. Eventually, this scraper will run multiple times per hour to get almost-realtime updates on games. The data gathered by the scraper will be written to the database, which leads me to…
2. Storage. After discussion and advice from you nerds, Postgres seems like the better option for database storage. The summary tables, which will be the primary ones accessed by the API (and thus the user interface), can use more flexible data formats like JSON where that’s convenient. These tables will have to be designed with an intimate understanding of the requirements.
3. API. This is basically the brain of the system and arguably the most important piece. The API directly answers most of the requirements of the site (e.g. “give me a report on game X” and “give me possession stats for all forwards in 2012-13 who played more than 300 minutes”). It will be accessed primarily by the frontend user interface, but also by any other service anyone wants to build. Because this system would be FOSS, someone could hit the API to drive a mobile app, a FanGraphs-type site, an IFTTT (if this, then that) script, or a totally customized frontend web page. I’m totally neutral on what language would be best for this (Python with Django? Ruby?), though you’ve all thoroughly convinced me that PHP is not the best choice.
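To make the scrape step concrete, here’s a minimal sketch. The report URL pattern below reflects my understanding of the NHL’s public HTML game reports, and the fetch-with-a-delay policy is just one idea for being a polite guest; treat both as assumptions, not decisions.

```python
# Sketch of the scrape step: build report URLs and fetch them politely.
# The URL pattern is an assumption about the NHL's HTML report scheme.
import time
import urllib.request

BASE = "http://www.nhl.com/scores/htmlreports"

def report_url(season: str, game_num: int, report: str = "PL") -> str:
    """Build a report URL, e.g. play-by-play (PL) for one game of a season."""
    return f"{BASE}/{season}/{report}{game_num:06d}.HTM"

def fetch(url: str, delay: float = 2.0) -> bytes:
    """Fetch one report, sleeping first so we don't hammer the NHL's servers."""
    time.sleep(delay)
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

Whichever existing scraper we adopt, the automation wrapper would look something like this: enumerate game numbers, build URLs, fetch with a delay, and hand the parsed results to the database layer.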
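For the storage step, here’s a sketch of what one summary table might look like. In production this would be Postgres (with a JSONB column for the flexible stats blob); sqlite3 stands in here purely for illustration, and every table name, column name, and value is made up rather than a real schema.

```python
# Sketch of a summary table. sqlite3 stands in for Postgres; the JSON text
# column would be JSONB in Postgres. All names and values are illustrative.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE player_season_summary (
        player_id  INTEGER,
        season     TEXT,
        position   TEXT,
        toi_min    REAL,      -- time on ice, in minutes
        stats      TEXT       -- JSON blob of derived stats
    )
""")
conn.execute(
    "INSERT INTO player_season_summary VALUES (?, ?, ?, ?, ?)",
    (12345, "2012-13", "F", 536.0, json.dumps({"cf_pct": 55.1})),
)
row = conn.execute(
    "SELECT stats FROM player_season_summary WHERE player_id = 12345"
).fetchone()
print(json.loads(row[0])["cf_pct"])  # → 55.1
```

The point of the JSON column is that new derived stats can be added without a schema migration, while the hot columns (season, position, time on ice) stay indexed and queryable.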
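And for the API, here’s the “possession stats for forwards over 300 minutes” example from above, reduced to a plain function over sample rows. The real API would run this as a database query behind an HTTP endpoint; the field names here are hypothetical.

```python
# Sketch of one API query: forwards in a season with more than a minimum
# time on ice. Field names are hypothetical, not a settled data model.
def forwards_over_min_toi(players, season="2012-13", min_toi=300):
    """Filter player rows down to forwards in `season` above `min_toi` minutes."""
    return [
        p for p in players
        if p["season"] == season
        and p["position"] == "F"
        and p["toi_min"] > min_toi
    ]

sample = [
    {"name": "A", "season": "2012-13", "position": "F", "toi_min": 536.0},
    {"name": "B", "season": "2012-13", "position": "D", "toi_min": 610.0},
    {"name": "C", "season": "2012-13", "position": "F", "toi_min": 120.0},
]
print([p["name"] for p in forwards_over_min_toi(sample)])  # → ['A']
```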
So, like I said, there are a lot of shortcuts to accomplish #1, though they need to be evaluated. Items #2 and #3 are the core of the project, the guts and the brains respectively. This is where most of the development will happen.
Item #4, the actual user interface, doesn’t necessarily have to be part of the core product. It should certainly be developed to validate the API and demonstrate the functionality, but I think we want to encourage others to build upon the technology and create their own tables and visualizations of the data. If anyone can check out the code from GitHub and build their own site with their own revenue source, then everyone has a profit motive to improve the technology and the knowledge base in general.
Update 12:10 PM: The “base” site would not be used to generate revenue through ads. But because the pages, API, and underlying tech would be free/open-source, anyone could create a revenue-generating site derived from the project. This isn’t any different from the scraper solutions already out there– it’s just more convenient.
Because building a default UI is fairly simple given the functionality of this hypothetical API, the platform technology afforded by stuff like Twitter Bootstrap, and the number of developers excited to work on it, I think it should still be included. But the magic of the project would be in the API.
At this point, I think it would be inappropriate to raise money to develop this site. A number of forthcoming solutions reduce its value as a mere replacement for ExtraSkater.com (which, again, is a good thing). Plus, the number of developers who have approached me offering to volunteer their efforts encourages me that we could do the majority of this project gratis. I could be wrong about that.
While hosting of the source code on GitHub is free, funds will be needed for the actual site once launched. We can deal with that problem later.
So here’s what’s next. If you’re interested in helping, email me with:
- your name and contact info
- the module(s) you’re interested in
  - scraper and data (1 and 2)
  - API (3)
  - frontend (4)
- your experience (years) and knowledge (preferably R, Java, Django, Ruby, HTML/JS, as well as integrated tech like Google Charts API and Bootstrap)
I’ll get in touch with each of you, and then we’ll work together to set core requirements and make a preliminary plan. We’ll probably collaborate using Google Drive and GitHub, though I’m very open to suggestions. Since I don’t work in Postgres or either of the likely programming languages, we’ll need folks to take the lead on those.
Although I don’t think crowdfunding is necessary, people who are interested in making a buck from this project are still in luck. Once the project is built, you will be welcome and encouraged to create your own site on top of the tech using the API. That’s kind of the point, and there’s a lot of opportunity for exciting stuff to happen with this.
So, what do you think? Email me if you’re in. Comment if you have other thoughts.
Update 12:36 PM: @robbtuftshockey shared a scary story about getting IP blocked by the NHL after doing a long string of uninterrupted requests for data. It might be good manners (and good project planning) to eliminate in-game updates from the feature set and limit data requests to a much lower frequency, like daily.
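Even with daily batches, the scraper should space out its requests. Here’s a minimal politeness-throttle sketch; the one-request-every-N-seconds policy is my assumption about what counts as polite, not a known NHL limit.

```python
# Sketch of a politeness throttle: enforce a minimum interval between
# successive requests so a batch job can't fire an uninterrupted burst.
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""
    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep if the last request was too recent; return seconds slept."""
        gap = time.monotonic() - self._last
        pause = max(0.0, self.min_interval - gap)
        if pause:
            time.sleep(pause)
        self._last = time.monotonic()
        return pause
```

A batch job would call `wait()` before each fetch; back-to-back calls get delayed automatically, while naturally spaced ones pass through untouched.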