Update on a Replacement for ExtraSkater.com

RIP

I miss it already.

I’ve had a bunch of conversations with people since writing a proposal for a new stat site to fill the void left by ExtraSkater.com. There are a lot of exciting developments, so I’ll just use this space to fill you in and solicit more input from you.

First, to all the folks who have left comments and sent emails: thanks! We’ve got a big pool of talented and passionate geeks who are eager to help out. I’ve not responded to anyone yet as I’ve been taking your notes under advisement and speaking to some folks who are working on similar (but not identical) projects. To everyone who is waiting on a reply from me: I should be getting in touch soon.

Second, there’s some serious competition out there. That’s a very good thing, and I’m using “competition” facetiously. I’ve been on a few email threads with stat geeks and developers who are well on their way to publishing new sites. These sites look primed to cover some of the core features of ExtraSkater and even provide a surprising amount of novel stuff. It’s early, but I’d be surprised if at least one of these sites isn’t up by October. It’s exciting, and the people behind these projects should and will be congratulated.

One result of that good news is that the appetite, and the pressing need, for the project I defined (a free, open-source solution for extensible hockey stats) will be much lower. But I don’t think it’s gone entirely.

I’m gonna reiterate my plan and explain why it’s a) still valid, b) novel, and c) kind of easy.

I can’t speak to the specifics of the sites that are in development. Not my place. They may yet be open source. If they’re not, that’s totally fine too.

The project I’m proposing would be fully free and open source. It is intended to empower as many analysts and geeks as possible to build their own extensions of the technology. It is intended to be immune to abandonment or being shut down should a creator move on or get hired or whatever. I think there’s a lot of value in that model even if there comes a new bloom of stat sites this fall. If you disagree, please let me know – I don’t want to waste anyone’s time with this.

Assuming you’re still with me, here’s an absurdly simplified explanation of how this site would work. At left you’ve got the NHL’s data, which would get scraped by the site and imported into a new database. That database is then made available to a friendly, simple, and extensible frontend so that the users can see it.

[analytics site diagram]

A little more detail on those steps. Feel free to skip.

  1. Scrape. Enterprising geeks have created and shared (under a GNU public license) a couple of options for getting the NHL’s data. These options are written in different languages: R and Java. They will have to be evaluated; one will be selected, and it may have to be adapted and automated. Eventually, this scraper will run multiple times per hour to get almost-realtime updates on games (there’s a rough sketch of what this might look like just after this list). The data gathered by the scraper will be written to the database, which leads me to…
  2. Storage. After discussion and advice from you nerds, Postgres seems like a better option for database storage. The summary tables, which will be the primary ones accessed by the API (and thus the user interface), can use more flexible formats like JSON where that’s convenient. These tables will have to be designed with an intimate understanding of the requirements.
  3. API. This is basically the brain of the system and arguably the most important piece. The API directly answers most of the requirements of the site (e.g. “give me a report on game X” and “give me possession stats for all forwards in 2012-13 who played more than 300 minutes”). The API will be accessible primarily by the frontend user interface, but also by any other service anyone wants to use. Because this system would be FOSS, someone could hit the API to drive a mobile app, a FanGraphs-type site, an IFTTT (If This Then That) script, or a totally customized frontend web page. I’m totally neutral on what language or framework would be best for this (Django? Ruby?), though you’ve all thoroughly convinced me that PHP is not the best choice.
  4. Frontend. Because the API is going to be doing most of the hard work, the actual user interface can be very lightweight. Hypothetically, any developer on the project could write a new page to visualize the data provided by the API (or write a new API method as well). On the suggestion of some developers, these pages could be light HTML with JavaScript, making them scalable. However, the ability to deeplink (i.e. share a link to a page with filters pre-selected using just the URL) was a crucial feature of ExtraSkater and one we’d almost certainly need.
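
Here’s the rough sketch of step 1 I promised above: a polite fetch loop, written in Python purely for illustration. The real scraper would more likely be the adapted R or Java tool, and the report URL pattern and game IDs below are my assumptions about the NHL’s public HTML reports, not anything this project has settled on.

```python
# Illustrative only: fetch the NHL's HTML play-by-play reports for a few games.
# The URL pattern and game IDs are assumptions, and the parsing step (the hard
# part the existing R/Java scrapers already handle) is left out entirely.
import time
import requests

REPORT_URL = "http://www.nhl.com/scores/htmlreports/{season}/PL{game_id}.HTM"

def fetch_play_by_play(season, game_id):
    """Download the raw play-by-play report HTML for one game."""
    url = REPORT_URL.format(season=season, game_id=game_id)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def scrape_games(season, game_ids, delay_seconds=10):
    """Walk a list of games, pausing between requests so we don't hammer nhl.com."""
    for game_id in game_ids:
        yield game_id, fetch_play_by_play(season, game_id)
        time.sleep(delay_seconds)

if __name__ == "__main__":
    # "20132014" is the season folder; "020001" would be a regular-season game.
    for game_id, html in scrape_games("20132014", ["020001", "020002"]):
        print(game_id, len(html), "characters of report HTML")
```

The interesting work is parsing those reports into structured events and writing them to the database, which brings us to step 2.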

So, like I said, there are a lot of shortcuts to accomplish #1, though they need to be evaluated. Items #2 and #3 are the core of the project, the guts and the brains respectively. This is where most of the development will happen.
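
To make the storage half of that a little more concrete, here’s a hedged sketch of what a raw-events table and a summary table might look like, using Python with psycopg2 against Postgres. Every table, column, and connection detail here is invented for illustration; the real schema is exactly the design work described above.

```python
# Hypothetical storage sketch: one raw-events table plus one summary table
# that the API would read. Names and connection details are placeholders.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=hockeystats user=hockey")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS game_events (
        game_id       text    NOT NULL,
        event_index   integer NOT NULL,
        event         jsonb   NOT NULL,   -- one parsed play-by-play row
        PRIMARY KEY (game_id, event_index)
    );
    CREATE TABLE IF NOT EXISTS skater_game_summary (
        game_id       text    NOT NULL,
        player_id     text    NOT NULL,
        toi_seconds   integer NOT NULL,
        corsi_for     integer NOT NULL,
        corsi_against integer NOT NULL,
        PRIMARY KEY (game_id, player_id)
    );
""")

# Events would come out of the scraper as plain dicts and get stored as JSONB.
event = {"period": 1, "time": "00:42", "type": "SHOT", "team": "WSH"}
cur.execute(
    "INSERT INTO game_events (game_id, event_index, event) VALUES (%s, %s, %s)",
    ("020001", 1, Json(event)),
)

conn.commit()
cur.close()
conn.close()
```

The JSONB column keeps the raw play-by-play flexible, while the summary table stays flat and fast for the API to query.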

Item #4, the actual user interface, doesn’t necessarily have to be part of the core product. It should certainly be developed to validate the API and demonstrate the functionality, but I think we want to encourage others to build upon the technology and create their own tables and visualizations of the data. If anyone could check out the code from GitHub and then build their own site with their own revenue source, then anyone would have a profit motive to improve the technology and the knowledge base in general.

Update 12:10 PM: The “base” site would not be used to generate revenue through ads. But because the pages, API, and underlying tech would be free/open-source, anyone could create a revenue-generating site derived from the project. This isn’t any different from the scraper solutions already out there – it’s just more convenient.

Like this:

[analytics site diagram]

Because building a default UI is fairly simple (given the functionality of this hypothetical API, the platform technology afforded by stuff like Twitter Bootstrap, and the number of developers excited to work on it), I think it should still be included, but the magic of the project would be in the API.
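
To illustrate that magic, here’s a purely hypothetical consumer of the API: the “possession stats for all forwards in 2012-13 who played more than 300 minutes” question from item #3, asked by a script instead of a web page. The host, endpoint, and parameter names don’t exist anywhere; they’re placeholders for whatever the project actually settles on.

```python
# Hypothetical third-party consumer of the (not yet existing) stats API.
# The base URL, endpoint, and parameter names are all placeholders.
import requests

API_BASE = "http://api.example-hockey-stats.org/v1"

def forward_possession(season, minimum_toi_minutes):
    """Ask the API for possession stats on forwards above a minutes threshold."""
    response = requests.get(
        API_BASE + "/skaters",
        params={
            "season": season,
            "position": "F",
            "min_toi": minimum_toi_minutes,
            "metrics": "corsi,fenwick",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for skater in forward_possession("2012-13", 300):
        print(skater)
```

The same call could just as easily feed a mobile app, an IFTTT recipe, or somebody else’s site.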

At this point, I think it would be inappropriate to raise money to develop this site. There are a number of solutions on the way that reduce the value of the site as a mere replacement for ExtraSkater.com (which, again, is a good thing). Plus, the number of developers who have approached me offering to volunteer their efforts makes me think we could do the majority of this project gratis. I could be wrong about that.

While hosting the source code on GitHub is free, funds will be needed to run the actual site once it launches. We can deal with that problem later.

So here’s what’s next.

If you’re interested in volunteering for the project and haven’t gotten in contact yet, please email me (hassettpm@gmail.com) with the following information:

  • your name and contact info
  • the module(s) you’re interested in
    • scraper and data (1 and 2)
    • API (3)
    • frontend (4)
    • hosting
  • your experience (years) and knowledge (preferably R, Java, Django, Ruby, HTML/JS, as well as integrated tech like Google Charts API and Bootstrap)

I’ll get in touch with each of you, and then we’ll work together to set core requirements and make a preliminary plan. We’ll probably collaborate using Google Drive and GitHub, though I’m very open to suggestions. Since I don’t work in Postgres or either of the likely programming languages, we’ll need folks to take the lead on those.

Although I don’t think crowdfunding is necessary, people who are interested in making a buck from this project are still in luck. Once the project is built, you will be welcome and encouraged to create your own site on top of the tech using the API. That’s kind of the point, and there’s a lot of opportunity for exciting stuff to happen with this.

So, what do you think? Email me if you’re in. Comment if you have other thoughts.

Update 12:36 PM: @robbtuftshockey shared a scary story about getting IP blocked by the NHL after making a long string of uninterrupted requests for data. It might be good manners (and good project planning) to eliminate in-game updates from the feature set and limit data requests to a much lower frequency, like daily.
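
For what it’s worth, here’s a sketch of what that gentler schedule might look like. The delay and cap are guesses, not recommendations, and the whole thing would run once a day from something like cron rather than multiple times per hour.

```python
# Illustrative throttling for a once-daily scrape. The numbers are guesses.
import time

REQUEST_DELAY_SECONDS = 30   # pause between individual report fetches
MAX_REQUESTS_PER_RUN = 200   # stop early rather than risk an IP block

def daily_scrape(pending_game_ids, fetch):
    """Fetch at most MAX_REQUESTS_PER_RUN games, pausing between each one.

    `fetch` is whatever function actually downloads one game's reports
    (e.g. the fetch_play_by_play sketch earlier in the post).
    """
    for count, game_id in enumerate(pending_game_ids, start=1):
        if count > MAX_REQUESTS_PER_RUN:
            break
        fetch(game_id)
        time.sleep(REQUEST_DELAY_SECONDS)
```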

 
  • Miguel Mora

    I think the database should probably try to go the ‘big data’ route instead of a traditional database like PostgreSQL. Over time, performance can be an issue as the data grows: http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/

  • Dustin Sier

    This is exceptionally exciting and so very cool to see the stat community rallying around the need.

  • Rotatingearth

    “rallying around the” nerd.

    there, I fixed it for you.

  • John M

    It seemed like a number of the readers here were interested in donating; even without a full-scale crowdsourcing campaign, surely you could use at least a little money (if nothing else to throw the launch party for the RMNB membership) and it would give a way for your non-programmer supporters to contribute in some way?

  • http://www.russianmachineneverbreaks.com/ Peter Hassett

    Definitely, especially for stuff like hosting the beta site. But I don’t wanna raise money without a plan to spend it and I don’t wanna convert people’s general enthusiasm into cash without a clear idea of what the return would be.

  • Rob H

    By having a dedicated host site that overtly scrapes data, is there any concern about the terms of service change? Are there plans for IP spoofing and whatnot to prevent getting blocked by the NHL? I know sites like yelp track IPs very closely and block offenders, but I’m not sure how sophisticated Gary et al will be about it.

    As for non-relational data stores, I have experience in that area and would gladly help out. Although, at 250 plays per game and 1,230 games per season, it’s only a smidge over 30 million plays across 100 years. That would fit into a relational database with no problem. Seems like a Spark/Hadoop-on-AWS setup would be overkill.

  • Shaun Phillips

    I think the NHL put out a clarification that specified sites like ExtraSkater were not the target of that ToS change. Have a feeling it was more directed at video feeds.

  • http://www.russianmachineneverbreaks.com/ Peter Hassett

    I addressed the TOS stuff in my first post on the topic: http://www.russianmachineneverbreaks.com/2014/08/19/extraskater-is-dead-lets-build-a-new-stat-site/

    I don’t think it’s a factor. And I would discourage doing anything sneaky to get access to the data. Once the data is scraped, a centralized site could provide access to it via the API.

  • GrnEggsNHam

    https://www.drupal.org/ + http://www.sugarcrm.com/ = the world is your oyster :)

    If your project doesn’t get off the ground within the next 3 months I will create my own for revenue, using Ukrainian developers. I can handle the database and scraping and they would create the front-end with Drupal. Oh and I would use Salesforce instead of Sugar but wanted to offer you guys the open source platform.
    Only reason I can’t do it now is because weddings cost too much fucking money…

  • Josh Willison

    so you essentially want to take NHL data and create an open source API around it? If the NHL data is always there, is a DB necessary? If a DB is necessary for historical data… how do you plan to obtain it in the first place?

  • Rob H

    Thanks for the clarification. The ToS had me uneasy. I had actually started on a distributed app so that people would have a UI from which they could scrape, build their own DB locally and visualize the data from their machine thereby eliminating the risk of a single blocked IP or centralized site closing up. A community maintained mega source is obviously preferable.

  • Taylor

    Maybe someone could come up with a scraper client for users? Just a small program that spreads out the requests to nhl.com among all the people running the client.

  • https://github.com/ThrowsException Chester O’Neill

    You get the data by scraping the website, extracting it from the raw HTML. The DB is there because you only want to scrape the data once and store it. Web scraping can be slow, and it’s also not good to keep hammering a site that you’re scraping from. That’s frowned upon.