There's No "I" in Data

A blog about data + technology and sometimes other things…

The Algorithmically Perfect College Football Weekend

A few weeks ago I read a great blog by Randal Olson on how to compute the optimal road trip across the United States – it inspired me.

My friends and I have always wanted to take an optimal road trip of our own, not for the classic landmarks and sights across the US, but for College football.  Starting in mid September and continuing through late November, the NCAA has games that span Thursday through Saturday (and sometimes Tuesday) every weekend.  As a huge College football fan, wouldn't it be great to find the optimal route to catch three games on back-to-back-to-back nights?

The Plan

So here’s what you’d have to set out to do:

  • Get the full College football schedule
  • Understand where the games are located (coordinates)
  • Determine travel duration between the games
  • Find the route that has the lowest travel time and best games

This ends up being quite different from Randal's task of the optimal road trip across the United States because there are far fewer possible routes: you must start with a Thursday game, move to a Friday game, then finish on Saturday.  Because of this, the optimal route can be determined through brute force fairly easily, and no genetic algorithm is necessary (sadly).

So, before you get excited and scroll down to the results, let’s run through the process.  Every data scientist will tell you that “data wrangling” is where they spend 80% of their time – well, that’s true again in this exercise.

The Data

Let's start with step 1: the games.  There is no downloadable file of all the games with the football stadium addresses (until now), so I had to resort to web scraping.  I found a great Chrome plug-in called Web Scraper that saved me an amazing amount of time – check out the video here.  Using Web Scraper, I went to the fbsschedules website and hit each team's schedule page, which also happens to list the stadium name for each game.  The output of that web scrape can be downloaded here.
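If you'd rather script the scrape than point and click, the same idea is only a few lines of jsoup.  This is a minimal sketch, not what I actually ran – the URL and CSS selectors below are placeholders, since the real schedule pages are structured differently:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScheduleScraper
{
	public static void main(String[] args) throws Exception
	{
		// Placeholder URL and selectors; the real schedule pages differ,
		// this just shows the shape of the scrape.
		Document doc = Jsoup.connect("http://example.com/schedules/clemson-tigers").get();

		for (Element row : doc.select("table.schedule tr"))
		{
			String date = row.select("td.date").text();
			String opponent = row.select("td.opponent").text();
			String stadium = row.select("td.stadium").text();
			if (!date.isEmpty())
			{
				// One tab-separated row per game: date, opponent, stadium.
				System.out.println(date + "\t" + opponent + "\t" + stadium);
			}
		}
	}
}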

Notice that it's the stadium name and not the actual address.  Thanks to the Google Maps API, this really didn't matter.  I followed the same steps as Randal did to determine the duration of each leg of the journey, and Google's API was able to determine routes simply from the stadium names (a sketch of the call follows the bug list below).  A couple of fun bugs here though:

  1. I tried to drive to Aloha Stadium (U of Hawaii) – Google didn’t like that.
  2. Some kid in Texas decided to give their high school field the same name in Google Maps as Boise State's real field.  I don't understand what happened here – it has reviews and pictures just like the one in Boise.  I tried to make an edit through Google MapMaker, which is a way to crowd-source features on Google Maps, but my edit was rejected (do they really think that is a College football stadium, see the image below?!).  This only became an issue when running routes between Texas universities and Boise State, I assume because the Google algorithm saw the "secondary" stadium in close proximity – kind of cool.  I ended up manually changing Boise State from its stadium name to its actual address in my scrape data, which resolved the API call – Texas A&M was no longer a 10 minute drive from Boise State.  Here's an image of my MapMaker edit attempt:

[Image: screenshot of the rejected Google MapMaker edit]
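Before moving on to the algorithm, here's the shape of the Distance Matrix request used to get each leg's duration.  This is a minimal sketch rather than the exact code I ran – the stadium names and API key are placeholders, and the driving time comes back in seconds under rows[0].elements[0].duration.value in the JSON response:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class LegDuration
{
	public static void main(String[] args) throws Exception
	{
		// Stadium names straight from the scrape; Google geocodes them for us.
		String origin = URLEncoder.encode("Papa John's Cardinal Stadium", "UTF-8");
		String destination = URLEncoder.encode("Alumni Stadium Chestnut Hill MA", "UTF-8");

		// YOUR_API_KEY is a placeholder.
		URL url = new URL("https://maps.googleapis.com/maps/api/distancematrix/json"
				+ "?origins=" + origin + "&destinations=" + destination + "&key=YOUR_API_KEY");

		try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream())))
		{
			String line;
			while ((line = in.readLine()) != null)
			{
				// Parse rows[0].elements[0].duration.value (seconds) out of this JSON.
				System.out.println(line);
			}
		}
	}
}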

The Algorithm

Ok, so now I had every possible leg and its duration – time for the fun part.  What we can do now is write some code that iterates through every possible Thursday-Friday-Saturday route and see which route is shortest.  However, there's another wrinkle here: tailgating.  To really get the full College football experience, you need to be there for the tailgate, of course(!).  This could become a problem on the Saturday leg, because we must assume we start that drive on Saturday morning and the game could potentially start at noon (most of the start times were TBD).

To account for this, we need to make travel time on Saturday more costly.  I took a pretty simple approach: multiply every travel hour on Saturday by two, so the shorter the Saturday leg, the more "fit" the entire route will be.  To be honest though, this somewhat happened by default anyway – because there are so few Thursday and Friday games and so many Saturday games, the algorithm typically picked a close-proximity Saturday leg even before I added the penalty.
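Putting the pieces together, the brute force itself is tiny.  Here's a minimal Java sketch of the idea (the real source is linked at the end of this post) – it assumes a hypothetical Game holder and a travel-hours lookup built from the Distance Matrix calls above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RouteSearch
{
	// Hypothetical holder for one scraped game (matchup plus stadium name).
	public static class Game
	{
		public String matchup;
		public String stadium;
	}

	public static class Route
	{
		public Game thu, fri, sat;
		public double thuToFri, friToSat;

		// Saturday hours count double so a short Saturday leg makes the route more "fit".
		public double fitness() { return thuToFri + 2 * friToSat; }

		public double totalHours() { return thuToFri + friToSat; }
	}

	// travelHours is assumed to map "stadiumA|stadiumB" to driving hours
	// gathered from the Distance Matrix calls above.
	public static List<Route> rank(List<Game> thursday, List<Game> friday, List<Game> saturday,
			Map<String, Double> travelHours)
	{
		List<Route> routes = new ArrayList<Route>();
		for (Game t : thursday)
		{
			for (Game f : friday)
			{
				for (Game s : saturday)
				{
					Route r = new Route();
					r.thu = t;
					r.fri = f;
					r.sat = s;
					r.thuToFri = travelHours.get(t.stadium + "|" + f.stadium);
					r.friToSat = travelHours.get(f.stadium + "|" + s.stadium);
					routes.add(r);
				}
			}
		}

		// Lowest fitness first; the first five entries are the weekend's options.
		Collections.sort(routes, new Comparator<Route>()
		{
			public int compare(Route a, Route b)
			{
				return Double.compare(a.fitness(), b.fitness());
			}
		});
		return routes;
	}
}

Run it once per weekend with that weekend's Thursday, Friday, and Saturday games, and the first five routes in the returned list are the options you see below.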

Ok, drum roll.  Here are the top 5 routes for every weekend, which would you choose? (note that week 3 is when Thursday games start):

|||||||||||||||||||||||||||| Week: 3 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 15.1

Thu, 17 Sep, Clemson Tigers @ Louisville Cardinals

Hours: 14.6

Fri, 18 Sep, Florida State Seminoles @ Boston College Eagles

Hours: 0.6

Sat, 19 Sep, Temple Owls @ UMass Minutemen

————————– Alternate:

Total Hours: 18.1

Thu, 17 Sep, Clemson Tigers @ Louisville Cardinals

Hours: 14.6

Fri, 18 Sep, Florida State Seminoles @ Boston College Eagles

Hours: 3.5

Sat, 19 Sep, Wake Forest Demon Deacons @ Army Black Knights

————————– Alternate:

Total Hours: 19.3

Thu, 17 Sep, Clemson Tigers @ Louisville Cardinals

Hours: 14.6

Fri, 18 Sep, Florida State Seminoles @ Boston College Eagles

Hours: 4.7

Sat, 19 Sep, Central Michigan Chippewas @ Syracuse Orange

————————– Alternate:

Total Hours: 21.2

Thu, 17 Sep, Clemson Tigers @ Louisville Cardinals

Hours: 14.6

Fri, 18 Sep, Florida State Seminoles @ Boston College Eagles

Hours: 6.7

Sat, 19 Sep, East Carolina Pirates @ Navy Midshipmen

————————– Alternate:

Total Hours: 21.3

Thu, 17 Sep, Clemson Tigers @ Louisville Cardinals

Hours: 14.6

Fri, 18 Sep, Florida State Seminoles @ Boston College Eagles

Hours: 6.7

Sat, 19 Sep, Rutgers Scarlet Knights @ Penn State Nittany Lions

|||||||||||||||||||||||||||| Week: 4 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 13.5

Thu, 24 Sep, Cincinnati Bearcats @ Memphis Tigers

Hours: 10.9

Fri, 25 Sep, Boise State Broncos @ Virginia Cavaliers

Hours: 2.6

Sat, 26 Sep, Appalachian State Mountaineers @ Old Dominion Monarchs

————————– Alternate:

Total Hours: 14.1

Thu, 24 Sep, Cincinnati Bearcats @ Memphis Tigers

Hours: 10.9

Fri, 25 Sep, Boise State Broncos @ Virginia Cavaliers

Hours: 3.2

Sat, 26 Sep, Georgia Tech Yellow Jackets @ Duke Blue Devils

————————– Alternate:

Total Hours: 14.2

Thu, 24 Sep, Cincinnati Bearcats @ Memphis Tigers

Hours: 10.9

Fri, 25 Sep, Boise State Broncos @ Virginia Cavaliers

Hours: 3.3

Sat, 26 Sep, Delaware Blue Hens @ North Carolina Tar Heels

————————– Alternate:

Total Hours: 14.3

Thu, 24 Sep, Cincinnati Bearcats @ Memphis Tigers

Hours: 10.9

Fri, 25 Sep, Boise State Broncos @ Virginia Cavaliers

Hours: 3.4

Sat, 26 Sep, Indiana Hoosiers @ Wake Forest Demon Deacons

————————– Alternate:

Total Hours: 14.6

Thu, 24 Sep, Cincinnati Bearcats @ Memphis Tigers

Hours: 10.9

Fri, 25 Sep, Boise State Broncos @ Virginia Cavaliers

Hours: 3.7

Sat, 26 Sep, Virginia Tech Hokies @ East Carolina Pirates

|||||||||||||||||||||||||||| Week: 5 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 19.7

Thu, 1 Oct, Miami (FL) Hurricanes @ Cincinnati Bearcats

Hours: 13.2

Fri, 2 Oct, Memphis Tigers @ USF Bulls

Hours: 6.5

Sat, 3 Oct, North Carolina Tar Heels @ Georgia Tech Yellow Jackets

————————– Alternate:

Total Hours: 19.7

Thu, 1 Oct, Miami (FL) Hurricanes @ Cincinnati Bearcats

Hours: 13.2

Fri, 2 Oct, Memphis Tigers @ USF Bulls

Hours: 6.5

Sat, 3 Oct, Liberty Flames (HC) @ Georgia State Panthers

————————– Alternate:

Total Hours: 19.8

Thu, 1 Oct, Miami (FL) Hurricanes @ Cincinnati Bearcats

Hours: 13.2

Fri, 2 Oct, Memphis Tigers @ USF Bulls

Hours: 6.6

Sat, 3 Oct, South Alabama Jaguars @ Troy Trojans

————————– Alternate:

Total Hours: 20.2

Thu, 1 Oct, Miami (FL) Hurricanes @ Cincinnati Bearcats

Hours: 13.2

Fri, 2 Oct, Memphis Tigers @ USF Bulls

Hours: 7.0

Sat, 3 Oct, San Jose State Spartans @ Auburn Tigers

————————– Alternate:

Total Hours: 20.3

Thu, 1 Oct, Miami (FL) Hurricanes @ Cincinnati Bearcats

Hours: 13.2

Fri, 2 Oct, Memphis Tigers @ USF Bulls

Hours: 7.2

Sat, 3 Oct, Alabama Crimson Tide @ Georgia Bulldogs

|||||||||||||||||||||||||||| Week: 6 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 18.5

Thu, 8 Oct, SMU Mustangs @ Houston Cougars

Hours: 16.9

Fri, 9 Oct, Southern Miss Golden Eagles @ Marshall Thundering Herd

Hours: 1.6

Sat, 10 Oct, Miami (OH) RedHawks (HC) @ Ohio Bobcats

————————– Alternate:

Total Hours: 19.6

Thu, 8 Oct, SMU Mustangs @ Houston Cougars

Hours: 16.9

Fri, 9 Oct, Southern Miss Golden Eagles @ Marshall Thundering Herd

Hours: 2.6

Sat, 10 Oct, Maryland Terrapins @ Ohio State Buckeyes

————————– Alternate:

Total Hours: 20.1

Thu, 8 Oct, SMU Mustangs @ Houston Cougars

Hours: 16.9

Fri, 9 Oct, Southern Miss Golden Eagles @ Marshall Thundering Herd

Hours: 3.2

Sat, 10 Oct, Oklahoma State Cowboys (HC) @ West Virginia Mountaineers

————————– Alternate:

Total Hours: 20.4

Thu, 8 Oct, SMU Mustangs @ Houston Cougars

Hours: 16.8

Fri, 9 Oct, NC State Wolfpack @ Virginia Tech Hokies

Hours: 3.5

Sat, 10 Oct, Georgia Bulldogs @ Tennessee Volunteers

————————– Alternate:

Total Hours: 20.5

Thu, 8 Oct, SMU Mustangs @ Houston Cougars

Hours: 16.8

Fri, 9 Oct, NC State Wolfpack @ Virginia Tech Hokies

Hours: 3.7

Sat, 10 Oct, Miami (OH) RedHawks (HC) @ Ohio Bobcats

|||||||||||||||||||||||||||| Week: 7 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 5.5

Thu, 15 Oct, UCLA Bruins @ Stanford Cardinal

Hours: 2.9

Fri, 16 Oct, UNLV Rebels @ Fresno State Bulldogs

Hours: 2.6

Sat, 17 Oct, San Diego State Aztecs @ San Jose State Spartans

————————– Alternate:

Total Hours: 9.5

Thu, 15 Oct, Western Kentucky Hilltoppers @ North Texas Mean Green

Hours: 8.1

Fri, 16 Oct, Houston Cougars @ Tulane Green Wave

Hours: 1.4

Sat, 17 Oct, Florida Gators @ LSU Tigers

————————– Alternate:

Total Hours: 9.9

Thu, 15 Oct, Western Kentucky Hilltoppers @ North Texas Mean Green

Hours: 8.1

Fri, 16 Oct, Houston Cougars @ Tulane Green Wave

Hours: 1.8

Sat, 17 Oct, UTSA Roadrunners @ Southern Miss Golden Eagles

————————– Alternate:

Total Hours: 12.7

Thu, 15 Oct, UCLA Bruins @ Stanford Cardinal

Hours: 11.8

Fri, 16 Oct, Cincinnati Bearcats @ BYU Cougars

Hours: 0.9

Sat, 17 Oct, Arizona State Sun Devils @ Utah Utes

————————– Alternate:

Total Hours: 12.3

Thu, 15 Oct, Auburn Tigers @ Kentucky Wildcats

Hours: 10.9

Fri, 16 Oct, Houston Cougars @ Tulane Green Wave

Hours: 1.4

Sat, 17 Oct, Florida Gators @ LSU Tigers

|||||||||||||||||||||||||||| Week: 8 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 4.1

Thu, 22 Oct, California Golden Bears @ UCLA Bruins

Hours: 2.2

Fri, 23 Oct, Utah State Aggies @ San Diego State Aztecs

Hours: 1.9

Sat, 24 Oct, Utah Utes @ USC Trojans

————————– Alternate:

Total Hours: 7.9

Thu, 22 Oct, California Golden Bears @ UCLA Bruins

Hours: 2.2

Fri, 23 Oct, Utah State Aggies @ San Diego State Aztecs

Hours: 5.7

Sat, 24 Oct, Washington State Cougars @ Arizona Wildcats

————————– Alternate:

Total Hours: 9.2

Thu, 22 Oct, California Golden Bears @ UCLA Bruins

Hours: 2.2

Fri, 23 Oct, Utah State Aggies @ San Diego State Aztecs

Hours: 7.0

Sat, 24 Oct, New Mexico Lobos @ San Jose State Spartans

————————– Alternate:

Total Hours: 15.5

Thu, 22 Oct, Georgia Southern Eagles @ Appalachian State Mountaineers

Hours: 14.3

Fri, 23 Oct, Memphis Tigers @ Tulsa Golden Hurricane

Hours: 1.2

Sat, 24 Oct, Kansas Jayhawks @ Oklahoma State Cowboys

————————– Alternate:

Total Hours: 9.5

Thu, 22 Oct, California Golden Bears @ UCLA Bruins

Hours: 2.2

Fri, 23 Oct, Utah State Aggies @ San Diego State Aztecs

Hours: 7.3

Sat, 24 Oct, Washington Huskies (HC) @ Stanford Cardinal

|||||||||||||||||||||||||||| Week: 9 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 4.4

Thu, 29 Oct, West Virginia Mountaineers @ TCU Horned Frogs

Hours: 4.2

Fri, 30 Oct, Louisiana Tech Bulldogs @ Rice Owls

Hours: 0.2

Sat, 31 Oct, Vanderbilt Commodores @ Houston Cougars

————————– Alternate:

Total Hours: 5.8

Thu, 29 Oct, West Virginia Mountaineers @ TCU Horned Frogs

Hours: 4.2

Fri, 30 Oct, Louisiana Tech Bulldogs @ Rice Owls

Hours: 1.6

Sat, 31 Oct, South Carolina Gamecocks @ Texas A&M Aggies

————————– Alternate:

Total Hours: 6.3

Thu, 29 Oct, Texas State Bobcats @ Georgia Southern Eagles

Hours: 5.0

Fri, 30 Oct, Louisville Cardinals @ Wake Forest Demon Deacons

Hours: 1.3

Sat, 31 Oct, Miami (FL) Hurricanes @ Duke Blue Devils

————————– Alternate:

Total Hours: 6.6

Thu, 29 Oct, Texas State Bobcats @ Georgia Southern Eagles

Hours: 5.0

Fri, 30 Oct, Louisville Cardinals @ Wake Forest Demon Deacons

Hours: 1.6

Sat, 31 Oct, Troy Trojans @ Appalachian State Mountaineers

————————– Alternate:

Total Hours: 6.6

Thu, 29 Oct, Texas State Bobcats @ Georgia Southern Eagles

Hours: 5.0

Fri, 30 Oct, Louisville Cardinals @ Wake Forest Demon Deacons

Hours: 1.6

Sat, 31 Oct, Clemson Tigers @ NC State Wolfpack

|||||||||||||||||||||||||||| Week: 10 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 7.7

Thu, 5 Nov, Nevada Wolf Pack @ Fresno State Bulldogs

Hours: 2.6

Fri, 6 Nov, BYU Cougars @ San Jose State Spartans

Hours: 5.1

Sat, 7 Nov, Arizona Wildcats (HC) @ USC Trojans

————————– Alternate:

Total Hours: 10.1

Thu, 5 Nov, Baylor Bears @ Kansas State Wildcats

Hours: 7.3

Fri, 6 Nov, Temple Owls @ SMU Mustangs

Hours: 2.8

Sat, 7 Nov, Auburn Tigers @ Texas A&M Aggies

————————– Alternate:

Total Hours: 10.3

Thu, 5 Nov, Baylor Bears @ Kansas State Wildcats

Hours: 7.3

Fri, 6 Nov, Temple Owls @ SMU Mustangs

Hours: 3.0

Sat, 7 Nov, Iowa State Cyclones @ Oklahoma Sooners

————————– Alternate:

Total Hours: 10.4

Thu, 5 Nov, Baylor Bears @ Kansas State Wildcats

Hours: 7.3

Fri, 6 Nov, Temple Owls @ SMU Mustangs

Hours: 3.1

Sat, 7 Nov, Kansas Jayhawks @ Texas Longhorns

————————– Alternate:

Total Hours: 10.8

Thu, 5 Nov, Baylor Bears @ Kansas State Wildcats

Hours: 7.3

Fri, 6 Nov, Temple Owls @ SMU Mustangs

Hours: 3.5

Sat, 7 Nov, New Mexico State Aggies @ Texas State Bobcats

|||||||||||||||||||||||||||| Week: 11 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 21.4

Thu, 12 Nov, Virginia Tech Hokies @ Georgia Tech Yellow Jackets

Hours: 20.2

Fri, 13 Nov, USC Trojans @ Colorado Buffaloes

Hours: 1.2

Sat, 14 Nov, UNLV Rebels @ Colorado State Rams

————————– Alternate:

Total Hours: 21.7

Thu, 12 Nov, Virginia Tech Hokies @ Georgia Tech Yellow Jackets

Hours: 20.2

Fri, 13 Nov, USC Trojans @ Colorado Buffaloes

Hours: 1.5

Sat, 14 Nov, Utah State Aggies @ Air Force Falcons

————————– Alternate:

Total Hours: 22.4

Thu, 12 Nov, Louisiana’s Ragin’ Cajuns @ South Alabama Jaguars

Hours: 21.2

Fri, 13 Nov, USC Trojans @ Colorado Buffaloes

Hours: 1.2

Sat, 14 Nov, UNLV Rebels @ Colorado State Rams

————————– Alternate:

Total Hours: 22.7

Thu, 12 Nov, Louisiana’s Ragin’ Cajuns @ South Alabama Jaguars

Hours: 21.2

Fri, 13 Nov, USC Trojans @ Colorado Buffaloes

Hours: 1.5

Sat, 14 Nov, Utah State Aggies @ Air Force Falcons

————————– Alternate:

Total Hours: 28.8

Thu, 12 Nov, Virginia Tech Hokies @ Georgia Tech Yellow Jackets

Hours: 20.2

Fri, 13 Nov, USC Trojans @ Colorado Buffaloes

Hours: 8.6

Sat, 14 Nov, Kansas State Wildcats @ Texas Tech Red Raiders

|||||||||||||||||||||||||||| Week: 12 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 5.6

Thu, 19 Nov, East Carolina Pirates @ UCF Knights

Hours: 1.8

Fri, 20 Nov, Cincinnati Bearcats @ USF Bulls

Hours: 3.9

Sat, 21 Nov, Georgia Tech Yellow Jackets @ Miami (FL) Hurricanes

————————– Alternate:

Total Hours: 5.8

Thu, 19 Nov, East Carolina Pirates @ UCF Knights

Hours: 1.8

Fri, 20 Nov, Cincinnati Bearcats @ USF Bulls

Hours: 4.0

Sat, 21 Nov, Western Kentucky Hilltoppers @ FIU Golden Panthers

————————– Alternate:

Total Hours: 5.9

Thu, 19 Nov, East Carolina Pirates @ UCF Knights

Hours: 1.8

Fri, 20 Nov, Cincinnati Bearcats @ USF Bulls

Hours: 4.2

Sat, 21 Nov, Chattanooga Mocs @ Florida State Seminoles

————————– Alternate:

Total Hours: 8.3

Thu, 19 Nov, East Carolina Pirates @ UCF Knights

Hours: 1.8

Fri, 20 Nov, Cincinnati Bearcats @ USF Bulls

Hours: 6.5

Sat, 21 Nov, South Alabama Jaguars @ Georgia State Panthers

————————– Alternate:

Total Hours: 8.7

Thu, 19 Nov, East Carolina Pirates @ UCF Knights

Hours: 1.8

Fri, 20 Nov, Cincinnati Bearcats @ USF Bulls

Hours: 7.0

Sat, 21 Nov, Idaho Vandals @ Auburn Tigers

|||||||||||||||||||||||||||| Week: 13 ||||||||||||||||||||||||||||

————————– Top Option:

Total Hours: 2.9

Thu, 26 Nov, Texas Tech Red Raiders @ Texas Longhorns

Hours: 2.7

Fri, 27 Nov, Navy Midshipmen @ Houston Cougars

Hours: 0.2

Sat, 28 Nov, Charlotte 49ers @ Rice Owls

————————– Alternate:

Total Hours: 3.7

Thu, 26 Nov, Texas Tech Red Raiders @ Texas Longhorns

Hours: 2.9

Fri, 27 Nov, Baylor Bears @ TCU Horned Frogs

Hours: 0.8

Sat, 28 Nov, UTEP Miners @ North Texas Mean Green

————————– Alternate:

Total Hours: 5.6

Thu, 26 Nov, Texas Tech Red Raiders @ Texas Longhorns

Hours: 2.7

Fri, 27 Nov, Navy Midshipmen @ Houston Cougars

Hours: 2.9

Sat, 28 Nov, Middle Tennessee Blue Raiders @ UTSA Roadrunners

————————– Alternate:

Total Hours: 6.7

Thu, 26 Nov, Texas Tech Red Raiders @ Texas Longhorns

Hours: 2.7

Fri, 27 Nov, Navy Midshipmen @ Houston Cougars

Hours: 4.0

Sat, 28 Nov, Texas A&M Aggies @ LSU Tigers

————————– Alternate:

Total Hours: 7.0

Thu, 26 Nov, Texas Tech Red Raiders @ Texas Longhorns

Hours: 2.9

Fri, 27 Nov, Baylor Bears @ TCU Horned Frogs

Hours: 4.1

Sat, 28 Nov, Oklahoma Sooners @ Oklahoma State Cowboys

Conclusion

I'd select the last option in week 7: you start in Lexington (great town) and finish with two nights in or near New Orleans.  It's interesting how much the Thursday/Friday games drive the algorithm, but as discussed, this was expected.  They limit the complexity of the search by limiting the possible road trips – however, they also limit how "fit" any of these routes can actually be (see weeks 5 and 11).  The fittest route is found in our final week, with some small hops around Texas.  I considered adding NFL Sunday to the algorithm, but again, this would limit the possible fitness even further, just like Thursday and Friday did.  Plus, my wife would kill me if the trip ran until Monday.

See you in Lexington!

Oh, and if you want to see the source code, it’s found here.

Knapsacks of Legal Gambling!

Have you ever heard of Fanduel or DraftKings?  Some people have full time jobs winning on them…

Basically it's a twist on classic fantasy football (or fantasy sport x): instead of drafting a single team at the start of the year, you draft short-lived teams on a weekly, daily, or even day-segment basis.  Why?  Well, there's ample opportunity to get it right.  Instead of classic fantasy football, where you have all your eggs in a single basket, you can now draft a team for the Saturday NFL playoff games only, for example, and challenge others for money.

Yes, and it’s totally legal:

 “Fantasy sports is considered a game of skill and received a specific exemption from the 2006 Unlawful Internet Gambling Enforcement Act (UIGEA 2006).”

A game of skill.

So, before I get to the meat of testing the game of skill, here's how it works at one thousand feet: each available player has a price associated with them, and you draft your team using a salary cap assigned to the contest (a contest is made up of one or more of the same sporting events which contain the players you are choosing).  It's like shopping for players with a limited amount of fake cash to buy them (what I like to call space bucks).  You must also fill a certain number of roster spots.  So a typical College football contest could have players that range from 4.5 – 11 space bucks, and you have to fill a roster of QB (1), RB (2), WR (3), TE (1) with a salary cap of 45 space bucks.  Then you take your roster and put it up against other player(s), as a bet, and that bet can range from $1 to thousands (that's real money, not space bucks)…and of course, just like Vegas, Fanduel and DraftKings keep a slice of the winnings for themselves.

If you're a machine learning guy competing on Kaggle, you really should be looking here too, in my opinion.  Here's why: you are playing other players, not the house.  In other words, you can do head-to-head games, and you never know who you're going to get – there's no weight class or handicap.  So, if you can master the statistics behind your roster selections in the smaller windows these games provide, you likely have an advantage against opponents relying on heuristics and luck (more on luck later).  One of my favorite blog posts on a similar subject comes from Mike Greenfield – you should read it, but the gist of it is: why battle to eke out victories in efficient markets like the stock market or Vegas when you can win big in inefficient markets like NCAA tournament bracket pools, where your buddies are making irrational decisions?

Ok, so to the meat – there are actually two things you need to solve here:

1. Create some kind of metric to assign a value to each player

2. Determine the perfect roster that optimizes value but remains under your space buck salary cap

The point of this blog is #2.  My buddy Matt and I spent too much time last week gathering statistics and building a value score for each player, and then, the night before the games, we had to pick our roster – brute forcing it by hand to find a roster that looked to have the best value within cost.

So, for you computer science guys, does this sound familiar?  Albeit modified, it’s the classic Knapsack problem:

 “The knapsack problem or rucksack problem is a problem in combinatorial optimization: Given a set of items, each with a mass and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items.”

[Image: the classic knapsack problem illustration]

There are 3 types of Knapsack problems I'm aware of, none of which match our problem at hand, but they are a good starting point for solving #2; those options are:

Unbounded: there is an unlimited supply of each player and you can take duplicates to get the most value for cost (you could have a roster of 5 Andrew Lucks)

0-1: there is only one of each player, and you need to select the best combination to get the most value for cost (you could end up with a roster with the wrong number of each position, like 4 QBs)

Bounded: there are varying quantities of each player, and you need to select the best combination to get the most value for cost (there could be 2 Andrew Lucks and 5 Tom Bradys to choose from, and you end up with the problems associated with both unbounded and 0-1)

All of the above have algorithms to solve them efficiently, but unfortunately, none of them match the extra constraint we have in our problem, which is that there must be a set number of each player type – in this case, the types are positions.  I call this the fourth Knapsack problem:

0-1 Type-Bounded: There is only one of each player, but we need a set number of each position to fill out a valid knapsack.  In this knapsack problem, each item now has four characteristics: name, value, cost, and type.

So I needed to create a new algorithm, and of course the best starting point is looking at how the known algorithms work and then modifying one.  However, after messing around with 0-1 for too long I was stuck; but then I read this post.  I don't know about you, but just examining code to extend or modify it never really works for me – I have to have some prose describing what it is really trying to do.  Well, Mr. Givens in this post gave me my breakthrough when he eloquently states:

“The unbounded knapsack problem is fairly easy to solve:

  1. Determine the value-weight ratio of every item.
  2. Starting with the highest value-weight ratio item, place as many of this item as will fit into the sack.
  3. Move onto the next-highest value-weight item and repeat step 2 until the sack is full or there are no other items.”

The ah-ha part for me here was that sorting by value-weight ratio (in our case, value-cost ratio) largest to smallest and adding items in that order gives a solution to the unbounded problem (a greedy one, but plenty good for our purposes) – voila.  All I needed to do now was use that logic, except skip over positions that are already filled and never add the same player more than once.

So:

  1. Create a sorted list, largest to smallest, of all the players using their value-cost ratio.
  2. Iterate through that list, adding each player to the roster, but only if the roster is not already filled for that position.

BOOM.  It works, and it’s fast.  Here’s the code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TypeBoundedKnapsack 
{
	public static void main(String[] args)
	{
		// How many of each position a valid roster needs.
		Map<String,Integer> max = new HashMap<String,Integer>();
		max.put("QB", 1);
		max.put("RB", 2);
		max.put("WR", 3);
		max.put("TE", 1);

		// How many of each position we have filled so far.
		Map<String,Integer> counts = new HashMap<String,Integer>();
		counts.put("QB", 0);
		counts.put("RB", 0);
		counts.put("WR", 0);
		counts.put("TE", 0);

		Roster r = new Roster();
		double maxRosterCost = 50000;

		File file = new File(args[0]);
		// The TreeSet orders players by value-cost ratio, best first.
		Set<Player> players = populatePlayers(file, max);

		for (Player p : players)
		{
			// Take the player only if their position still has an open slot
			// and adding them keeps the roster at or under the salary cap.
			if (counts.get(p.position) < p.maxOnRoster && r.getCost() + p.cost <= maxRosterCost)
			{
				r.players.add(p.copy());
				counts.put(p.position, counts.get(p.position) + 1);
			}
		}

		System.out.println(r.toString());
	}

	private static Set<Player> populatePlayers(File file, Map<String,Integer> counts)
	{
		Set<Player> players = new TreeSet<Player>();
		BufferedReader br;
		try {
			br = new BufferedReader(new FileReader(file));

			String line;
			while ((line = br.readLine()) != null) 
			{
				// Each line is tab separated: position, name, cost, value.
				String[] playerData = line.split("\t");
				int maxOnRoster = counts.get(playerData[0]);
				Player p = new Player(playerData[0], 
						playerData[1], 
						Double.parseDouble(playerData[2]), 
						Double.parseDouble(playerData[3]),
						maxOnRoster);
				players.add(p);
			}
			br.close();
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}

		return players;
	}
}
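To make the input format concrete, populatePlayers expects one tab-separated line per player – position, name, cost, value – and you point the program at that file.  The players and numbers below are made up for illustration:

QB	Quarterback One	7800	19.4
RB	Running Back One	6900	15.1
RB	Running Back Two	5400	11.8
WR	Wide Receiver One	8200	18.7
TE	Tight End One	4500	8.3

java TypeBoundedKnapsack players.tsv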

And the Players:

public class Player implements Comparable<Player>
{
	public double cost;
	public double value;
	public String name;
	public String position;
	public int maxOnRoster;
	
	public Player()
	{
		
	}
	
	public Player(String position, String name, double cost, double value, int maxOnRoster)
	{
		this.position = position;
		this.name = name;
		this.cost = cost;
		this.value = value;
		this.maxOnRoster = maxOnRoster;
	}

	public int compareTo(Player p) {
		// Reverse order, highest value-cost ratio first.
		int byRatio = Double.compare(p.value / p.cost, this.value / this.cost);
		// Break ties by name so the TreeSet doesn't drop players with equal ratios.
		return byRatio != 0 ? byRatio : this.name.compareTo(p.name);
	}

	protected Player copy() 
	{
		Player p = new Player();
		p.cost = this.cost;
		p.value = this.value;
		p.name = this.name;
		p.position = this.position;
		p.maxOnRoster = this.maxOnRoster;
		
		return p;
	}
	

}

And the Roster:

import java.util.ArrayList;
import java.util.List;

public class Roster {

	public List<Player> players = new ArrayList<Player>();
	
	public double getCost()
	{
		double totalCost = 0;
		for (Player p : players)
		{
			totalCost += p.cost;
		}
		return totalCost;
	}
	
	public double getValue()
	{
		double totalValue = 0;
		for (Player p : players)
		{
			totalValue += p.value;
		}
		return totalValue;
	}

	@Override
	public String toString() 
	{
		String team = "";
		for (Player p : players)
		{
			team += p.position + ": " + p.name + "\n";
		}
		
		return "Total Cost: " + this.getCost() + "\n" + 
				"Total Value: " + this.getValue() + "\n" + 
				team;		
	}
}

So, remember I said I’d get back to the luck part?

Well, the first time we tried this we went 12-6. Cool.

The next time, 1-15.  Ouch.

Since we've got the algorithm for solving the perfect roster combination, it's either the method we're using to create our "value" score for the players (#1 above) that's poor, or it's just bad luck.  Probably a little of both.

Or maybe the guys we played were just more skilled than us?

Santa MapReduce

Today the kids and I were discussing how fast Santa has to travel to get to everyone’s house in one night.  I asked some questions and got the typical “1 million hundred times fast daddy!”…but this got me thinking.

Santa’s elf engineers need to get a Hadoop cluster.

Santa's Christmas Eve duties are a perfect Hadoop MapReduce use case.  I always have trouble explaining MapReduce, and the obligatory word count example – well, it's just counting words, and it doesn't resonate with anyone.  We've heard "MapReduce brings the code to the data", or "you can analyze all your data at once".  To the lay person who can barely spell Hadoop, this is pretty meaningless, and if you read one of my earlier posts, just "buying" Hadoop doesn't get you anywhere.  I like this use case, so I figured I'd put it into words.

So here we are, another Christmas Eve, and Santa has to visit every child's house (well, every Christian child's house) in sequence.  If Santa's journey tonight were done in code, it would look something like this:

while ( more Christian houses ) –> get children –> for each child –> get child wish list –> for each wish –> deliver gift

Santa probably can't fit all of that into memory to make this code work in the first place, but even if he could, it would take a while.  This would be a classic vertical scaling solution – beef up the server with memory and processing and you could probably run this fast…but it will never scale as far as the horizontal approach, aka Hadoop MapReduce.

As you can see, this is a perfect use case for parallelization, and really, that's all MapReduce is: a very tidy abstraction of parallelization – an easy to understand pattern where you map by something, and then reduce by what you just mapped.  Huh?

Ok, so think of it this way: let's pretend that Santa could pick up houses and drop them in several available "processing plants" manned by hundreds of elves.  Those processing plants can deliver gifts to the house, and then put the houses back.  So what would be the most efficient way to organize which houses go to which of Santa's processing plants?

By the wished-for gift.

By the way, you just created your first MapReduce mapper: you map by the wished-for gift.  So, let's back up for a minute and look at the data so this makes a little more sense.  Here is a sample file Santa has to work with:

House 1, Bobby, Train

House 1, Lucy, Doll

House 1, Lucy, Train

House 2, Steve, Train

House 3, Lily, Train

House 3, Isla, Doll

For example's sake, let's pretend there are only 2 gifts that Santa can deliver.  Looking at the above list, and using the logic we just came up with, we only need two processing plants: one for Trains and one for Dolls.  House 1 would actually go to two plants (yep, we are now duplicating houses, but they are already duplicated in the file).  House 2 would go to the Train processing plant only, and House 3 would also go to both.

We could have mapped by house, but this would have been less efficient, because there are 3 houses and only 2 gifts, so we would have needed an extra processing plant…oh, and by the way, going house by house is already what Santa is doing.  It would still be faster using this paradigm since at least he'd be doing the houses in parallel, but not as efficient as having processing plants focused only on gifts.  Mapping by gift also makes the code a lot less complicated…for example:

If we map by house, the processing plant has to:

get children –> for each child –> get child wish list –> for each wish –> deliver gift

If we map by gift, the processing plant has to only:

–> deliver gift

There is a downside to this though: think about what happens if we had 2 billion houses and only two possible gifts.  Each processing plant now has to deliver ~1 billion gifts to all the houses flowing to it…and we are back to a memory issue.  Of course, there are millions of gifts, so this isn't a real issue for Santa, but it's something that needs to be considered in a different use case.

So now we have two sets of code, the mapper:

parse gift from the above file –> key on this for delivery to the processing plant

the reducer (the processing plant):

–> deliver gift
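In actual Hadoop Java, those two pieces look roughly like this – a sketch against the sample comma-separated file above, with the job driver boilerplate omitted:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SantaMapReduce
{
	// The "survey elf": parses a line of "House, Child, Gift" and keys on the gift.
	public static class GiftMapper extends Mapper<LongWritable, Text, Text, Text>
	{
		@Override
		protected void map(LongWritable offset, Text line, Context context)
				throws IOException, InterruptedException
		{
			String[] parts = line.toString().split(",");
			String house = parts[0].trim();
			String child = parts[1].trim();
			String gift = parts[2].trim();

			// Key by gift; keep house:child as the value so the plant knows where it goes.
			context.write(new Text(gift), new Text(house + ":" + child));
		}
	}

	// The "processing plant": sees one gift and every house:child that wished for it.
	public static class DeliveryReducer extends Reducer<Text, Text, Text, Text>
	{
		@Override
		protected void reduce(Text gift, Iterable<Text> destinations, Context context)
				throws IOException, InterruptedException
		{
			for (Text destination : destinations)
			{
				// "Deliver" the gift: one output record per house:child delivery.
				context.write(gift, destination);
			}
		}
	}
}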

So, you're probably wondering where that mapper code runs…think of it this way – and this is what people mean when they say you "push the code to the data" – that mapper code gets pushed out to every Christian neighborhood (every file in HDFS) and runs there.  The code runs where the data/file lives on the server, and these all run in parallel across all the neighborhoods at once – think of it as the elves going out to every neighborhood to do the work simultaneously.  However, depending on your Hadoop cluster size (the number of survey elves you have), some of the neighborhoods will have to wait in line while an elf finishes the one in front of it.  (Note there is also a limit to the number of processing plants based on your cluster size, so some processing plants may not start until others are done delivering gifts.)

So as an elf processes a house, the house is picked up and copied to one or more processing plants.  Here's where some slowdown can occur, because of the copying and shuffling.  And you really don't want to copy the entire house if you can avoid it – keep these objects small.

There are some other considerations here.  For example, if you "map" by the gift, how do you know which house it belongs to (or even which kid) in the processing plant?  You basically lose that history if all you send to the processing plant (the reducer) is the gift.  You can do things like add the house and kid as part of the "value" that is sent – the gift as the key and the array of house:kid pairs that wished for it as the value.  In fact, that's why the word count example is elegant: if all you're doing is counting, you don't need to worry about mapping back to which book the word came from.  Another important point is that the processing plant (reducer) produces a separate output – you are not actually putting the house back, you are just delivering the gift into the house clone that went to the plant…are we talking mirror universes here??

To recap, Santa’s Christmas eve now goes like this:

Santa's survey elves go out to every neighborhood (file) in the world (HDFS), determine the wished-for gifts, and transport each house to the appropriate processing plant based on the gift(s), which may mean cloning the house (the mapper).

Each processing plant then receives the houses and drops in the gift it is responsible for – a pretty simple task (the reducer).  Note that the more complex task is left to the survey elves, since they are local to the houses (data locality) and they only send what the processing plant needs to do its job.

Ok, that sounds even crazier than what I just told my kids, but hell, it would work a lot better.  Now I’m going back to my Christmas eve champagne, my wife is starting to look at me funny.

Dreams: Low Latency Hadoop

Hadoop works wonderfully, but it’s passé…the future is ___________

Streaming?

In-Memory Db?

Something else?

This really should be rephrased as: Hadoop works wonderfully, but I need a low-latency solution that allows on-the-fly exploration of data.  So what is that?  It's a combination of Hadoop and a sidecar that helps with real time exploration – your "big data architecture".

If you are worried about low latency, then your problem just became a lot more complex, because you've now introduced the classic consistency problem.  A great and popular blog entry by Nathan Marz describes the Lambda architecture, and notes that Hadoop can solve the consistency issue by relaxing the need for low-latency results.  However, most scenarios require the most up-to-date results possible – hence the search for a Hadoop+ architecture.

First off:

  • There is no single architectural solution.
  • There is no “next thing” that is so much better or obvious.

Here's a typical scenario requiring a big data architecture, from nearly all my customers: "I want to ask complex multi-dimensional questions against huge data sets."  Key to this is the fact that the questions they want their data to answer aren't necessarily explicit components in the data.  What I mean by that is that parts of the multi-dimensional questions involve derived data, from potentially disparate sources.

Hey, a great scenario for Hadoop.

So, back in 2011, we started writing custom Hadoop MapReduce jobs for every possible question an analyst came up with.  Yeah, that's scalable [insert sarcasm here].  It's actually horrible on multiple levels: the analysts don't trust you did it right (black box), you likely get too many results, and once you're done, the job is difficult if not impossible for others to leverage.

So what to do?

We built Amino.  Bear with me – Amino alone doesn't solve the issue of latency in Hadoop (today), but it is a building block to do so.

We noticed that many of our custom MapReduce jobs were built from the same boilerplate components.  Hey, what if we pre-computed those components…and indexed them in a way that allows analysts to mix and match them on the fly, asking these complex multi-dimensional questions as a query?

The special sauce, from a technical standpoint, is the Amino index, which is fully scalable and provides sub-second queries regardless of the number of dimensions.  This is enabled through Accumulo iterators in combination with compressed bit vectors.
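I can't go deep on the Amino internals here, but the core trick is easy to sketch: keep one bit vector per pre-computed component over the record IDs, and a multi-dimensional question becomes a series of bitwise ANDs.  Below is a toy, in-memory version using java.util.BitSet – the real index uses compressed bit vectors served through Accumulo iterators, not a HashMap:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class ToyAminoIndex
{
	// One bit vector per pre-computed component; bit i is set if record i has that feature.
	private final Map<String, BitSet> components = new HashMap<String, BitSet>();

	public void add(String component, int recordId)
	{
		BitSet bits = components.get(component);
		if (bits == null)
		{
			bits = new BitSet();
			components.put(component, bits);
		}
		bits.set(recordId);
	}

	// A multi-dimensional question is just an AND across the chosen components.
	public BitSet query(String... wantedComponents)
	{
		BitSet result = null;
		for (String component : wantedComponents)
		{
			BitSet bits = components.get(component);
			if (bits == null)
			{
				return new BitSet(); // unknown component means an empty result
			}
			if (result == null)
			{
				result = (BitSet) bits.clone();
			}
			else
			{
				result.and(bits);
			}
		}
		return result == null ? new BitSet() : result;
	}
}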

The special sauce, from a concept standpoint, is that developers from across an organization can create components (through MapReduce) and then contribute those to the Amino index.  Now developers who know nothing of each other could have contributed components that together answer an analyst's question.

The flow goes like this: at some increment, the contributed jobs run and spit their results to HDFS; once all the jobs complete, Amino sucks up the results and does some conversions to store them in the "special" index.

I don't want to downplay the power here.  We enabled analysts to answer questions in seconds that previously took days to weeks, or were not possible at all.  But that's not good enough – there are two issues:

1. Since the components are derived data, typically aggregates across the entire set, the jobs must run across everything, every time.  The Lambda architecture attempts to solve this through sub-aggregates, like storing results by hour and then aggregating up at query time.  I would argue this is either severely limiting (you only do counts by hour) or requires insanely complex logic in the query engine to account for every possible roll-up, particularly when considering multiple different MapReduce outputs (the point of Amino).

2. Your results are only as new as the last time you ran the job(s).

This is where we are today, but I believe this is solvable in a Lambda-kinda way.  In the Lambda architecture, the final step is to aggregate your real-time data with the batch data as a query-time roll-up.  In a great blog, Jay Kreps describes the Lambda architecture's downfall as having to maintain the real-time and batch code paths in parallel.  I would pile on that statement and say, as previously mentioned: what about the roll-up code?  If you are doing anything more complex than counts by hour, this is going to get super hairy, even if you don't have the real-time stuff to worry about.

Well, what if we went ahead and did the aggregating (roll ups) in real time?

So there's this thing called iterators in Accumulo – I mentioned them earlier; they are used to do the bitwise operations in the Amino index that enable our multi-dimensional queries.  Think of an iterator as a mini reduce step.  At query time, your "map" is the keys of the query, and the "reduce" is the iterator.  These can also run on insert.

So, take a step back and remember that Amino is a bunch of MapReduce results from different developers that act as components for analyst questions.  Well, we could instead pre-set those components in the index and update them with iterators as data arrives.  So, instead of the Amino API abstracting MapReduce, it should abstract iterator roll-ups that occur at insert time.
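To make "roll-ups at insert time" a bit more concrete, here's roughly what the simplest flavor looks like, sketched as an Accumulo Combiner.  This is illustrative only, not Amino's actual roll-up code, and it assumes the partial values are longs written as strings:

import java.util.Iterator;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

// A tiny roll-up: every write for a key carries a partial count, and the iterator
// folds them into a single running total at scan and compaction time, so the
// "batch job" answer stays up to date as data arrives.
public class RunningCountCombiner extends Combiner
{
	@Override
	public Value reduce(Key key, Iterator<Value> values)
	{
		long total = 0;
		while (values.hasNext())
		{
			// Assumes each partial value is a long written as a UTF-8 string.
			total += Long.parseLong(new String(values.next().get()));
		}
		return new Value(Long.toString(total).getBytes());
	}
}

Configure something like this on the table and every write just adds a partial value; the table folds them into the running total for you.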

So why is this better?

Remember that Amino offloaded the creation of components to various developers: they could do whatever they wanted in those jobs, as long as the output met the conventions to be effectively indexed in the Amino index.  Similarly, if the developers implement their own iterator roll-ups, they can do whatever they want in there, as long as the output meets the requirements of the query.

This removes the issue of complex logic in the query-time roll-ups and also removes Jay Kreps' point about needing the batch and stream code to output the same thing.

But alas, as Jay mentions, iterators, just like stream processing, are not immediately consistent – and just as with Lambda, the CAP theorem still applies.

The team is working on it now…we'll let you know how it goes.

 

Quick update (8/15/2014): Reading the Google Mesa paper, and there are some striking similarities between what I describe and their data model, specifically the versioned aggregation technique.

 

1 Picture = 1,000 Words. 1 Picture x Earth’s Surface x 365 x 2 = Knowledge

Look at an overhead picture of anywhere at any time.

Sit back and think about the possibilities for a second.

Thinking about going to the gym?  Check the parking lot to see how crowded it is first.

Wondering what side of the road the accident ahead of you is on?  Look at the image of the road ahead.

Trying to find where your buddies are tailgating?  Look for their car before you drive over.

Long ago, in an old job I can't talk about, high resolution pictures taken from outer space were magic stuff…

It wasn’t that long ago, though.

Now, we take it for granted that we can look at an image of our house on Google Maps, or even a street view of the address for your meeting tomorrow.

Typically these satellite images are years old, maybe months if you're lucky.  Google buys them from a commercial satellite company like DigitalGlobe (another previous employer) and fuses them into its globe for your enjoyment – the acquisition, transfer, and fusion take time, so you'll never see an image that's only hours old.  Hi, my house last summer!

Until now.

Well, not now, but really soon.

Satellites are the easy part – at least it seems that way now.  Check out the latest Google acquisition, Skybox.  What does Google want with a satellite company?  More like, what doesn't Google want with a satellite company!?

In fact, they aren’t a satellite company at all, and their founders agree.  They believe they are creators of knowledge – from data that happens to be collected across the globe, twice a day by 2016.  Will you be able to get it in minutes or on-demand by 2020?  That doesn’t seem to be crazy talk anymore.

You want to talk about big data – that’s a hell of a lot of pixels to store and index every day.  This is the kind of massive data dreams are made of.  Not only must Skybox and Google figure out how to index this data effectively for rapid retrieval (wonder if Google will do that well?), but they must be able to analyze it.

This isn’t some college grad on a light table – no, to create knowledge, the image data must be analyzed in automated ways, at scale.

A developer API for this data has been rumored.  Maybe that means on-demand query of imagery by location at first.  But let your imagination go crazy for a second – what about running algorithms over the same orthorectified area across the past year to create time series statistics?  That starts getting really complex for the API provider: a framework that not only provides access to that data, but also scales your algorithms to process it.  Dream job?

The future should be about providing not only data services, but data-analytic services.  Developer APIs should be about the logic of what developers are trying to accomplish, and only that.  Let the API service provider worry about data access, scaling of jobs, and indexing of results.

So, yeah, Google could keep this to themselves and build their own hedge fund based on this new source of knowledge no one else has (and also build some damn good maps).  Or, they could open up a platform API that allows retrieval and analysis of this data to build knowledge I can’t even imagine.

I hope they do the latter.

Quick update (7/30/2014):  Just saw Will Marshall of Planet Labs give an exceptional talk at OSCON – very exciting stuff…looks like this is going to happen; here's the video.

It’s all about the data

Hadoop, Storm, HBase, Cassandra, … – powerful stuff.  It's the new age of distributed big data tools that anyone who's anyone should be using.

However, we’ve all heard something like this before: “My CIO said I had to buy some Hadoop, so now we are using HDFS as our data backup system because we didn’t know what else to do with it”.

Yup.  If you don’t think that’s happening – ignorance is bliss.

Technology pivots are scary.

Leadership understands the value of big data; however, the risk in moving whole hog is great.  Instead, big data typically ends up in the corner wearing Birkenstocks – a side project that's never going to see the light of day.  It was doomed from the start…not because it isn't great, but because it's not solving a real problem.  With the exception of the Googles, Facebooks, and LinkedIns of the world – those that know their product depends on it – the big data technology pivot is scary, damn scary.

Why?

Learning Curve.

Security.

Let's start with security.  If you're going to "do" big data – build correlations, make predictions, drive more value to your customers – your data is going to get you there, but that's never going to happen with data stovepipes across a large organization.  SOA pretends to solve this, but you'll never get there with SOA.  This is worth a whole other blog, but if someone claims they can MapReduce across a SOA-backed enterprise system, good luck.

To me, the big data technology pivot is about commoditizing your data.  It's not necessarily the tool (Hadoop, Storm, etc.), but the fact that your developers and analysts can now "play" with all the data – easily; you've created data-as-a-service for your organization.  In a SOA world, or even worse a stovepiped world, you've added data procurement and refresh overhead, and you'll never get where you need to be.

Commoditization of data sounds great, right?  Except that not everyone in your organization has authorization to every piece of data.  You need fine-grained access controls.  It's not that NoSQL can't give you this – it can, just like relational databases can – it's a context problem.  The big data tools are there to build value; the relational databases are there to build an application.  The latter feels safer, doesn't it?

Stove-piped applications are dead.

The killer app is the fact that you can rapidly build lenses into your data in minutes and add functionality in days – because you've commoditized your data.  Applications are lenses into your data; they are a mashup of functions and views – it's all about the data.

Learning curve – this one is pretty simple: the talent just isn't there in most organizations.  This is a combination of open source moving so fast, fear of the technology pivot, and the churn of developers working on legacy applications.

Unfortunately you can't just buy Hadoop and expect magic to happen.  You need to invest developer capital on top of that Hadoop investment.  Administration has also been a challenge: distributed applications are not simple to install, monitor, or maintain.

In recent history, there has been a push to make the administration leg easier.  Tools like Puppet, Mesos, YARN, Kubernetes, Slider, and Docker are saving head pounding around the world.  On-demand, as-a-service big data tools are here, even on-premise – this is the first step in collapsing the learning curve problem.

What’s left to do?

We need the glue between the applications and the data tier.

API.

By providing an abstraction API on top of your data tier, you can enforce security while also commoditizing your data.  This isn't a new idea – Google App Engine, Cloudant, and Continuuity have touched on it.  Focus on indexing patterns; developers shouldn't need to worry about whether it's HBase or Cassandra under the covers.  However, the current hosted solutions don't cut it – each ends up being a hosted stovepipe.
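To make that concrete, here's a hypothetical sketch of what such an abstraction could look like to an application developer – to be clear, this is illustrative only and is not EzBake's actual API.  The caller passes its authorizations and an indexing pattern, and whether HBase, Cassandra, or Accumulo sits underneath is the platform's problem:

import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch only; this is not EzBake's actual API.
public interface DataService
{
	// Fine-grained query for building application "lenses".  The caller's
	// authorizations are enforced on every record before it is returned.
	Iterator<Map<String, Object>> query(String callerAuths, String index, Map<String, Object> criteria);

	// Writes carry object-level visibility so downstream reads can be filtered.
	void put(String callerAuths, String index, String visibility, Map<String, Object> record);

	// Coarse-grained scan for large scale analytics over the same data.
	Iterator<Map<String, Object>> scan(String callerAuths, String index);
}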

And no, this isn't the same thing as SOA.  The abstractions provide fine-grained query in a common way, but also coarse-grained access for large scale analytics.  You aren't keeping up with some other application's API; you are all working from the same cloth.

Our team has spent the last year working on EzBake.  The idea behind EzBake is to create that abstraction layer API between your apps and your data that also enforces object level security.  This allows technology swap under the API without breaking the API contract with the developers.

More importantly, the abstraction layer allows us to distribute questions across the data and return results – with application context.  We now have indexes that enable application building (the lenses) while at the same time providing the data commoditization.  What the heck, we also threw in classic PaaS provisioning of the applications and the abstraction layer.

Easy.  Secure.

EzBake will be open sourced soon.  Our hope is the community can contribute to our software; we call it a data productivity platform, not PaaS.  We just have the starting point; there's work left to do, and we need to do it together.

Let’s make big data a happy place with unicorns and rainbows.  It’s too scary right now.

After all, it’s all about the data.