
Data Mining Request.....

cdiddy70

Hall of Fame
At the risk of sounding like sour grapes (though I did mention it before the game):
Can someone determine how many times, if ever, a B1G team has played two consecutive conference road night games against teams coming off a bye?

Send the results to Jeans and Delaney whose Network doesn't exist without us.
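For anyone who wants to take a crack at it, the question boils down to a check over a per-game table. Here is a rough Python sketch; the `GameRow` fields are hypothetical columns that someone would have to fill in by hand or by scraping, and "consecutive" is assumed to mean back-to-back games in schedule order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameRow:
    # All fields are hypothetical columns of a hand-built schedule table.
    road: bool              # the team played away from home
    conference: bool        # the game was a conference game
    night: bool             # the game kicked off at night
    opponent_off_bye: bool  # the opponent was coming off a bye


def qualifies(game: GameRow) -> bool:
    """True for a conference road night game against a team off a bye."""
    return game.road and game.conference and game.night and game.opponent_off_bye


def count_back_to_back(season: List[GameRow]) -> int:
    """Count pairs of adjacent games (in schedule order) that both qualify."""
    return sum(
        1
        for first, second in zip(season, season[1:])
        if qualifies(first) and qualifies(second)
    )
```

Run that over every team-season and add up the counts. If "consecutive" should instead mean consecutive road games with home games in between ignored, filter the season down to road games before calling `count_back_to_back`.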
 
Back from vacation. If anyone wants to help with this, finding a site with "clean" schedules would be very helpful. Preferably the schedule would note the bye weeks, which would make it far easier than writing code to determine them. It wouldn't be so bad if Thursday/Friday games didn't exist. By clean I mean just the schedule in text, without all the pictures, links, etc.
 
Looked for this a bit, but the closest I could find was CFB Data Warehouse. Unfortunately, they split each team's results into 5-year windows rather than one continuous list. Take a look at their 1990-94 results for tOSU here. Looking at the page source, it isn't very modern: the HTML tags are weird and there's no CSS. But there might be an HTML-reading package in some language. I don't know.

How are you planning on approaching this? Are you using advanced Python/Django?
 
Thinking about this, it would be much simpler if the data were in XML. Maybe using a DOM or StAX package in Java to access it would be simple. Is there a way to convert the data we need into an XML document, or is there a simpler way to do it?

In any case, the first step is to read from the website, and I don't know how to do that.
 
You are giving me too much credit... I'm pretty much a novice. I use Excel to read from websites, then sort and compute the data.

I did this on a large scale using a macro that could extract large portions of data from autotrader.com. I will take a look at the site you linked.
 
The B1G's been around for a long time. Calculating a conservative 10 teams for 100 years (with one page having the data for 5 years) comes to 200 pages. And then you have the problem of looking at OOC team schedules as well. That's way too many to manually read, IMO.

I have a few other things to do, but eventually I'll get around to this project using Python/Django... hopefully. But if you manage to compile an Excel file out of these results, let me know. I'll write some Java code to go through them and spit out the results.
 
Is there a way to look at all the schedules at one time? That would be a lot easier. The way I have done it before is to have Excel import the information from every page, using a macro to change the URL for each new page. So it is not exactly manual, since the macro does the work, but it would certainly take a little time to get through all those schedules.

If I do it, I will have it do every team for the last 50 years or so.
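The "change the URL for each new page" trick could look something like this outside of Excel. This is only a sketch: `URL_TEMPLATE` is a made-up placeholder, not the real address pattern of any site, and lining the windows up on years divisible by 5 is a guess based on the 1990-94 page mentioned earlier.

```python
from typing import Iterator, List, Tuple

# Hypothetical address pattern; the real site's URLs would have to be checked.
URL_TEMPLATE = "https://example.com/schedules/{team}/{start}-{end}.html"


def five_year_windows(first_year: int, last_year: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) windows such as (1990, 1994) covering the given years."""
    start = first_year - first_year % 5  # assumed: windows begin on multiples of 5
    while start <= last_year:
        yield start, start + 4
        start += 5


def schedule_urls(teams: List[str], first_year: int, last_year: int) -> List[str]:
    """Build one URL per team per 5-year window."""
    return [
        URL_TEMPLATE.format(team=team, start=start, end=end)
        for team in teams
        for start, end in five_year_windows(first_year, last_year)
    ]
```

Each URL in the list would then be fetched and parsed in turn, which is the same job the macro does page by page.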
 
I have been in touch with a highly respected authority on major college football statistical rankings, who I will not identify.
  • His model does not take into account bye weeks, overtimes, or night games.
  • He thinks that those issues could be important factors in some situations, but doesn't have time to collect that data and feels that these things average out during the year.
  • He didn't know of an easy way or site to collect the data.
  • He recommended www.scoresway.com, which lists games, attendance and time of day for each game. I looked and I don't see archival data there for more than one year. Possibly they might be able to download that for you.
  • We both agree that the best way to capture a bye week situation would be to create a variable each week to capture the number of days since the team's last game. This could then be used in a model to capture the bye week.
  • To me the interesting point is not if it ever happened but rather what is the benefit of having a bye week, a night game, an opponent who has an overtime...or a combination of the above.

If you folks collect these data, I could model this for you statistically.
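The days-since-last-game variable described above could be computed roughly like this in Python. It is a sketch under assumptions: the `Game` record and its fields are hypothetical, and the 10-day cutoff for "coming off a bye" is my own guess, picked so that a Thursday game followed by a Saturday game the next week (9 days) does not count as a bye.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Assumed cutoff: 10 or more days of rest counts as coming off a bye.
BYE_THRESHOLD_DAYS = 10


@dataclass
class Game:
    day: date       # calendar date of the game
    opponent: str   # opponent name, kept only for readability


def rest_days(schedule: List[Game]) -> List[Tuple[Game, Optional[int]]]:
    """Pair each game with the days since the previous game (None for the opener)."""
    ordered = sorted(schedule, key=lambda g: g.day)
    result: List[Tuple[Game, Optional[int]]] = []
    previous: Optional[Game] = None
    for game in ordered:
        gap = (game.day - previous.day).days if previous is not None else None
        result.append((game, gap))
        previous = game
    return result


def off_bye(gap: Optional[int]) -> bool:
    """True when the rest gap meets the assumed bye threshold."""
    return gap is not None and gap >= BYE_THRESHOLD_DAYS
```

Working from dates rather than week numbers is what takes care of the Thursday/Friday games. The raw gap can go straight into a model, with `off_bye` as the yes/no version.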
 
I don't like the scoresway site. Too much fancy html stuff. And you're right, I didn't find historical records.

This is a great off-season project. I won't be able to get to it until January, unfortunately.

The night-game and overtime information isn't available on cfbdatawarehouse. ESPN has that information, but not on the same page, and they have way too much fancy HTML and JavaScript in their source for it to be useful. ESPN's records only go back to 2002 anyway.

My friend mentioned that there are some command-line tools like wget and curl that work with pure-text websites. I have to look into this. And there is apparently a neat Python-based web page parser called Beautiful Soup. I'll take a look once I am done moving and get settled in.
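As a rough idea of what the parsing step might look like, here is a sketch that pulls the cell text out of an old-style HTML table. It uses Python's built-in `html.parser` module as a stand-in, not Beautiful Soup itself, so it runs without installing anything. The sample rows in the usage note are made up, and a real schedule page may be laid out differently.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> both land here.
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows if row]
```

Calling `parse_rows("<TR><TD>9/7/1991</TD><TD>Opponent A</TD></TR>")` gives `[["9/7/1991", "Opponent A"]]`. Because a new `<td>` simply starts a new cell, pages that leave out the closing `</td>` tags should still come through.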
 