hockey_scraper
The hockey_scraper module contains all of the functions used for scraping.
Scraping
There are three ways to scrape games:
1. Scrape by Season:
Scrape games on a season by season level (Note: A given season is referred to by the first of the two years it spans. So you would refer to the 2016-2017 season as 2016).
import hockey_scraper
# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file (both are equivalent!!!)
hockey_scraper.scrape_seasons([2015, 2016], True)
hockey_scraper.scrape_seasons([2015, 2016], True, data_format='Csv')
# Scrapes the 2008 season without shifts and returns a json string of the data
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Json')
# Scrapes 2014 season without shifts including preseason games
hockey_scraper.scrape_seasons([2014], False, preseason=True)
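The season naming convention described above can be sketched as a pair of small helpers. The names `season_label` and `season_from_label` are hypothetical and are not part of hockey_scraper; they only illustrate how a starting year maps to the two-year span.

```python
def season_label(season: int) -> str:
    """Return the 'YYYY-YYYY' span for a season given its first year."""
    return f"{season}-{season + 1}"


def season_from_label(label: str) -> int:
    """Return the first year of a 'YYYY-YYYY' span, e.g. '2016-2017' -> 2016."""
    start, end = (int(part) for part in label.split("-"))
    if end != start + 1:
        raise ValueError(f"not a single-season span: {label!r}")
    return start


print(season_label(2016))
print(season_from_label("2016-2017"))
```

The integer returned by `season_from_label` is the value you would put in the list passed to `scrape_seasons`.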
2. Scrape by Game:
Scrape a provided list of games. All game IDs can be found using this link (you need to adjust the dates in the URL).
import hockey_scraper
# Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)
# Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a Json string of the data
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Json')
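The game IDs in these examples appear to follow a fixed layout: a four-digit season, a two-digit code (`02` in every example here), and a four-digit game number. The sketch below splits an ID on that assumption. `split_game_id` and `GameId` are hypothetical names, not part of hockey_scraper, and the meaning of the two-digit code is not stated on this page, so it is left uninterpreted.

```python
from typing import NamedTuple


class GameId(NamedTuple):
    season: int      # first year of the season, e.g. 2016
    type_code: str   # two-digit code; "02" in the examples on this page
    number: int      # game number within the season


def split_game_id(game_id: int) -> GameId:
    """Split a 10-digit game id into its assumed parts."""
    text = str(game_id)
    if len(text) != 10 or not text.isdigit():
        raise ValueError(f"expected a 10-digit game id, got {game_id!r}")
    return GameId(int(text[:4]), text[4:6], int(text[6:]))


print(split_game_id(2016020001))
```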
3. Scrape by Date Range:
Scrape all games between a specified date range. All dates must be written in a “yyyy-mm-dd” format.
import hockey_scraper
# Scrapes all games between date range without shifts and stores the data in a Csv file (both are equivalent!!!)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False, preseason=False)
# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a Json string of the data
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Json')
# Scrapes all games from 2014-09-15 to 2014-11-01 with shifts including preseason games
hockey_scraper.scrape_date_range('2014-09-15', '2014-11-01', True, preseason=True)
Notes:
1. For all three functions you must specify with a boolean whether to also scrape shifts (TOI tables). The Play by Play is always scraped.
2. When scraping by date range or by season, preseason games aren’t scraped unless otherwise specified.
3. For all three functions the scraped data is written to a Csv file unless you specify that it should be returned as a Json string.
4. The Json string returned is structured like so:
# When scraping by game or date range
"
{
'pbp': [
Plays
],
'shifts': [
Shifts
]
}
"
# When scraping by season
"
{
'pbp': {
'Seasons': [
Plays
]
},
'shifts': {
'Seasons': [
Shifts
]
}
}
"
# For example, if you scraped the 2008 and 2009 seasons the Json will look like this:
"
{
'pbp': {
'2008': [
Plays
],
'2009': [
Plays
]
},
'shifts': {
'2008': [
Shifts
],
'2009': [
Shifts
]
}
}
"
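As a rough illustration of working with the by-season structure, the sketch below counts rows per season. It assumes the returned string is valid JSON (the layouts above are written with single quotes for readability). The sample data, its keys (`Event`, `Player`), and the helper name `rows_per_season` are made up for this example and do not come from hockey_scraper.

```python
import json


def rows_per_season(scraped_json: str) -> dict:
    """Count pbp and shift rows for each season key in the Json string."""
    data = json.loads(scraped_json)
    shifts_by_season = data.get("shifts", {})
    counts = {}
    for season, plays in data["pbp"].items():
        counts[season] = {
            "pbp": len(plays),
            "shifts": len(shifts_by_season.get(season, [])),
        }
    return counts


# Hypothetical stand-in for what scrape_seasons([2008, 2009], True, data_format='Json') returns
sample = json.dumps({
    "pbp": {"2008": [{"Event": "FAC"}, {"Event": "SHOT"}], "2009": [{"Event": "GOAL"}]},
    "shifts": {"2008": [{"Player": "EXAMPLE PLAYER"}], "2009": []},
})

print(rows_per_season(sample))
```

When scraping by game or date range there is no season level, so `data["pbp"]` and `data["shifts"]` would be the lists themselves.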
Functions
Scrape Functions
Functions to scrape by season, by game, and by date range
-
hockey_scraper.scrape_functions.
check_data_format
(data_format)¶ Checks that the data_format specified (if any) is either None, ‘Csv’, or ‘json’. It exits the program with an error message if the input isn’t valid.
Parameters: data_format – data_format provided Returns: None
-
hockey_scraper.scrape_functions.
check_valid_dates
(from_date, to_date)¶ Check if it’s a valid date range
Parameters: - from_date – date to scrape from
- to_date – date to scrape to
Returns: None, will exit if not valid
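One way such a check could look is sketched below using `datetime.strptime`. This is not the library's implementation: the helper name `is_valid_date_range` is hypothetical, it returns a boolean rather than exiting, and treating "valid" as "both dates parse and the start is not after the end" is an assumption.

```python
from datetime import datetime


def is_valid_date_range(from_date: str, to_date: str) -> bool:
    """Return True if both strings parse as yyyy-mm-dd and from_date <= to_date."""
    try:
        start = datetime.strptime(from_date, "%Y-%m-%d")
        end = datetime.strptime(to_date, "%Y-%m-%d")
    except ValueError:
        # Raised for malformed strings and impossible dates such as month 13
        return False
    return start <= end


print(is_valid_date_range("2016-10-10", "2016-10-20"))
print(is_valid_date_range("2016-10-20", "2016-10-10"))
```

Note that `strptime` also accepts months and days without a leading zero, so a string like '2015-1-1' parses under this sketch.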
-
hockey_scraper.scrape_functions.
scrape_date_range
(from_date, to_date, if_scrape_shifts, data_format='csv', preseason=False)¶ Scrape games in given date range
Parameters: - from_date – date you want to scrape from
- to_date – date you want to scrape to
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
- data_format – format you want data in - csv or json (csv is default)
- preseason – Boolean indicating whether to include preseason games (default is False)
Returns: Json string or None
-
hockey_scraper.scrape_functions.
scrape_games
(games, if_scrape_shifts, data_format='csv')¶ Scrape a list of games
Parameters: - games – list of game_ids
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
- data_format – format you want data in - csv or json (csv is default)
Returns: Json string or None
-
hockey_scraper.scrape_functions.
scrape_list_of_games
(games, if_scrape_shifts)¶ Given a list of game_id’s (and a date for each game) it scrapes them
Parameters: - games – list of [game_id, date]
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
Returns: DataFrame of pbp info, also shifts if specified
-
hockey_scraper.scrape_functions.
scrape_seasons
(seasons, if_scrape_shifts, data_format='csv', preseason=False)¶ Given list of seasons it scrapes all the seasons
Parameters: - seasons – list of seasons
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
- data_format – format you want data in - csv or json (csv is default)
- preseason – Boolean indicating whether to include preseason games (default is False)
Returns: Json string or None
-
hockey_scraper.scrape_functions.
to_csv
(file_name, pbp_df, shifts_df)¶ Write DataFrame(s) to csv file(s)
Parameters: - file_name – name of file
- pbp_df – pbp DataFrame
- shifts_df – shifts DataFrame
Returns: None
Game Scraper
This module contains code to scrape data for a single game
-
hockey_scraper.game_scraper.
check_goalie
(row)¶ Checks for bad goalie names (you can tell by them having no player id)
Parameters: row – df row Returns: None
-
hockey_scraper.game_scraper.
combine_espn_html_pbp
(html_df, espn_df, game_id, date, away_team, home_team)¶ Merge the coordinates from the espn feed into the html DataFrame
Parameters: - html_df – DataFrame with info from html pbp
- espn_df – DataFrame with info from espn pbp
- game_id – json game id
- date – ex: 2016-10-24
- away_team – away team
- home_team – home team
Returns: merged DataFrame
-
hockey_scraper.game_scraper.
combine_html_json_pbp
(json_df, html_df, game_id, date)¶ Join both data sources
Parameters: - json_df – json pbp DataFrame
- html_df – html pbp DataFrame
- game_id – id of game
- date – date of game
Returns: finished pbp
-
hockey_scraper.game_scraper.
combine_players_lists
(json_players, roster_players, game_id)¶ Combine the json list of players (which contains id’s) with the list in the roster html
Parameters: - json_players – dict of all players with id’s
- roster_players – dict with home and away keys for players
- game_id – id of game
Returns: dict containing home and away keys -> which contains list of info on each player
-
hockey_scraper.game_scraper.
get_players_json
(json)¶ Return dict of players for that game
Parameters: json – gameData section of json Returns: dict of players->keys are the name (in uppercase)
-
hockey_scraper.game_scraper.
get_teams_and_players
(game_json, roster, game_id)¶ Get list of players and teams for game
Parameters: - game_json – json pbp for game
- roster – players from roster html
- game_id – id for game
Returns: dict for both - players and teams
-
hockey_scraper.game_scraper.
print_errors
()¶ Print errors with scraping
Returns: None
-
hockey_scraper.game_scraper.
scrape_game
(game_id, date, if_scrape_shifts)¶ Scrapes the info for the game. The pbp is always scraped; whether or not to scrape the shifts is left up to the user.
Parameters: - game_id – game to scrape
- date – ex: 2016-10-24
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
Returns: DataFrame of pbp info and, if shifts were requested, a DataFrame with shift info (otherwise just None)
-
hockey_scraper.game_scraper.
scrape_pbp
(game_id, date, roster, game_json, players, teams)¶ Automatically scrapes the json and html. If the json is empty, the html picks up some of the slack and the espn xml is also scraped for coordinates.
Parameters: - game_id – json game id
- date – date of game
- roster – list of players in pre game roster
- game_json – json pbp for game
- players – dict of players
- teams – dict of teams
Returns: DataFrame with info or None if it fails
-
hockey_scraper.game_scraper.
scrape_shifts
(game_id, players, date)¶ Scrape the Shift charts (or TOI tables)
Parameters: - game_id – json game id
- players – dict of players with numbers and id’s
- date – date of game
Returns: DataFrame with info or None if it fails
Html PBP
This module contains functions to scrape the Html Play by Play for any given game
-
hockey_scraper.html_pbp.
add_event_players
(event_dict, event, players, home_team)¶ Add players involved in the event to event_dict
Parameters: - event_dict – dict of parsed event stuff
- event – fixed up html
- players – dict of players and id’s
- home_team – home team
Returns: None
-
hockey_scraper.html_pbp.
add_event_team
(event_dict, event)¶ Add event team for event
Parameters: - event_dict – dict of event info
- event – list with parsed event info
Returns: None
-
hockey_scraper.html_pbp.
add_home_zone
(event_dict, home_team)¶ Determines the zone relative to the home team and adds it to the event
Parameters: - event_dict – dict of event info
- home_team – home team
Returns: None
-
hockey_scraper.html_pbp.
add_period
(event_dict, event)¶ Add period for event
Parameters: - event_dict – dict of event info
- event – list with parsed event info
Returns: None
-
hockey_scraper.html_pbp.
add_score
(event_dict, event, current_score, home_team)¶ Change if someone scored...also change current score
Parameters: - event_dict – dict of parsed event stuff
- event – event info from pbp
- current_score – current score in game
- home_team – home team for game
Returns: None
-
hockey_scraper.html_pbp.
add_strength
(event_dict, home_players, away_players)¶ Get strength for event -> It’s home then away
Parameters: - event_dict – dict of event info
- home_players – list of players for home team
- away_players – list of players for away team
Returns: None
-
hockey_scraper.html_pbp.
add_time
(event_dict, event)¶ Fill in time and seconds elapsed
Parameters: - event_dict – dict of parsed event stuff
- event – event info from pbp
Returns: None
-
hockey_scraper.html_pbp.
add_type
(event_dict, event)¶ Add “type” for event -> either a penalty or a shot type
Parameters: - event_dict – dict of event info
- event – list with parsed event info
Returns: None
-
hockey_scraper.html_pbp.
add_zone
(event_dict, play_description)¶ Determine which zone the play occurred in (unless one isn’t listed) and add it to dict
Parameters: - event_dict – dict of event info
- play_description – the zone would be included here
Returns: Off, Def, Neu, or NA
-
hockey_scraper.html_pbp.
clean_html_pbp
(html)¶ Get rid of html and format the data
Parameters: html – the requested url Returns: a list with all the info
-
hockey_scraper.html_pbp.
get_pbp
(game_id)¶ Given a game_id it returns the raw html Ex: http://www.nhl.com/scores/htmlreports/20162017/PL020475.HTM
Parameters: game_id – the game Returns: raw html of game
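Judging by the example URL, the report address is built from the two years of the season, a two-letter report prefix, and the last six digits of the game id. The sketch below reproduces that pattern. `html_report_url` is a hypothetical helper, not part of hockey_scraper, and the pattern is inferred only from the example URLs on this page (`PL` here, with `RO` and `TV` appearing in the roster and shift examples).

```python
def html_report_url(game_id: int, report: str = "PL") -> str:
    """Build an NHL html report URL following the pattern of the documented example."""
    text = str(game_id)
    season = int(text[:4])       # e.g. 2016
    game_part = text[4:]         # e.g. "020475"
    return "http://www.nhl.com/scores/htmlreports/{}{}/{}{}.HTM".format(
        season, season + 1, report, game_part
    )


print(html_report_url(2016020475))
```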
-
hockey_scraper.html_pbp.
get_penalty
(play_description)¶ Get the penalty info
Parameters: play_description – description of play field Returns: penalty info
-
hockey_scraper.html_pbp.
get_player_name
(number, players, team, home_team)¶ This function is used for the description field in the html. Given a last name and a number it returns the player’s full name and id.
Parameters: - number – player’s number
- players – all players with info
- team – team of player
- home_team – home team
Returns: dict with full name and id
-
hockey_scraper.html_pbp.
get_soup
(game_html)¶ Uses Beautiful Soup to parse the html document. Some parsers work for some pages but don’t work for others. I’m not sure why, so I just try them all here in order
Parameters: game_html – html doc Returns: “soupified” html and player_shifts portion of html (it’s a bunch of td tags)
-
hockey_scraper.html_pbp.
if_valid_event
(event)¶ Checks if it’s a valid event (‘#’ is meaningless and I don’t like the other ones) to parse
Parameters: event – list of stuff in pbp Returns: boolean
-
hockey_scraper.html_pbp.
parse_event
(event, players, home_team, if_plays_in_json, current_score)¶ Receives an event and parses it
Parameters: - event – event type
- players – players in game
- home_team – home team
- if_plays_in_json – If the pbp json contains the plays
- current_score – current score for both teams
Returns: dict with info
-
hockey_scraper.html_pbp.
parse_html
(html, players, teams, if_plays_in_json)¶ Parse html game pbp
Parameters: - html – raw html
- players – players in the game (from json pbp)
- teams – dict with home and away teams
- if_plays_in_json – If the pbp json contains the plays
Returns: DataFrame with info
-
hockey_scraper.html_pbp.
populate_players
(event_dict, players, away_players, home_players)¶ Populate away and home player info (and num skaters on each side) NOTE: Could probably do this in a much neater way...
Parameters: - event_dict – dict with event info
- players – all players in game and info
- away_players – players for away team
- home_players – players for home team
Returns: None
-
hockey_scraper.html_pbp.
return_name_html
(info)¶ In the PBP html the name is in a format like: ‘Center - MIKE RICHARDS’ Some also have a hyphen in their last name so can’t just split by ‘-‘
Parameters: info – position and name Returns: name
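A minimal sketch of one way to handle the hyphen problem: split only on the first ' - ' (hyphen surrounded by spaces), so a hyphen inside a surname is left alone. This is an illustration under that assumption, not the library's code, and `name_from_position_field` is a hypothetical name.

```python
def name_from_position_field(info: str) -> str:
    """Return the name from a 'Position - NAME' string, keeping hyphenated surnames intact."""
    position, separator, name = info.partition(" - ")
    if not separator:
        # No ' - ' found: fall back to the whole string
        return info.strip()
    return name.strip()


print(name_from_position_field("Center - MIKE RICHARDS"))
print(name_from_position_field("Defense - OLIVER EKMAN-LARSSON"))
```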
-
hockey_scraper.html_pbp.
scrape_game
(game_id, players, teams, if_plays_in_json)¶ Scrape the data for the game
Parameters: - game_id – game to scrape
- players – dict with player info
- teams – dict with home and away teams
- if_plays_in_json – boolean, if the plays are in the json
Returns: DataFrame of game info or None if it fails
-
hockey_scraper.html_pbp.
shot_type
(play_description)¶ Determine the shot type of the play (unless one isn’t listed)
Parameters: play_description – the type would be in here Returns: the type if it’s there (otherwise just NA)
-
hockey_scraper.html_pbp.
strip_html_pbp
(td)¶ Strip html tags and such
Parameters: td – pbp Returns: list of plays (which contain a list of info) stripped of html
Json PBP
This module contains functions to scrape the Json Play by Play for any given game
-
hockey_scraper.json_pbp.
change_event_name
(event)¶ Change event names from json style to html ex: BLOCKED_SHOT to BLOCK
Parameters: event – event type Returns: fixed event type
-
hockey_scraper.json_pbp.
get_pbp
(game_id)¶ Given a game_id it returns the raw json Ex: http://statsapi.web.nhl.com/api/v1/game/2016020475/feed/live
Parameters: game_id – the game Returns: raw json of game or None if couldn’t get game
-
hockey_scraper.json_pbp.
get_teams
(pbp_json)¶ Get teams
Parameters: pbp_json – pbp json Returns: dict with home and away
-
hockey_scraper.json_pbp.
parse_event
(event)¶ Parses a single event when the info is in a json format
Parameters: event – json of event Returns: dictionary with the info
-
hockey_scraper.json_pbp.
parse_json
(game_json, game_id)¶ Scrape the json for a game
Parameters: - game_json – raw json
- game_id – game id for game
Returns: Either a DataFrame with info for the game
-
hockey_scraper.json_pbp.
scrape_game
(game_id)¶ Used for debugging. HTML depends on json so can’t follow this structure
Parameters: game_id – game to scrape Returns: DataFrame of game info
Espn PBP
This module contains code to scrape coordinates for games off of espn for any given game
-
hockey_scraper.espn_pbp.
event_type
(play_description)¶ Returns the event type (ex: a SHOT or a GOAL...etc) given the event description
Parameters: play_description – description of play Returns: event
-
hockey_scraper.espn_pbp.
get_espn
(date, home_team, away_team)¶ Gets the ESPN pbp feed Ex: http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId=400885300
Parameters: - date – date of the game
- home_team – home team
- away_team – away team
Returns: raw xml
-
hockey_scraper.espn_pbp.
get_espn_game_id
(date, home_team, away_team)¶ Scrapes the day’s schedule and gets the id for the given game Ex: http://www.espn.com/nhl/scoreboard?date=20161024
Parameters: - date – format-> YearMonthDay-> 20161024
- home_team – home team
- away_team – away team
Returns: 9 digit game id
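The date format expected here differs from the “yyyy-mm-dd” format used elsewhere on this page. A small conversion sketch follows; `espn_date` and `espn_scoreboard_url` are hypothetical helpers, and the URL pattern is taken only from the example above.

```python
from datetime import datetime


def espn_date(date: str) -> str:
    """Convert 'yyyy-mm-dd' to the 'YearMonthDay' form, e.g. '2016-10-24' -> '20161024'."""
    return datetime.strptime(date, "%Y-%m-%d").strftime("%Y%m%d")


def espn_scoreboard_url(date: str) -> str:
    """Build the scoreboard URL for a 'yyyy-mm-dd' date, following the documented example."""
    return "http://www.espn.com/nhl/scoreboard?date=" + espn_date(date)


print(espn_scoreboard_url("2016-10-24"))
```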
-
hockey_scraper.espn_pbp.
get_game_ids
(response)¶ Get game_ids for date from doc
Parameters: response – doc Returns: list of game_ids
-
hockey_scraper.espn_pbp.
get_teams
(response)¶ Extract Teams for date from doc
Parameters: response – doc Returns: list of teams
-
hockey_scraper.espn_pbp.
parse_espn
(espn_xml)¶ Parse feed
Parameters: espn_xml – raw xml of feed Returns: DataFrame with info
-
hockey_scraper.espn_pbp.
parse_event
(event)¶ Parse each event. In the string each field is separated by a ‘~’. Relevant for here: The first two are the x and y coordinates. And the 4th and 5th are the time elapsed and period.
Parameters: event – string with info Returns: return dict with relevant info
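The field layout described above can be sketched as follows. The sample string is made up, `parse_espn_fields` is a hypothetical name, and treating the coordinates as numbers (and the 4th and 5th fields as time elapsed and period, in that order) is an assumption based on the description, not on the library's code.

```python
def parse_espn_fields(event: str) -> dict:
    """Pull the coordinates, time elapsed, and period out of a '~'-separated event string."""
    fields = event.split("~")
    if len(fields) < 5:
        raise ValueError("expected at least 5 '~'-separated fields")
    return {
        "x": float(fields[0]),        # 1st field
        "y": float(fields[1]),        # 2nd field
        "time_elapsed": fields[3],    # 4th field, kept as text
        "period": fields[4],          # 5th field, kept as text
    }


# Hypothetical event string, for illustration only
print(parse_espn_fields("-78~10~505~0:27~1~extra"))
```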
-
hockey_scraper.espn_pbp.
scrape_game
(date, home_team, away_team)¶ Scrape the game
Parameters: - date – ex: 2016-10-24
- home_team – tricode
- away_team – tricode
Returns: DataFrame with info
Json Shifts
This module contains functions to scrape the Json toi/shifts for any given game
-
hockey_scraper.json_shifts.
fix_team_tricode
(tricode)¶ Some of the tricodes are different than how I want them
Parameters: tricode – 3 letter team name - ex: NYR Returns: fixed tricode
-
hockey_scraper.json_shifts.
get_shifts
(game_id)¶ Given a game_id it returns the raw json Ex: http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId=2010020001
Parameters: game_id – the game Returns: json or None
-
hockey_scraper.json_shifts.
parse_json
(shift_json, game_id)¶ Parse the json
Parameters: - shift_json – raw json
- game_id – id of game
Returns: DataFrame with info
-
hockey_scraper.json_shifts.
parse_shift
(shift)¶ Parse shift for json
Parameters: shift – json for shift Returns: dict with shift info
-
hockey_scraper.json_shifts.
scrape_game
(game_id)¶ Scrape the game.
Parameters: game_id – game Returns: DataFrame with info for the game
Html Shifts
This module contains functions to scrape the Html Toi Tables (or shifts) for any given game
-
hockey_scraper.html_shifts.
analyze_shifts
(shift, name, team, home_team, player_ids)¶ Analyze shifts for each player. Prior to this each player (in a dictionary) has a list with each entry being a shift.
Parameters: - shift – info on shift
- name – player name
- team – given team
- home_team – home team for given game
- player_ids – dict with info on players
Returns: dict with info for shift
-
hockey_scraper.html_shifts.
get_shifts
(game_id)¶ Given a game_id it returns a DataFrame with the shifts for both teams Ex: http://www.nhl.com/scores/htmlreports/20162017/TV020971.HTM
Parameters: game_id – the game Returns: DataFrame with all shifts or None
-
hockey_scraper.html_shifts.
get_soup
(shifts_html)¶ Uses Beautiful Soup to parse the html document. Some parsers work for some pages but don’t work for others. I’m not sure why, so I just try them all here in order
Parameters: shifts_html – html doc Returns: “soupified” html and player_shifts portion of html (it’s a bunch of td tags)
-
hockey_scraper.html_shifts.
get_teams
(soup)¶ Return the team for the TOI tables and the home team
Parameters: soup – souped up html Returns: list with team and home team
-
hockey_scraper.html_shifts.
parse_html
(html, player_ids, game_id)¶ Parse the html
Parameters: - html – cleaned up html
- player_ids – dict of home and away players
- game_id – id for game
Returns: DataFrame with info
-
hockey_scraper.html_shifts.
scrape_game
(game_id, players)¶ Scrape the game.
Parameters: - game_id – id for game
- players – list of players
Returns: DataFrame with info for the game
Schedule
This module contains functions to scrape the json schedule for any games or date range
-
hockey_scraper.json_schedule.
get_current_season
()¶ Get Season based on today’s date
Returns: season -> ex: 2016 for 2016-2017 season
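A rough sketch of how a season could be derived from a date is shown below. The cutoff month is an assumption (September is used as a placeholder); this page does not say which month the library treats as the start of a new season. `season_for` is a hypothetical name.

```python
from datetime import date


def season_for(day: date, first_month: int = 9) -> int:
    """Return the season's first year, assuming a season starts in `first_month`."""
    # first_month=9 is an assumed cutoff, not a documented one
    return day.year if day.month >= first_month else day.year - 1


print(season_for(date(2016, 10, 24)))
print(season_for(date(2017, 3, 1)))
```

Calling `season_for(date.today())` would give the equivalent of a "current season" under the same assumption.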
-
hockey_scraper.json_schedule.
get_dates
(games)¶ Given a list of game_ids it returns the date for each game
Parameters: games – list with game_id’s ex: 2016020001 Returns: list with game_id and corresponding date for all games
-
hockey_scraper.json_schedule.
get_schedule
(date_from, date_to)¶ Scrapes games in date range Ex: https://statsapi.web.nhl.com/api/v1/schedule?startDate=2010-10-03&endDate=2011-06-20
Parameters: - date_from – scrape from this date
- date_to – scrape until this date
Returns: raw json of schedule of date range
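The query string in the example URL can be assembled with `urllib.parse.urlencode`. The sketch below only builds the URL string and makes no request; `schedule_url` is a hypothetical helper, and the endpoint and parameter names are copied from the example above.

```python
from urllib.parse import urlencode

SCHEDULE_ENDPOINT = "https://statsapi.web.nhl.com/api/v1/schedule"


def schedule_url(date_from: str, date_to: str) -> str:
    """Build the schedule URL for a date range, mirroring the documented example."""
    query = urlencode({"startDate": date_from, "endDate": date_to})
    return SCHEDULE_ENDPOINT + "?" + query


print(schedule_url("2010-10-03", "2011-06-20"))
```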
-
hockey_scraper.json_schedule.
scrape_schedule
(date_from, date_to, preseason=False)¶ Calls get_schedule and scrapes the raw schedule JSON
Parameters: - date_from – scrape from this date
- date_to – scrape until this date
- preseason – Boolean indicating whether to include preseason games (default is False)
Returns: list with all the game id’s
Playing Roster
This module contains functions to scrape the Html game roster for any given game
-
hockey_scraper.playing_roster.
fix_name
(player)¶ Get rid of (A) or (C) when a player has it attached to their name
Parameters: player – list of player info -> [number, position, name] Returns: fixed list
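One way to strip the captaincy marker is a small regular expression, sketched below. It assumes the marker sits at the end of the name; `strip_captain_marker` is a hypothetical name, the sample player is made up, and this is not the library's implementation.

```python
import re

# Matches a trailing "(A)" or "(C)" plus any surrounding whitespace
_MARKER = re.compile(r"\s*\((?:A|C)\)\s*$")


def strip_captain_marker(player: list) -> list:
    """Return [number, position, name] with a trailing (A) or (C) removed from the name."""
    number, position, name = player
    return [number, position, _MARKER.sub("", name).strip()]


# Hypothetical roster entry, for illustration only
print(strip_captain_marker(["19", "C", "EXAMPLE PLAYER  (C)"]))
```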
-
hockey_scraper.playing_roster.
get_coaches
(soup)¶ scrape head coaches
Parameters: soup – html Returns: dict of coaches for game
-
hockey_scraper.playing_roster.
get_content
(roster)¶ Uses Beautiful Soup to parse the html document. Some parsers work for some pages but don’t work for others. I’m not sure why, so I just try them all here in order
Parameters: roster – doc Returns: players and coaches
-
hockey_scraper.playing_roster.
get_players
(soup)¶ scrape roster for players
Parameters: soup – html Returns: dict for home and away players
-
hockey_scraper.playing_roster.
get_roster
(game_id)¶ Given a game_id it returns the raw html Ex: http://www.nhl.com/scores/htmlreports/20162017/RO020475.HTM
Parameters: game_id – the game Returns: raw html of game
-
hockey_scraper.playing_roster.
scrape_roster
(game_id)¶ For a given game scrapes the roster
Parameters: game_id – id for game Returns: dict of players (home and away) and a dict for both head coaches