NHL Scraping Functions¶
Scraping¶
There are three ways to scrape games, plus a fourth function for scraping the schedule:
1. Scrape by Season:
Scrape games on a season by season level (Note: A given season is referred to by the first of the two years it spans. So you would refer to the 2016-2017 season as 2016).
import hockey_scraper
# Scrapes the 2015 & 2016 seasons with shifts and stores the data in a Csv file (the two calls are equivalent)
hockey_scraper.scrape_seasons([2015, 2016], True)
hockey_scraper.scrape_seasons([2015, 2016], True, data_format='Csv')
# Scrapes the 2008 season without shifts and returns a dictionary with the DataFrame
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')
# Scrapes 2014 season without shifts including preseason games
hockey_scraper.scrape_seasons([2014], False, preseason=True)
2. Scrape by Game:
Scrape a list of provided games. All game IDs can be found on the NHL schedule pages (you may need to adjust the dates in the URL).
import hockey_scraper
# Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)
# Scrapes the first game of the 2007, 2008, and 2009 seasons with shifts and returns a dictionary with the DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')
3. Scrape by Date Range:
Scrape all games between a specified date range. All dates must be written in a “yyyy-mm-dd” format.
import hockey_scraper
# Scrapes all games in the date range without shifts and stores the data in a Csv file (the two calls are equivalent)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False, preseason=False)
# Scrapes all games between 2015-01-01 and 2015-01-15 without shifts and returns a dictionary with the DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-01-01', '2015-01-15', False, data_format='Pandas')
# Scrapes all games from 2014-09-15 to 2014-11-01 with shifts including preseason games
hockey_scraper.scrape_date_range('2014-09-15', '2014-11-01', True, preseason=True)
4. Scrape Schedule
Scrape the schedule between any given date range, for past and future games. All dates must be written in “yyyy-mm-dd” format. The default data_format is ‘Pandas’; unlike the other functions, this returns a DataFrame rather than a dictionary. The columns returned are: [‘game_id’, ‘date’, ‘venue’, ‘home_team’, ‘away_team’, ‘start_time’, ‘home_score’, ‘away_score’, ‘status’]
import hockey_scraper
sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")
Saving Files
The option also exists to save the scraped files in another directory. This speeds up re-scraping any games, since the documents needed are already on disk. It’s also useful if you want to grab extra information from them later, as some of the files contain a lot more than what is parsed. To do this, use the ‘docs_dir’ keyword. Passing the boolean value True will create (or refer to an already created) directory called hockey_scraper data in the home directory; alternatively, you can pass the path of a directory as a string. Given a valid directory, the scraper first checks whether each page has already been saved there (saving the time of scraping it); if it hasn’t, the page is fetched from the source and saved in the given directory.
Sometimes you may have already scraped and saved a file but want to re-scrape it from the source and save it again (this may seem strange, but the NHL frequently fixes mistakes, so you may want to update what you have). This can be done by setting the keyword argument rescrape to True.
import hockey_scraper
# Path to the given directory
# Can also be True if you want the scraper to take care of it
USER_PATH = "/...."
# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
# Also includes a path for an existing directory for the scraped files to be placed in or retrieved from.
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)
# One can choose to re-scrape previously saved files by setting the keyword argument rescrape=True
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)
Additional Notes:
1. For the three game-scraping functions you must specify with a boolean whether to also scrape shifts (TOI tables). The Play by Play is always scraped.
2. When scraping by date range or by season, preseason games aren’t scraped unless otherwise specified. Preseason games are scraped at your own risk; there is no guarantee scraping will work or that the files even exist.
3. For the three game-scraping functions the scraped data is written to a Csv file unless you specify that the DataFrames should be returned.
4. The Dictionary with the DataFrames (and scraping errors) returned by setting data_format=’Pandas’ is structured like:
{
# Both of these are always included
'pbp': pbp_df,
'errors': scraping_errors,
# This is only included when the argument 'if_scrape_shifts' is set equal to True
'shifts': shifts_df
}
5. When providing a directory, it must be a valid one; it will not be created for you. If it’s invalid you’ll get an error message, but scraping will proceed as if no directory was provided.
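The dictionary described in note 4 can be unpacked like this. This is a minimal sketch using empty placeholder DataFrames in place of real scraped output (an actual call to one of the scrape functions with data_format='Pandas' would populate them):

```python
import pandas as pd

# Stand-in for what the scrape functions return with data_format='Pandas'
scraped_data = {
    'pbp': pd.DataFrame(),     # play-by-play events
    'errors': '',              # string describing any scraping errors
    'shifts': pd.DataFrame(),  # only present when shifts were requested
}

pbp_df = scraped_data['pbp']
shifts_df = scraped_data.get('shifts')  # .get() since the key may be absent

# 'errors' is always included; check it after every run
if scraped_data['errors']:
    print("Scraping errors:", scraped_data['errors'])
```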
Scrape Functions¶
Functions to scrape by season, games, and date range
-
hockey_scraper.nhl.scrape_functions.
print_errors
()¶ Print errors with scraping.
Also accumulates the errors in the “error” string; the string isn’t printed directly because it would all end up on one line.
Returns: None
-
hockey_scraper.nhl.scrape_functions.
scrape_date_range
(from_date, to_date, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False)¶ Scrape games in given date range
Parameters: - from_date – date you want to scrape from
- to_date – date you want to scrape to
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
- data_format – format you want data in - csv or pandas (csv is default)
- preseason – Boolean indicating whether to include preseason games (default is False). Preseason scraping may or may not work and comes with no guarantees.
- rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default = False)
- docs_dir – Directory that either contains previously scraped docs or is where you want them deposited after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that; it must be a valid directory, otherwise it won’t work (it won’t be created for you). When False the files won’t be saved.
Returns: Dictionary with DataFrames and errors or None
-
hockey_scraper.nhl.scrape_functions.
scrape_games
(games, if_scrape_shifts, data_format='csv', rescrape=False, docs_dir=False)¶ Scrape a list of games
Parameters: - games – list of game_ids
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
- data_format – format you want data in - csv or pandas (csv is default)
- rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
- docs_dir – Directory that either contains previously scraped docs or is where you want them deposited after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that; it must be a valid directory, otherwise it won’t work (it won’t be created for you). When False the files won’t be saved.
Returns: Dictionary with DataFrames and errors or None
-
hockey_scraper.nhl.scrape_functions.
scrape_list_of_games
(games, if_scrape_shifts)¶ Given a list of game_ids (with a date for each game), scrapes them
Parameters: - games – list of [game_id, date]
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
Returns: DataFrame of pbp info, also shifts if specified
-
hockey_scraper.nhl.scrape_functions.
scrape_schedule
(from_date, to_date, data_format='pandas', rescrape=False, docs_dir=False)¶ Scrape the games schedule in a given range.
Parameters: - from_date – date you want to scrape from
- to_date – date you want to scrape to
- data_format – format you want data in - csv or pandas (pandas is default)
- rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default = False)
- docs_dir – Directory that either contains previously scraped docs or is where you want them deposited after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that; it must be a valid directory, otherwise it won’t work (it won’t be created for you). When False the files won’t be saved.
Returns: DataFrame or None
-
hockey_scraper.nhl.scrape_functions.
scrape_seasons
(seasons, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False)¶ Given list of seasons it scrapes all the seasons
Parameters: - seasons – list of seasons
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
- data_format – format you want data in - csv or pandas (csv is default)
- preseason – Boolean indicating whether to include preseason games (default is False). Preseason scraping may or may not work and comes with no guarantees.
- rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
- docs_dir – Directory that either contains previously scraped docs or is where you want them deposited after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that; it must be a valid directory, otherwise it won’t work (it won’t be created for you). When False the files won’t be saved.
Returns: Dictionary with DataFrames and errors or None
Game Scraper¶
This module contains code to scrape data for a single game
-
hockey_scraper.nhl.game_scraper.
check_goalie
(row)¶ Checks for bad goalie names (identifiable by a missing player id)
Parameters: row – df row Returns: None
-
hockey_scraper.nhl.game_scraper.
combine_espn_html_pbp
(html_df, espn_df, game_id, date, away_team, home_team)¶ Merge the coordinates from the espn feed into the html DataFrame
Can’t join here on event_id because the plays are often out of order and pre-2009 games are often missing events.
Parameters: - html_df – DataFrame with info from html pbp
- espn_df – DataFrame with info from espn pbp
- game_id – json game id
- date – ex: 2016-10-24
- away_team – away team
- home_team – home team
Returns: merged DataFrame
-
hockey_scraper.nhl.game_scraper.
combine_html_json_pbp
(json_df, html_df, game_id, date)¶ Join both data sources. First try merging on event id (which is the DataFrame index) if both DataFrames have the same number of rows. If they don’t, merge on: Period, Event, Seconds_Elapsed, p1_ID.
Parameters: - json_df – json pbp DataFrame
- html_df – html pbp DataFrame
- game_id – id of game
- date – date of game
Returns: finished pbp
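That two-path merge can be sketched as follows. This is a simplified stand-alone version; the coordinate column names xC and yC are assumptions, and the merge keys are the ones listed in the docstring:

```python
import pandas as pd

def merge_pbp(json_df, html_df):
    """Sketch: align by index when row counts match, else merge on event keys."""
    keys = ['Period', 'Event', 'Seconds_Elapsed', 'p1_ID']
    if len(json_df) == len(html_df):
        # Same number of events -> rows line up, so join on the index
        return html_df.join(json_df[['xC', 'yC']])
    # Row counts differ -> fall back to the descriptive event keys
    return pd.merge(html_df, json_df[keys + ['xC', 'yC']], on=keys, how='left')
```

The `how='left'` keeps every html event even when a json row (and hence its coordinates) is missing.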
-
hockey_scraper.nhl.game_scraper.
combine_players_lists
(json_players, roster_players, game_id)¶ Combine the json list of players (which contains id’s) with the list in the roster html
Parameters: - json_players – dict of all players with id’s
- roster_players – dict with home and away keys for players
- game_id – id of game
Returns: dict containing home and away keys -> which contains list of info on each player
-
hockey_scraper.nhl.game_scraper.
get_players_json
(players_json)¶ Return dict of players for that game
Parameters: players_json – players section of json Returns: dict of players->keys are the name (in uppercase)
-
hockey_scraper.nhl.game_scraper.
get_sebastian_aho
(player)¶ This checks which Sebastian Aho it is based on the position. I have the player id’s hardcoded here.
This function is needed because “get_players_json” doesn’t control for when there are two Sebastian Aho’s (it just writes over the first one).
Parameters: player – player info Returns: Player ID for specific Aho
-
hockey_scraper.nhl.game_scraper.
get_teams_and_players
(game_json, roster, game_id)¶ Get list of players and teams for game
Parameters: - game_json – json pbp for game
- roster – players from roster html
- game_id – id for game
Returns: dict for both - players and teams
-
hockey_scraper.nhl.game_scraper.
scrape_game
(game_id, date, if_scrape_shifts)¶ This scrapes the info for the game. The pbp is automatically scraped; whether to scrape the shifts is left up to the user.
Parameters: - game_id – game to scrape
- date – ex: 2016-10-24
- if_scrape_shifts – Boolean indicating whether to also scrape shifts
Returns: DataFrame of pbp info and, if requested, a DataFrame with shift info; otherwise just None
-
hockey_scraper.nhl.game_scraper.
scrape_pbp
(game_id, date, roster, game_json, players, teams, espn_id=None, html_df=None)¶ Automatically scrapes the json and html; if the json is empty the html picks up some of the slack, and the espn xml is also scraped for coordinates.
Parameters: - game_id – json game id
- date – date of game
- roster – list of players in pre game roster
- game_json – json pbp for game
- players – dict of players
- teams – dict of teams
- espn_id – Game Id for the espn game. Only provided when live scraping
- html_df – Can provide DataFrame for html. Only done for live-scraping
Returns: DataFrame with info or None if it fails
-
hockey_scraper.nhl.game_scraper.
scrape_pbp_live
(game_id, date, roster, game_json, players, teams, espn_id=None)¶ Scrape the live pbp
Parameters: - game_id – json game id
- date – date of game
- roster – list of players in pre game roster
- game_json – json pbp for game
- players – dict of players
- teams – dict of teams
- espn_id – Game Id for the espn game. Only provided when live scraping
Returns: Tuple - pbp & status
-
hockey_scraper.nhl.game_scraper.
scrape_shifts
(game_id, players, date)¶ Scrape the Shift charts (or TOI tables)
Parameters: - game_id – json game id
- players – dict of players with numbers and id’s
- date – date of game
Returns: DataFrame with info or None if it fails
Html PBP¶
This module contains functions to scrape the Html Play by Play for any given game
-
hockey_scraper.nhl.pbp.html_pbp.
add_event_players
(event_dict, event, players, home_team)¶ Add players involved in the event to event_dict
Parameters: - event_dict – dict of parsed event stuff
- event – fixed up html
- players – dict of players and id’s
- home_team – home team
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_event_team
(event_dict, event)¶ Add event team for event
Parameters: - event_dict – dict of event info
- event – list with parsed event info
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_home_zone
(event_dict, home_team)¶ Determines the zone relative to the home team and adds it to the event.
Keep in mind that the recorded ‘ev_zone’ is the zone relative to the event team, and for blocks the NHL counts the ev_team as the blocking team (counting the shooting team would arguably be better). Therefore, when the event team is the home team the zone only gets flipped when the event is a block; for away teams it’s the opposite.
Parameters: - event_dict – dict of event info
- home_team – home team
Returns: None
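The flip rule described above can be sketched as a stand-alone function (a simplification; the real function mutates event_dict in place, and the event and zone labels here follow the ones used elsewhere in these docs):

```python
def home_zone(ev_team, ev_zone, event_type, home_team):
    """Return the zone relative to the home team per the block-flip rule."""
    flip = {'Off': 'Def', 'Def': 'Off', 'Neu': 'Neu', 'NA': 'NA'}
    if ev_team == home_team:
        # Home event team: zone is already home-relative, except for blocks,
        # where the NHL records the blocking team as the event team
        return flip[ev_zone] if event_type == 'BLOCK' else ev_zone
    # Away event team: flip the zone, except for blocks
    return ev_zone if event_type == 'BLOCK' else flip[ev_zone]
```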
-
hockey_scraper.nhl.pbp.html_pbp.
add_period
(event_dict, event)¶ Add period for event
Parameters: - event_dict – dict of event info
- event – list with parsed event info
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_score
(event_dict, event, current_score, home_team)¶ Record whether someone scored and update the current score
Parameters: - event_dict – dict of parsed event stuff
- event – event info from pbp
- current_score – current score in game
- home_team – home team for game
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_strength
(event_dict, home_players, away_players)¶ Get strength for event -> It’s home then away
Parameters: - event_dict – dict of event info
- home_players – list of players for home team
- away_players – list of players for away team
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_time
(event_dict, event)¶ Fill in time and seconds elapsed
Parameters: - event_dict – dict of parsed event stuff
- event – event info from pbp
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_type
(event_dict, event, players, home_team)¶ Add “type” for event -> either a penalty or a shot type
Parameters: - event_dict – dict of event info
- event – list with parsed event info
- players – dict of home and away players in game
- home_team – home team for game
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
add_zone
(event_dict, play_description)¶ Determine which zone the play occurred in (when one is listed) and add it to the dict
Parameters: - event_dict – dict of event info
- play_description – the zone would be included here
Returns: Off, Def, Neu, or NA
-
hockey_scraper.nhl.pbp.html_pbp.
clean_html_pbp
(html)¶ Get rid of html and format the data
Parameters: html – the requested url Returns: a list with all the info
-
hockey_scraper.nhl.pbp.html_pbp.
cur_game_status
(doc)¶ Return the game status
Parameters: doc – Html text Returns: String -> one of [‘Final’, ‘Intermission’, ‘Progress’]
-
hockey_scraper.nhl.pbp.html_pbp.
get_pbp
(game_id)¶ Given a game_id it returns the raw html Ex: http://www.nhl.com/scores/htmlreports/20162017/PL020475.HTM
Parameters: game_id – the game Returns: raw html of game
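The example URL shows how the report address is derived from the game id: the first four digits give the season’s starting year and the last six identify the game. A sketch of that construction, using the url pattern from the example above:

```python
def html_pbp_url(game_id):
    """Build the html play-by-play report URL for a game id like 2016020475."""
    gid = str(game_id)
    year = int(gid[:4])  # the season's first year, e.g. 2016 -> 20162017
    return ('http://www.nhl.com/scores/htmlreports/'
            '{}{}/PL{}.HTM'.format(year, year + 1, gid[4:]))
```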
-
hockey_scraper.nhl.pbp.html_pbp.
get_penalty
(play_description, players, home_team)¶ Get the penalty info
Parameters: - play_description – description of play field
- players – all players with info
- home_team – home team for game
Returns: penalty info
-
hockey_scraper.nhl.pbp.html_pbp.
get_player_name
(number, players, team, home_team)¶ This function is used for the description field in the html. Given a last name and a number it returns the player’s full name and id.
Parameters: - number – player’s number
- players – all players with info
- team – team of player
- home_team – home team
Returns: dict with full name and id
-
hockey_scraper.nhl.pbp.html_pbp.
get_soup
(game_html)¶ Uses BeautifulSoup to parse the html document. Some parsers work for some pages but not others (it’s unclear why), so they’re all tried in order
Parameters: game_html – html doc Returns: “soupified” html
-
hockey_scraper.nhl.pbp.html_pbp.
if_valid_event
(event)¶ Checks whether it’s an event worth parsing (‘#’ rows are meaningless, and a few other event types are skipped)
The meaning of ‘GOFF’ is unclear. ‘EGT’ stands for emergency goaltender; it’s dropped because it’s not in the json, and there’s another event, ‘EGPID’, that can be found in both sources (which makes ‘EGT’ redundant).
Events ‘PGSTR’, ‘PGEND’, and ‘ANTHEM’ have appeared at the start of each game since the 2017 season, for unclear reasons.
Parameters: event – list of stuff in pbp Returns: boolean
-
hockey_scraper.nhl.pbp.html_pbp.
parse_block
(description, players, home_team)¶ Parse the description field for a BLOCK
MTL #76 SUBBAN BLOCKED BY TOR #2 SCHENN, Wrist, Def. Zone
Parameters: - description – Play Description
- players – players in game
- home_team – Home Team for game
Returns: Dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_event
(event, players, home_team, current_score)¶ Receives an event and parses it
Parameters: - event – event type
- players – players in game
- home_team – home team
- current_score – current score for both teams
Returns: dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_fac
(description, players, ev_team, home_team)¶ Parse the description field for a face-off MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 BRENT
Parameters: - description – Play Description
- players – players in game
- ev_team – Event Team
- home_team – Home Team for game
Returns: Dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_goal
(description, players, ev_team, home_team)¶ Parse the description field for a GOAL
TOR #81 KESSEL(1), Wrist, Off. Zone, 14 ft. Assists: #42 BOZAK(1); #8 KOMISAREK(1)
Parameters: - description – Play Description
- players – players in game
- ev_team – Event Team
- home_team – Home Team for game
Returns: Dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_hit
(description, players, home_team)¶ Parse the description field for a HIT
MTL #20 O’BYRNE HIT TOR #18 BROWN, Def. Zone
Parameters: - description – Play Description
- players – players in game
- home_team – Home Team for game
Returns: Dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_html
(html, players, teams)¶ Parse html game pbp
Parameters: - html – raw html
- players – players in the game (from json pbp)
- teams – dict with home and away teams
Returns: DataFrame with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_penalty
(description, players, home_team)¶ Parse the description field for a Penalty
MTL #81 ELLER Hooking(2 min), Def. Zone Drawn By: TOR #11 SJOSTROM
Parameters: - description – Play Description
- players – players in game
- home_team – Home Team for game
Returns: Dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
parse_shot_miss_take_give
(description, players, ev_team, home_team)¶ Parse the description field for a: SHOT, MISS, TAKE, GIVE
MTL ONGOAL - #81 ELLER, Wrist, Off. Zone, 11 ft.
ANA #23 BEAUCHEMIN, Slap, Wide of Net, Off. Zone, 42 ft.
TOR GIVEAWAY - #35 GIGUERE, Def. Zone
TOR TAKEAWAY - #9 ARMSTRONG, Off. Zone
Parameters: - description – Play Description
- players – players in game
- ev_team – Event Team
- home_team – Home Team for game
Returns: Dict with info
-
hockey_scraper.nhl.pbp.html_pbp.
populate_players
(event_dict, players, away_players, home_players)¶ Populate away and home player info (and num skaters on each side)
Parameters: - event_dict – dict with event info
- players – all players in game and info
- away_players – players for away team
- home_players – players for home team
Returns: None
-
hockey_scraper.nhl.pbp.html_pbp.
return_name_html
(info)¶ In the PBP html the name is in a format like: ‘Center - MIKE RICHARDS’. Some players also have a hyphen in their name, so we can’t just split on every ‘-’
Parameters: info – position and name Returns: name
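A minimal sketch of that parsing rule, assuming the position/name separator is always a hyphen surrounded by spaces: splitting on the first ‘ - ’ only leaves hyphenated names intact.

```python
def name_from_html(info):
    """Extract the name from a 'Position - NAME' string."""
    # maxsplit=1 so hyphens inside the name itself are untouched
    return info.split(' - ', 1)[1].strip()
```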
-
hockey_scraper.nhl.pbp.html_pbp.
scrape_game
(game_id, players, teams)¶ Scrape the data for the game when not live
Parameters: - game_id – game to scrape
- players – dict with player info
- teams – dict with home and away teams
Returns: DataFrame of game info or None if it fails
-
hockey_scraper.nhl.pbp.html_pbp.
scrape_game_live
(game_id, players, teams)¶ Scrape the data for the game when it’s live
Parameters: - game_id – game to scrape
- players – dict with player info
- teams – dict with home and away teams
Returns: Tuple - get_pbp(), cur_game_status()
-
hockey_scraper.nhl.pbp.html_pbp.
scrape_pbp
(game_html, game_id, players, teams)¶ Scrape the data for the pbp
Parameters: - game_html – Html doc for the game
- game_id – game to scrape
- players – dict with player info
- teams – dict with home and away teams
Returns: DataFrame of game info or None if it fails
-
hockey_scraper.nhl.pbp.html_pbp.
shot_type
(play_description)¶ Determine the type of shot (when one is listed)
Parameters: play_description – the type would be in here Returns: the type if it’s there (otherwise just NA)
-
hockey_scraper.nhl.pbp.html_pbp.
strip_html_pbp
(td)¶ Strip html tags and such
Parameters: td – pbp Returns: list of plays (which contain a list of info) stripped of html
Json PBP¶
This module contains functions to scrape the Json Play by Play for any given game
-
hockey_scraper.nhl.pbp.json_pbp.
change_event_name
(event)¶ Change event names from json style to html (ex: BLOCKED_SHOT to BLOCK).
Parameters: event – event type Returns: fixed event type
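The renaming can be sketched as a lookup with a pass-through default. Only the BLOCKED_SHOT -> BLOCK pair comes from the docstring above; the other entries here are illustrative assumptions, not the library’s actual mapping:

```python
# Illustrative json -> html event-name map; only BLOCKED_SHOT -> BLOCK
# is confirmed by the docstring, the rest are assumed examples
EVENT_NAME_MAP = {
    'BLOCKED_SHOT': 'BLOCK',
    'MISSED_SHOT': 'MISS',
    'FACEOFF': 'FAC',
    'GIVEAWAY': 'GIVE',
    'TAKEAWAY': 'TAKE',
}

def to_html_event(event):
    # Unmapped names pass through unchanged
    return EVENT_NAME_MAP.get(event, event)
```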
-
hockey_scraper.nhl.pbp.json_pbp.
get_pbp
(game_id)¶ Given a game_id it returns the raw json Ex: http://statsapi.web.nhl.com/api/v1/game/2016020475/feed/live
Parameters: game_id – string - the game Returns: raw json of game or None if couldn’t get game
-
hockey_scraper.nhl.pbp.json_pbp.
get_teams
(pbp_json)¶ Get teams
Parameters: pbp_json – raw play by play json Returns: dict with home and away
-
hockey_scraper.nhl.pbp.json_pbp.
parse_event
(event)¶ Parses a single event when the info is in a json format
Parameters: event – json of event Returns: dictionary with the info
-
hockey_scraper.nhl.pbp.json_pbp.
parse_json
(game_json, game_id)¶ Scrape the json for a game
Parameters: - game_json – raw json
- game_id – game id for game
Returns: DataFrame with info for the game, or None if it fails
-
hockey_scraper.nhl.pbp.json_pbp.
scrape_game
(game_id)¶ Used for debugging; the html scraping depends on the json, so the normal flow can’t follow this structure
Parameters: game_id – game to scrape Returns: DataFrame of game info
Espn PBP¶
This module contains code to scrape coordinates for games off of espn for any given game
-
hockey_scraper.nhl.pbp.espn_pbp.
event_type
(play_description)¶ Returns the event type (ex: a SHOT or a GOAL…etc) given the event description
Parameters: play_description – description of play Returns: event
-
hockey_scraper.nhl.pbp.espn_pbp.
get_espn_date
(date)¶ Get the page that contains all the games for that day
Parameters: date – YYYY-MM-DD Returns: response
-
hockey_scraper.nhl.pbp.espn_pbp.
get_espn_game
(date, home_team, away_team, game_id=None)¶ Gets the ESPN pbp feed Ex: http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId=400885300
Parameters: - date – date of the game
- home_team – home team
- away_team – away team
- game_id – Game id, if we already have it - for live scraping. None if not there
Returns: raw xml
-
hockey_scraper.nhl.pbp.espn_pbp.
get_espn_game_id
(date, home_team, away_team)¶ Scrapes the day’s schedule and gets the id for the given game Ex: http://www.espn.com/nhl/scoreboard/_/date/20161024
Parameters: - date – format-> YearMonthDay-> 20161024
- home_team – home team
- away_team – away team
Returns: 9 digit game id as a string
-
hockey_scraper.nhl.pbp.espn_pbp.
get_game_ids
(response)¶ Get game_ids for date from doc
Parameters: response – doc Returns: list of game_ids
-
hockey_scraper.nhl.pbp.espn_pbp.
get_teams
(response)¶ Extract Teams for date from doc
ul-> class = ScoreCell__Competitors
div -> class = ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db
Parameters: response – doc Returns: list of teams
-
hockey_scraper.nhl.pbp.espn_pbp.
parse_espn
(espn_xml)¶ Parse feed
Parameters: espn_xml – raw xml of feed Returns: DataFrame with info
-
hockey_scraper.nhl.pbp.espn_pbp.
parse_event
(event)¶ Parse each event. In the string each field is separated by a ‘~’. Relevant for here: The first two are the x and y coordinates. And the 4th and 5th are the time elapsed and period.
Parameters: event – string with info Returns: return dict with relevant info
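Based on the field positions described above, a minimal sketch (the key names in the returned dict are assumptions, not the library’s actual output):

```python
def parse_espn_event(event):
    """Pull coordinates, elapsed time, and period from a '~'-delimited record."""
    fields = event.split('~')
    return {
        'xC': float(fields[0]),     # 1st field: x coordinate
        'yC': float(fields[1]),     # 2nd field: y coordinate
        'time_elapsed': fields[3],  # 4th field: time elapsed
        'period': int(fields[4]),   # 5th field: period
    }
```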
-
hockey_scraper.nhl.pbp.espn_pbp.
scrape_game
(date, home_team, away_team, game_id=None)¶ Scrape the game
Parameters: - date – ex: 2016-10-24
- home_team – tricode
- away_team – tricode
- game_id – Only provided for live games.
Returns: DataFrame with info
Json Shifts¶
This module contains functions to scrape the Json toi/shifts for any given game
-
hockey_scraper.nhl.shifts.json_shifts.
fix_team_tricode
(tricode)¶ Some of the tricodes come in a different form than the one used elsewhere in the package
Parameters: tricode – 3 letter team name - ex: NYR Returns: fixed tricode
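A sketch of the fix as a lookup with a pass-through default. The specific entries below are hypothetical examples; the real mapping lives in the source:

```python
# Hypothetical examples of tricode normalization, for illustration only
TRICODE_FIXES = {'L.A': 'LAK', 'N.J': 'NJD', 'S.J': 'SJS', 'T.B': 'TBL'}

def fix_tricode(tricode):
    # Unrecognized codes pass through unchanged
    return TRICODE_FIXES.get(tricode.upper(), tricode.upper())
```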
-
hockey_scraper.nhl.shifts.json_shifts.
get_shifts
(game_id)¶ Given a game_id it returns the raw json Ex: https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId=2019020001
Parameters: game_id – the game Returns: json or None
-
hockey_scraper.nhl.shifts.json_shifts.
parse_json
(shift_json, game_id)¶ Parse the json
Parameters: - shift_json – raw json
- game_id – id of game
Returns: DataFrame with info
-
hockey_scraper.nhl.shifts.json_shifts.
parse_shift
(shift)¶ Parse shift for json
Parameters: shift – json for shift Returns: dict with shift info
-
hockey_scraper.nhl.shifts.json_shifts.
scrape_game
(game_id)¶ Scrape the game.
Parameters: game_id – game Returns: DataFrame with info for the game
Html Shifts¶
This module contains functions to scrape the Html Toi Tables (or shifts) for any given game
-
hockey_scraper.nhl.shifts.html_shifts.
analyze_shifts
(shift, name, team, home_team, player_ids)¶ Analyze shifts for each player. Prior to this, each player (in a dictionary) has a list in which each entry is a shift.
Parameters: - shift – info on shift
- name – player name
- team – given team
- home_team – home team for given game
- player_ids – dict with info on players
Returns: dict with info for shift
-
hockey_scraper.nhl.shifts.html_shifts.
get_shifts
(game_id)¶ Given a game_id it returns the shifts for both teams Ex: http://www.nhl.com/scores/htmlreports/20162017/TV020971.HTM
Parameters: game_id – the game Returns: Shifts or None
-
hockey_scraper.nhl.shifts.html_shifts.
get_soup
(shifts_html)¶ Uses BeautifulSoup to parse the html document. Some parsers work for some pages but not others (it’s unclear why), so they’re all tried in order
Parameters: shifts_html – html doc Returns: “soupified” html and player_shifts portion of html (it’s a bunch of td tags)
-
hockey_scraper.nhl.shifts.html_shifts.
get_teams
(soup)¶ Return the team for the TOI tables and the home team
Parameters: soup – souped up html Returns: list with team and home team
-
hockey_scraper.nhl.shifts.html_shifts.
parse_html
(html, player_ids, game_id)¶ Parse the html
Note: Be careful modifying this; it’s not entirely clear how or why, but it works.
Parameters: - html – cleaned up html
- player_ids – dict of home and away players
- game_id – id for game
Returns: DataFrame with info
-
hockey_scraper.nhl.shifts.html_shifts.
scrape_game
(game_id, players)¶ Scrape the game.
Parameters: - game_id – id for game
- players – list of players
Returns: DataFrame with info for the game
Schedule¶
This module contains functions to scrape the json schedule for any games or date range
-
hockey_scraper.nhl.json_schedule.
chunk_schedule_calls
(from_date, to_date)¶ The schedule endpoint handles large date ranges poorly, so it’s called in increments of n days instead.
Parameters: - from_date – scrape from this date
- to_date – scrape until this date
Returns: raw json of schedule of date range
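The chunking can be sketched like this. The 30-day increment is an assumption for illustration; the real increment is set in the source:

```python
from datetime import datetime, timedelta

def date_chunks(from_date, to_date, n_days=30):
    """Yield (start, end) "yyyy-mm-dd" pairs covering the range in slices."""
    start = datetime.strptime(from_date, '%Y-%m-%d')
    end = datetime.strptime(to_date, '%Y-%m-%d')
    while start <= end:
        # Each slice spans n_days calendar days, clipped to the overall end
        stop = min(start + timedelta(days=n_days - 1), end)
        yield start.strftime('%Y-%m-%d'), stop.strftime('%Y-%m-%d')
        start = stop + timedelta(days=1)
```

Each yielded pair would then be passed to the schedule endpoint as a separate request.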
-
hockey_scraper.nhl.json_schedule.
get_dates
(games)¶ Given a list of game_ids it returns the date for each game.
We go from the beginning of the earliest season in the sample to the end of the most recent
Parameters: games – list with game_id’s ex: 2016020001 Returns: list with game_id and corresponding date for all games
-
hockey_scraper.nhl.json_schedule.
get_schedule
(date_from, date_to)¶ Scrapes games in date range Ex: https://statsapi.web.nhl.com/api/v1/schedule?startDate=2010-10-03&endDate=2011-06-20
Parameters: - date_from – scrape from this date
- date_to – scrape until this date
Returns: raw json of schedule of date range
-
hockey_scraper.nhl.json_schedule.
scrape_schedule
(date_from, date_to, preseason=False, not_over=False)¶ Calls get_schedule and scrapes the raw schedule json
Parameters: - date_from – scrape from this date
- date_to – scrape until this date
- preseason – Boolean indicating whether to include preseason games (default is False)
- not_over – Boolean indicating whether to scrape games that aren’t finished; it relaxes the requirement of checking whether the game is over.
Returns: list with all the game id’s
Playing Roster¶
This module contains functions to scrape the Html game roster for any given game
-
hockey_scraper.nhl.playing_roster.
fix_name
(player)¶ Get rid of (A) or (C) when a player has it attached to their name
Parameters: player – list of player info -> [number, position, name] Returns: fixed list
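A minimal sketch of stripping the captaincy marker with a regex, using the [number, position, name] list shape described above:

```python
import re

def fix_name(player):
    """player is [number, position, name]; drop a trailing '(A)' or '(C)'."""
    player[2] = re.sub(r'\s*\([AC]\)\s*$', '', player[2])
    return player
```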
-
hockey_scraper.nhl.playing_roster.
get_coaches
(soup)¶ scrape head coaches
Parameters: soup – html Returns: dict of coaches for game
-
hockey_scraper.nhl.playing_roster.
get_content
(roster)¶ Uses BeautifulSoup to parse the html document. Some parsers work for some pages but not others (it’s unclear why), so they’re all tried in order
Parameters: roster – doc Returns: players and coaches
-
hockey_scraper.nhl.playing_roster.
get_players
(soup)¶ scrape roster for players
Parameters: soup – html Returns: dict for home and away players
-
hockey_scraper.nhl.playing_roster.
get_roster
(game_id)¶ Given a game_id it returns the raw html Ex: http://www.nhl.com/scores/htmlreports/20162017/RO020475.HTM
Parameters: game_id – the game Returns: raw html of game
-
hockey_scraper.nhl.playing_roster.
scrape_roster
(game_id)¶ For a given game scrapes the roster
Parameters: game_id – id for game Returns: dict of players (home and away) and a dict for both head coaches
Save Pages¶
Saves the scraped docs so you don’t have to re-scrape them every time you want to parse the docs.
**** Don’t modify this module unless you know what you’re doing ****
-
hockey_scraper.utils.save_pages.
check_file_exists
(file_info)¶ Checks if the file exists. Also checks whether the directory structure for holding scraped files exists; if not, it creates it.
Parameters: file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in. Returns: Boolean - True if it exists
-
hockey_scraper.utils.save_pages.
create_file_path
(file_info)¶ Creates the file path for a given file
Parameters: file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in. Returns: path
-
hockey_scraper.utils.save_pages.
create_season_dirs
(file_info)¶ Creates the infrastructure to hold all the scraped docs for a season
Parameters: file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in. Returns: None
-
hockey_scraper.utils.save_pages.
get_page
(file_info)¶ Get the file so we don’t need to re-scrape
Parameters: file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in. Returns: Response or None
-
hockey_scraper.utils.save_pages.
save_page
(page, file_info)¶ Save the page we just scraped.
Note: The page only gets saved if the directory already exists. The path isn’t validated or created for you, so make sure it’s correct.
Parameters: - page – File scraped
- file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.
Returns: None