NHL Scraping Functions

Scraping

There are three ways to scrape games, plus a function for scraping the schedule:

1. Scrape by Season:

Scrape games on a season-by-season level. (Note: a given season is referred to by the first of the two years it spans, so the 2016-2017 season is referred to as 2016.)

import hockey_scraper

 # Scrapes the 2015 & 2016 seasons with shifts and stores the data in a Csv file (both are equivalent!!!)
 hockey_scraper.scrape_seasons([2015, 2016], True)
 hockey_scraper.scrape_seasons([2015, 2016], True, data_format='Csv')

 # Scrapes the 2008 season without shifts and returns a dictionary with the DataFrame
 scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')

 # Scrapes 2014 season without shifts including preseason games
 hockey_scraper.scrape_seasons([2014], False, preseason=True)

2. Scrape by Game:

Scrape a list of games provided. All game IDs can be found using this link (you need to play around with the dates in the url).

import hockey_scraper

# Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)

# Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a dictionary with the DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')

3. Scrape by Date Range:

Scrape all games between a specified date range. All dates must be written in a “yyyy-mm-dd” format.

import hockey_scraper

# Scrapes all games in the date range without shifts and stores the data in a Csv file (both are equivalent!!!)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False, preseason=False)

# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a dictionary with the DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')

# Scrapes all games from 2014-09-15 to 2014-11-01 with shifts including preseason games
hockey_scraper.scrape_date_range('2014-09-15', '2014-11-01', True, preseason=True)

4. Scrape Schedule:

Scrape the schedule between any given date range for past and future games. All dates must be written in a "yyyy-mm-dd" format. The default data_format is 'Pandas'. Unlike the other functions, this returns a DataFrame rather than a dictionary. The columns returned are: ['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']

import hockey_scraper

sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")
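
Since the schedule comes back as a regular DataFrame with the columns listed above, it can be filtered like any other. A small sketch (the 'Final' status label for completed games is an assumption, not something the docs confirm):

# All completed home games for Toronto in the scraped range
tor_home = sched_df[(sched_df['home_team'] == 'TOR') & (sched_df['status'] == 'Final')]
print(tor_home[['game_id', 'date', 'away_team', 'home_score', 'away_score']])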

Persistent Data

The option also exists to save the scraped files in another directory. This speeds up re-scraping any games, since we already have the docs needed for them. It's also useful if you want to grab extra information from the files, as some of them contain a lot more than what is parsed. To do this, use the 'docs_dir' keyword. You can pass the boolean value True to create (or refer to an already created) directory in the home directory called hockey_scraper_data. Or you can pass the path to a directory as a string. If this is a valid directory, then when scraping each page the scraper first checks whether it was already scraped (saving us the time of scraping it again). If it hasn't been scraped yet, it grabs the page from the source and saves it in the given directory.

Sometimes you may have already scraped and saved a file but you want to re-scrape it from the source and save it again (this may seem strange but the NHL frequently fixes mistakes so you may want to update what you have). This can be done by setting the keyword argument rescrape equal to True.

import hockey_scraper

# Path to the given directory
# Can also be True if you want the scraper to take care of it
USER_PATH = "/...."

# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
# Also includes a path for an existing directory for the scraped files to be placed in or retrieved from.
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)

# One could choose to re-scrape previously saved files by setting the keyword argument rescrape=True
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)

Additional Notes:

1. For all three functions you must specify, with a boolean, whether you also want to scrape shifts (TOI tables). The Play by Play is automatically scraped.

2. When scraping by date range or by season, preseason games aren't scraped unless otherwise specified. Preseason games are also scraped at your own risk; there is no guarantee it will work or that the files are even there!!!

3. For all three functions the scraped data is deposited into a Csv file unless it’s specified to return the DataFrames

4. The dictionary with the DataFrames (and scraping errors) returned by setting data_format='Pandas' is structured like below (a usage sketch follows these notes):

{
   # Both of these are always included
   'pbp': pbp_df,

   # This is only included when the argument 'if_scrape_shifts' is set equal to True
   'shifts': shifts_df
 }

5. When including a directory, it must be a valid directory; it will not be created for you. You'll get an error message, but otherwise it will scrape as if no directory was provided.
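
For example, a hedged usage sketch of the structure from note 4, unpacking both DataFrames:

import hockey_scraper

scraped_data = hockey_scraper.scrape_games([2016020001], True, data_format='Pandas')

pbp_df = scraped_data['pbp']        # Always included
shifts_df = scraped_data['shifts']  # Only included because shifts were requested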

Scrape Functions

Functions to scrape by season, games, and date range

hockey_scraper.nhl.scrape_functions.print_errors(detailed=True)

Print errors with scraping.

The detailed parameter controls whether certain errors are re-printed after all games have been scraped. For example, if the pbp for a game is broken, that is always printed immediately after the game. But a summary of broken games is also printed at the end when over 25 games are scraped. The logic is that when you've scraped a lot of games, it's easier to see all the errors together at the end than to scroll through all the output and potentially miss them.

Parameters:detailed – When False, only print player IDs; otherwise print all errors
Returns:None
hockey_scraper.nhl.scrape_functions.scrape_date_range(from_date, to_date, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False, verbose=False)

Scrape games in given date range

Parameters:
  • from_date – date you want to scrape from
  • to_date – date you want to scrape to
  • if_scrape_shifts – Boolean indicating whether to also scrape shifts
  • data_format – format you want data in - csv or pandas (csv is default)
  • preseason – Boolean indicating whether to include preseason games (default is False). This may or may not work; preseason files are scraped at your own risk!!!
  • rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default is False)
  • docs_dir – Directory that either contains previously scraped docs or one that you want them to be deposited in after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that. Here it must be a valid directory otherwise it won’t work (I won’t make it for you). When False the files won’t be saved.
  • verbose – Override default verbosity when printing errors

Returns:

Dictionary with DataFrames and errors or None

hockey_scraper.nhl.scrape_functions.scrape_games(games, if_scrape_shifts, data_format='csv', rescrape=False, docs_dir=False, verbose=False)

Scrape a list of games

Parameters:
  • games – list of game_ids
  • if_scrape_shifts – Boolean indicating whether to also scrape shifts
  • data_format – format you want data in - csv or pandas (csv is default)
  • rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
  • docs_dir – Directory that either contains previously scraped docs or one that you want them to be deposited in after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that. Here it must be a valid directory otherwise it won’t work (I won’t make it for you). When False the files won’t be saved.
  • verbose – Override default verbosity when printing errors

Returns:

Dictionary with DataFrames and errors or None

hockey_scraper.nhl.scrape_functions.scrape_list_of_games(games, if_scrape_shifts, verbose=False)

Given a list of game_id’s (and a date for each game) it scrapes them

Parameters:
  • games – list of [game_id, date]
  • if_scrape_shifts – Boolean indicating whether to also scrape shifts
  • verbose – Verbosity when printing errors. Defaults to False

Returns:

DataFrame of pbp info, also shifts if specified

hockey_scraper.nhl.scrape_functions.scrape_schedule(from_date, to_date, data_format='pandas', rescrape=False, docs_dir=False)

Scrape the games schedule in a given range.

Parameters:
  • from_date – date you want to scrape from
  • to_date – date you want to scrape to
  • data_format – format you want data in - csv or pandas (pandas is default)
  • rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir. (default is False)
  • docs_dir – Directory that either contains previously scraped docs or one that you want them to be deposited in after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that. Here it must be a valid directory otherwise it won’t work (I won’t make it for you). When False the files won’t be saved.
Returns:

DataFrame or None

hockey_scraper.nhl.scrape_functions.scrape_seasons(seasons, if_scrape_shifts, data_format='csv', preseason=False, rescrape=False, docs_dir=False, verbose=False)

Given a list of seasons, scrape all of them

Parameters:
  • seasons – list of seasons
  • if_scrape_shifts – Boolean indicating whether to also scrape shifts
  • data_format – format you want data in - csv or pandas (csv is default)
  • preseason – Boolean indicating whether to include preseason games (default is False). This may or may not work; preseason files are scraped at your own risk!!!
  • rescrape – If you want to rescrape pages already scraped. Only applies if you supply a docs dir.
  • docs_dir – Directory that either contains previously scraped docs or one that you want them to be deposited in after scraping. When True it’ll refer to (or if needed create) such a repository in the home directory. When provided a string it’ll try to use that. Here it must be a valid directory otherwise it won’t work (I won’t make it for you). When False the files won’t be saved.
  • verbose – Override default verbosity when printing errors

Returns:

Dictionary with DataFrames and errors or None

Game Scraper

This module contains code to scrape data for a single game

hockey_scraper.nhl.game_scraper.check_goalie(row)

Checks for bad goalie names (you can tell by them having no player id)

Parameters:row – df row
Returns:None
hockey_scraper.nhl.game_scraper.combine_espn_html_pbp(html_df, espn_df, game_id, date, away_team, home_team)

Merge the coordinate from the espn feed into the html DataFrame

Can’t join here on event_id because the plays are often out of order and pre-2009 are often missing events.

Parameters:
  • html_df – DataFrame with info from html pbp
  • espn_df – DataFrame with info from espn pbp
  • game_id – json game id
  • date – ex: 2016-10-24
  • away_team – away team
  • home_team – home team
Returns:

merged DataFrame

hockey_scraper.nhl.game_scraper.combine_html_json_pbp(json_df, html_df, game_id, date)

Join both data sources. First try merging on event id (which is the DataFrame index) if both DataFrames have the same number of rows. If they don't have the same number of rows, merge on: Period, Event, Seconds_Elapsed, p1_ID.

Parameters:
  • json_df – json pbp DataFrame
  • html_df – html pbp DataFrame
  • game_id – id of game
  • date – date of game
Returns:

finished pbp
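
A hedged pandas sketch of the merge strategy described above (illustrative only; the coordinate columns xC/yC are an assumption about what the json side contributes):

import pandas as pd

def merge_pbp(json_df, html_df):
    # When the row counts match, event id (the index) lines up 1:1
    if len(json_df) == len(html_df):
        return html_df.join(json_df[['xC', 'yC']], how='left')
    # Otherwise fall back to the event-identifying columns
    keys = ['Period', 'Event', 'Seconds_Elapsed', 'p1_ID']
    return pd.merge(html_df, json_df, on=keys, how='left')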

hockey_scraper.nhl.game_scraper.combine_players_lists(json_players, roster_players, game_id)

Combine the json list of players (which contains id’s) with the list in the roster html

Parameters:
  • json_players – dict of all players with id’s
  • roster_players – dict with home and away keys for players
  • game_id – id of game
Returns:

dict containing home and away keys -> which contains list of info on each player

hockey_scraper.nhl.game_scraper.get_players_json(game_json)

Return dict of players for that game by team

Parameters:game_json – players section of json
Returns:{team -> players}
hockey_scraper.nhl.game_scraper.get_teams_and_players(game_json, roster, game_id)

Get list of players and teams for game

Parameters:
  • game_json – json pbp for game
  • roster – players from roster html
  • game_id – id for game
Returns:

dict for both - players and teams

hockey_scraper.nhl.game_scraper.scrape_game(game_id, date, if_scrape_shifts)

This scrapes the info for the game. The pbp is automatically scraped, and whether or not to scrape the shifts is left up to the user.

Parameters:
  • game_id – game to scrape
  • date – ex: 2016-10-24
  • if_scrape_shifts – Boolean indicating whether to also scrape shifts
Returns:

DataFrame of pbp info and (optionally) a DataFrame with shift info, otherwise just None

hockey_scraper.nhl.game_scraper.scrape_pbp(game_id, date, roster, game_json, players, teams, espn_id=None, html_df=None)

Scrape the Pbp info. The HTML is always scraped.

The Json is parsed when the season >= 2010 and there are plays. Otherwise ESPN is used to supplement the HTML with coordinates.

The espn_id and the html data can be fed as keyword arguments to speed up execution. This is used by the live game scraping class.

Parameters:
  • game_id – json game id
  • date – date of game
  • roster – list of players in pre game roster
  • game_json – json pbp for game
  • players – dict of players
  • teams – dict of teams
  • espn_id – Game Id for the espn game. Only provided when live scraping
  • html_df – Can provide DataFrame for html. Only done for live-scraping
Returns:

DataFrame with info or None if it fails

hockey_scraper.nhl.game_scraper.scrape_pbp_live(game_id, date, roster, game_json, players, teams, espn_id=None)

Wrapper for scraping the live pbp

Parameters:
  • game_id – json game id
  • date – date of game
  • roster – list of players in pre game roster
  • game_json – json pbp for game
  • players – dict of players
  • teams – dict of teams
  • espn_id – Game Id for the espn game. Only provided when live scraping
Returns:

Tuple - pbp & status

hockey_scraper.nhl.game_scraper.scrape_shifts(game_id, players, date)

Scrape the Shift charts (or TOI tables)

Parameters:
  • game_id – json game id
  • players – dict of players with numbers and id’s
  • date – date of game
Returns:

DataFrame with info or None if it fails

Html PBP

This module contains functions to scrape the Html Play by Play for any given game

hockey_scraper.nhl.pbp.html_pbp.add_event_players(event_dict, event, players, home_team)

Add players involved in the event to event_dict

Parameters:
  • event_dict – dict of parsed event stuff
  • event – fixed up html
  • players – dict of players and id’s
  • home_team – home team
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.add_event_team(event_dict, event)

Add event team for event.

Always first thing in description

Parameters:
  • event_dict – dict of event info
  • event – list with parsed event info
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.add_home_zone(event_dict, home_team)

Determines the zone relative to the home team and add it to event.

Keep in mind that the ‘ev_zone’ recorded is the zone relative to the event team. And for blocks the NHL counts the ev_team as the blocking team (I prefer counting the shooting team for blocks). Therefore, when it's the home team, the zone only gets flipped when it's a block. For away teams it's the opposite.

Parameters:
  • event_dict – dict of event info
  • home_team – home team
Returns:

None
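
The flipping rule above can be summarized in a short sketch (illustrative only; home_zone and its arguments are hypothetical, not the package's actual signature):

ZONE_FLIP = {'Off': 'Def', 'Def': 'Off', 'Neu': 'Neu'}

def home_zone(ev_zone, ev_team, home_team, event_type):
    if ev_team == home_team:
        # Home team: flip only for blocks (the NHL records the blocker as ev_team)
        return ZONE_FLIP.get(ev_zone, ev_zone) if event_type == 'BLOCK' else ev_zone
    # Away team: the opposite - flip unless it's a block
    return ev_zone if event_type == 'BLOCK' else ZONE_FLIP.get(ev_zone, ev_zone)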

hockey_scraper.nhl.pbp.html_pbp.add_period(event_dict, event)

Add period for event

Parameters:
  • event_dict – dict of event info
  • event – list with parsed event info
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.add_score(event_dict, event, current_score, home_team)

Record whether someone scored and update the current score

Parameters:
  • event_dict – dict of parsed event stuff
  • event – event info from pbp
  • current_score – current score in game
  • home_team – home team for game
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.add_strength(event_dict, home_players, away_players)

Get strength for event -> It’s home then away

Parameters:
  • event_dict – dict of event info
  • home_players – list of players for home team
  • away_players – list of players for away team
Returns:

None
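
As a minimal sketch of that home-then-away convention (hypothetical helper, not the package's code):

def strength(home_players, away_players):
    # e.g. 6 home players and 5 away players (goalies included) -> '6x5'
    return '{}x{}'.format(len(home_players), len(away_players))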

hockey_scraper.nhl.pbp.html_pbp.add_time(event_dict, event)

Fill in time and seconds elapsed

Parameters:
  • event_dict – dict of parsed event stuff
  • event – event info from pbp
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.add_type(event_dict, event, players, home_team)

Add “type” for event -> either a penalty or a shot type

Parameters:
  • event_dict – dict of event info
  • event – list with parsed event info
  • players – dict of home and away players in game
  • home_team – home team for game
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.add_zone(event_dict, play_description)

Determine which zone the play occurred in (unless one isn’t listed) and add it to dict

Parameters:
  • event_dict – dict of event info
  • play_description – the zone would be included here
Returns:

Off, Def, Neu, or NA

hockey_scraper.nhl.pbp.html_pbp.clean_html_pbp(html)

Get rid of html and format the data

Parameters:html – the requested url
Returns:a list with all the info
hockey_scraper.nhl.pbp.html_pbp.cur_game_status(doc)

Return the game status

Parameters:doc – Html text
Returns:String -> one of [‘Final’, ‘Intermission’, ‘Progress’]
hockey_scraper.nhl.pbp.html_pbp.get_contents(game_html)

Uses BeautifulSoup to parse the html document. Some parsers work for some pages but don't work for others… I'm not sure why, so I just try them all here in order

Parameters:game_html – html doc
Returns:“soupified” html
hockey_scraper.nhl.pbp.html_pbp.get_pbp(game_id)

Given a game_id it returns the raw html Ex: http://www.nhl.com/scores/htmlreports/20162017/PL020475.HTM

Parameters:game_id – the game
Returns:raw html of game
hockey_scraper.nhl.pbp.html_pbp.get_penalty(play_description, players, home_team)

Get the penalty info

Parameters:
  • play_description – description of play field
  • players – all players with info
  • home_team – home team for game
Returns:

penalty info

hockey_scraper.nhl.pbp.html_pbp.get_player_name(number, players, team, home_team)

This function is used for the description field in the html. Given a last name and a number it returns the player's full name and id. Done by searching in players for the team until we find him (then just break)

Parameters:
  • number – player’s number
  • players – all players with info
  • team – team of player listed in html
  • home_team – home team defined beforehand (from json)
Returns:

dict with full name and id

hockey_scraper.nhl.pbp.html_pbp.if_valid_event(event)

Checks if it's a valid event to parse ('#' is meaningless, and I don't like those other ones)

Don't remember what 'GOFF' is, but 'EGT' is for emergency goaltender. The reason I get rid of it is that it's not in the json, and there's another 'EGPID' that can be found in both (not sure why 'EGT' exists then).

Events ‘PGSTR’, ‘PGEND’, and ‘ANTHEM’ have been included at the start of each game for the 2017 season…I have no idea why.

Parameters:event – list of stuff in pbp
Returns:boolean
hockey_scraper.nhl.pbp.html_pbp.parse_block(description, players, home_team)

Parse the description field for a BLOCK

MTL #76 SUBBAN BLOCKED BY TOR #2 SCHENN, Wrist, Def. Zone

Parameters:
  • description – Play Description
  • players – players in game
  • home_team – Home Team for game
Returns:

Dict with info
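
Given the sample description above, one hedged way to pull out the two players with a regex (illustrative; the package's actual parsing may differ):

import re

BLOCK_RE = re.compile(r"(?P<shooter_team>\w+) #(?P<shooter_num>\d+) (?P<shooter>[A-Z'.-]+) "
                      r"BLOCKED BY (?P<blocker_team>\w+) #(?P<blocker_num>\d+) (?P<blocker>[A-Z'.-]+)")

match = BLOCK_RE.search("MTL #76 SUBBAN BLOCKED BY TOR #2 SCHENN, Wrist, Def. Zone")
print(match.groupdict())
# {'shooter_team': 'MTL', 'shooter_num': '76', 'shooter': 'SUBBAN',
#  'blocker_team': 'TOR', 'blocker_num': '2', 'blocker': 'SCHENN'}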

hockey_scraper.nhl.pbp.html_pbp.parse_event(event, players, home_team, current_score)

Receives an event and parses it

Parameters:
  • event – event type
  • players – players in game
  • home_team – home team
  • current_score – current score for both teams
Returns:

dict with info

hockey_scraper.nhl.pbp.html_pbp.parse_fac(description, players, ev_team, home_team)

Parse the description field for a face-off

MTL won Neu. Zone - MTL #11 GOMEZ vs TOR #37 BRENT

Parameters:
  • description – Play Description
  • players – players in game
  • ev_team – Event Team
  • home_team – Home Team for game
Returns:

Dict with info

hockey_scraper.nhl.pbp.html_pbp.parse_goal(description, players, ev_team, home_team)

Parse the description field for a GOAL

TOR #81 KESSEL(1), Wrist, Off. Zone, 14 ft. Assists: #42 BOZAK(1); #8 KOMISAREK(1)

Parameters:
  • description – Play Description
  • players – players in game
  • ev_team – Event Team
  • home_team – Home Team for game
Returns:

Dict with info

hockey_scraper.nhl.pbp.html_pbp.parse_hit(description, players, home_team)

Parse the description field for a HIT

MTL #20 O’BYRNE HIT TOR #18 BROWN, Def. Zone

Parameters:
  • description – Play Description
  • players – players in game
  • home_team – Home Team for game
Returns:

Dict with info

hockey_scraper.nhl.pbp.html_pbp.parse_html(html, players, teams)

Parse html game pbp

Parameters:
  • html – raw html
  • players – players in the game (from json pbp)
  • teams – dict with home and away teams
Returns:

DataFrame with info

hockey_scraper.nhl.pbp.html_pbp.parse_penalty(description, players, home_team)

Parse the description field for a Penalty

MTL #81 ELLER Hooking(2 min), Def. Zone Drawn By: TOR #11 SJOSTROM

Parameters:
  • description – Play Description
  • players – players in game
  • home_team – Home Team for game
Returns:

Dict with info

hockey_scraper.nhl.pbp.html_pbp.parse_shot_miss_take_give(description, players, ev_team, home_team)

Parse the description field for a: SHOT, MISS, TAKE, GIVE

MTL ONGOAL - #81 ELLER, Wrist, Off. Zone, 11 ft. ANA #23 BEAUCHEMIN, Slap, Wide of Net, Off. Zone, 42 ft. TOR GIVEAWAY - #35 GIGUERE, Def. Zone TOR TAKEAWAY - #9 ARMSTRONG, Off. Zone

Parameters:
  • description – Play Description
  • players – players in game
  • ev_team – Event Team
  • home_team – Home Team for game
Returns:

Dict with info

hockey_scraper.nhl.pbp.html_pbp.populate_players(event_dict, players, away_players, home_players)

Populate away and home player info (and num skaters on each side).

These include:
  1. HomePlayer & AwayPlayers fields from 1-6 for name/id
  2. Home & Away Goalie Fields for name/id
Parameters:
  • event_dict – dict with event info
  • players – all players in game and info
  • away_players – players for away team
  • home_players – players for home team
Returns:

None

hockey_scraper.nhl.pbp.html_pbp.return_name_html(info)

In the PBP html the name is in a format like: 'Center - MIKE RICHARDS'. Some players also have a hyphen in their last name, so we can't just split by '-'

Parameters:info – position and name
Returns:name
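
Since a last name can itself contain a hyphen, a sketch would split only once on the ' - ' delimiter (hypothetical helper, not the package's code):

def name_from_html(info):
    # 'Center - MIKE RICHARDS' -> 'MIKE RICHARDS'
    # Splitting once on ' - ' (with spaces) keeps hyphenated names intact
    return info.split(' - ', 1)[1].strip()
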
hockey_scraper.nhl.pbp.html_pbp.scrape_game(game_id, players, teams)

Scrape the data for the game when not live

Parameters:
  • game_id – game to scrape
  • players – dict with player info
  • teams – dict with home and away teams
Returns:

DataFrame of game info or None if it fails

hockey_scraper.nhl.pbp.html_pbp.scrape_game_live(game_id, players, teams)

Scrape the data for the game when it’s live

Parameters:
  • game_id – game to scrape
  • players – dict with player info
  • teams – dict with home and away teams
Returns:

Tuple - get_pbp(), cur_game_status()

hockey_scraper.nhl.pbp.html_pbp.scrape_pbp(game_html, game_id, players, teams)

Scrape the data for the pbp

Parameters:
  • game_html – Html doc for the game
  • game_id – game to scrape
  • players – dict with player info
  • teams – dict with home and away teams
Returns:

DataFrame of game info or None if it fails

hockey_scraper.nhl.pbp.html_pbp.shot_type(play_description)

Determine the type of shot (unless one isn't listed)

Parameters:play_description – the type would be in here
Returns:the type if it’s there (otherwise just NA)
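
A hedged sketch of that lookup (the exact set of shot types checked is an assumption):

SHOT_TYPES = ['WRIST', 'SLAP', 'SNAP', 'BACKHAND', 'TIP-IN', 'DEFLECTED', 'WRAP-AROUND']

def shot_type_from_description(play_description):
    desc = play_description.upper()
    for shot in SHOT_TYPES:
        if shot in desc:
            return shot
    return 'NA'  # No type listed
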
hockey_scraper.nhl.pbp.html_pbp.strip_html_pbp(td)

Strip html tags and such. (Note to Self: Don’t touch this!!!)

Parameters:td – pbp
Returns:list of plays (which contain a list of info) stripped of html

Json PBP

This module contains functions to scrape the Json Play by Play for any given game

hockey_scraper.nhl.pbp.json_pbp.change_event_name(event)

Change event names from json style to html (ex: BLOCKED_SHOT to BLOCK).

Parameters:event – event type
Returns:fixed event type
hockey_scraper.nhl.pbp.json_pbp.get_pbp(game_id)

Given a game_id it returns the raw json Ex: http://statsapi.web.nhl.com/api/v1/game/2016020475/feed/live

Parameters:game_id – string - the game
Returns:raw json of game or None if couldn’t get game
hockey_scraper.nhl.pbp.json_pbp.get_teams(pbp_json)

Get teams

Parameters:pbp_json – raw play by play json
Returns:dict with home and away
hockey_scraper.nhl.pbp.json_pbp.parse_event(event)

Parses a single event when the info is in a json format

Parameters:event – json of event
Returns:dictionary with the info
hockey_scraper.nhl.pbp.json_pbp.parse_json(game_json, game_id)

Scrape the json for a game

Parameters:
  • game_json – raw json
  • game_id – game id for game
Returns:

Either a DataFrame with info for the game or None when fail

hockey_scraper.nhl.pbp.json_pbp.scrape_game(game_id)

Used for debugging

HTML depends on json so can’t follow this structure

Parameters:game_id – game to scrape
Returns:DataFrame of game info

Espn PBP

This module contains code to scrape coordinates for games off of espn for any given game

hockey_scraper.nhl.pbp.espn_pbp.event_type(play_description)

Returns the event type (ex: a SHOT or a GOAL…etc) given the event description.

Parameters:play_description – description of play
Returns:event
hockey_scraper.nhl.pbp.espn_pbp.get_espn_date(date)

Get the page that contains all the games for that day

Parameters:date – YYYY-MM-DD
Returns:response
hockey_scraper.nhl.pbp.espn_pbp.get_espn_game(date, home_team, away_team, game_id=None)

Gets the ESPN pbp feed Ex: http://www.espn.com/nhl/gamecast/data/masterFeed?lang=en&isAll=true&gameId=400885300

Parameters:
  • date – date of the game
  • home_team – home team
  • away_team – away team
  • game_id – Game id if we already have it - for live scraping. None if not there
Returns:

raw xml

hockey_scraper.nhl.pbp.espn_pbp.get_espn_game_id(date, home_team, away_team)

Scrapes the day’s schedule and gets the id for the given game Ex: http://www.espn.com/nhl/scoreboard/_/date/20161024

Parameters:
  • date – format-> YearMonthDay-> 20161024
  • home_team – home team
  • away_team – away team
Returns:

9 digit game id as a string

hockey_scraper.nhl.pbp.espn_pbp.get_game_ids(response)

Get game_ids for date from doc

Parameters:response – doc
Returns:list of game_ids
hockey_scraper.nhl.pbp.espn_pbp.get_teams(response)

Extract Teams for date from doc

ul-> class = ScoreCell__Competitors

div -> class = ScoreCell__TeamName ScoreCell__TeamName–shortDisplayName truncate db

Parameters:response – doc
Returns:list of teams
hockey_scraper.nhl.pbp.espn_pbp.parse_espn(espn_xml)

Parse feed

Parameters:espn_xml – raw xml of feed
Returns:DataFrame with info
hockey_scraper.nhl.pbp.espn_pbp.parse_event(event)

Parse each event. In the string each field is separated by a '~'. Relevant here: the first two are the x and y coordinates, and the 4th and 5th are the time elapsed and period.

Parameters:event – string with info
Returns:dict with relevant info
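
A minimal sketch of splitting that '~'-delimited string per the field positions above (the key names are hypothetical):

def parse_espn_event(event):
    fields = event.split('~')
    return {
        'xC': float(fields[0]),     # 1st field: x coordinate
        'yC': float(fields[1]),     # 2nd field: y coordinate
        'time_elapsed': fields[3],  # 4th field
        'period': fields[4],        # 5th field
    }
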
hockey_scraper.nhl.pbp.espn_pbp.scrape_game(date, home_team, away_team, game_id=None)

Scrape the game

Parameters:
  • date – ex: 2016-10-24
  • home_team – tricode
  • away_team – tricode
  • game_id – Only provided for live games.
Returns:

DataFrame with info

Json Shifts

This module contains functions to scrape the Json toi/shifts for any given game

hockey_scraper.nhl.shifts.json_shifts.fix_team_tricode(tricode)

Some of the tricodes are different than how I want them

Parameters:tricode – 3 letter team name - ex: NYR
Returns:fixed tricode
hockey_scraper.nhl.shifts.json_shifts.get_shifts(game_id)

Given a game_id it returns the raw json Ex: https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId=2019020001

Parameters:game_id – the game
Returns:json or None
hockey_scraper.nhl.shifts.json_shifts.parse_json(shift_json, game_id)

Parse the json

Parameters:
  • shift_json – raw json
  • game_id – id of game
Returns:

DataFrame with info

hockey_scraper.nhl.shifts.json_shifts.parse_shift(shift)

Parse shift for json

Parameters:shift – json for shift
Returns:dict with shift info
hockey_scraper.nhl.shifts.json_shifts.scrape_game(game_id)

Scrape the game.

Parameters:game_id – game
Returns:DataFrame with info for the game

Html Shifts

This module contains functions to scrape the Html Toi Tables (or shifts) for any given game

hockey_scraper.nhl.shifts.html_shifts.analyze_shifts(shift, name, team, home_team, player_ids)

Analyze shifts for each player. Prior to this, each player (in a dictionary) has a list with each entry being a shift.

Parameters:
  • shift – info on shift
  • name – player name
  • team – given team
  • home_team – home team for given game
  • player_ids – dict with info on players
Returns:

dict with info for shift

hockey_scraper.nhl.shifts.html_shifts.get_shifts(game_id)

Given a game_id it returns the shifts for both teams Ex: http://www.nhl.com/scores/htmlreports/20162017/TV020971.HTM

Parameters:game_id – the game
Returns:Shifts or None
hockey_scraper.nhl.shifts.html_shifts.get_soup(shifts_html)

Uses BeautifulSoup to parse the html document. Some parsers work for some pages but don't work for others… I'm not sure why, so I just try them all here in order

Parameters:shifts_html – html doc
Returns:“soupified” html and player_shifts portion of html (it’s a bunch of td tags)
hockey_scraper.nhl.shifts.html_shifts.get_teams(soup)

Return the team for the TOI tables and the home team

Parameters:soup – souped up html
Returns:list with team and home team
hockey_scraper.nhl.shifts.html_shifts.parse_html(html, player_ids, game_id)

Parse the html

Note: Don't modify this!!! I'm not exactly sure how or why, but it works.

Parameters:
  • html – cleaned up html
  • player_ids – dict of home and away players
  • game_id – id for game
Returns:

DataFrame with info

hockey_scraper.nhl.shifts.html_shifts.scrape_game(game_id, players)

Scrape the game.

Parameters:
  • game_id – id for game
  • players – list of players
Returns:

DataFrame with info for the game

Schedule

This module contains functions to scrape the json schedule for any games or date range

hockey_scraper.nhl.json_schedule.chunk_schedule_calls(from_date, to_date)

The schedule endpoint handles big date ranges poorly, so instead I call it in increments of n days.

Parameters:
  • from_date – scrape from this date
  • to_date – scrape until this date
Returns:

raw json of schedule of date range
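
A minimal sketch of that chunking strategy (the increment size n is an assumption; the package may use a different value):

from datetime import datetime, timedelta

def chunk_date_range(from_date, to_date, n=30):
    # Yield (start, end) date pairs covering the range in n-day increments
    start = datetime.strptime(from_date, '%Y-%m-%d')
    end = datetime.strptime(to_date, '%Y-%m-%d')
    while start <= end:
        chunk_end = min(start + timedelta(days=n - 1), end)
        yield start.strftime('%Y-%m-%d'), chunk_end.strftime('%Y-%m-%d')
        start = chunk_end + timedelta(days=1)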

hockey_scraper.nhl.json_schedule.get_dates(games)

Given a list game_ids it returns the dates for each game.

We sort all the games and retrieve the schedule from the beginning of the earliest game's season until the end of the most recent game's season.

Parameters:games – list with game_id’s ex: 2016020001
Returns:list with game_id and corresponding date for all games
hockey_scraper.nhl.json_schedule.get_schedule(date_from, date_to)

Scrapes games in date range Ex: https://statsapi.web.nhl.com/api/v1/schedule?startDate=2010-10-03&endDate=2011-06-20

Parameters:
  • date_from – scrape from this date
  • date_to – scrape until this date
Returns:

raw json of schedule of date range

hockey_scraper.nhl.json_schedule.scrape_schedule(date_from, date_to, preseason=False, not_over=False)

Calls get_schedule and scrapes the raw schedule Json

Parameters:
  • date_from – scrape from this date
  • date_to – scrape until this date
  • preseason – Boolean indicating whether to include preseason games (default is False)
  • not_over – Boolean indicating whether to scrape games that aren't finished. This relaxes the requirement of checking if the game is over.
Returns:

list with all the game id’s

Playing Roster

This module contains functions to scrape the Html game roster for any given game

hockey_scraper.nhl.playing_roster.fix_name(player)

Get rid of (A) or (C) when a player has it attached to their name

Parameters:player – list of player info -> [number, position, name]
Returns:fixed list
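
A hedged sketch of stripping the captaincy marker (illustrative; the sample row is hypothetical):

import re

def strip_captaincy(player):
    # player is [number, position, name]; drop a trailing ' (A)' or ' (C)'
    player = list(player)
    player[2] = re.sub(r'\s*\([AC]\)\s*$', '', player[2])
    return player

print(strip_captaincy(['11', 'C', 'BRIAN GIONTA (C)']))  # ['11', 'C', 'BRIAN GIONTA']
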
hockey_scraper.nhl.playing_roster.get_coaches(soup)

scrape head coaches

Parameters:soup – html
Returns:dict of coaches for game
hockey_scraper.nhl.playing_roster.get_content(roster)

Uses BeautifulSoup to parse the html document. Some parsers work for some pages but don't work for others… I'm not sure why, so I just try them all here in order

Parameters:roster – doc
Returns:players and coaches
hockey_scraper.nhl.playing_roster.get_players(soup)

scrape roster for players

Parameters:soup – html
Returns:dict for home and away players
hockey_scraper.nhl.playing_roster.get_roster(game_id)

Given a game_id it returns the raw html Ex: http://www.nhl.com/scores/htmlreports/20162017/RO020475.HTM

Parameters:game_id – the game
Returns:raw html of game
hockey_scraper.nhl.playing_roster.scrape_roster(game_id)

For a given game scrapes the roster

Parameters:game_id – id for game
Returns:dict of players (home and away) and a dict for both head coaches

Save Pages

Saves the scraped docs so you don’t have to re-scrape them every time you want to parse the docs.

**** Don’t mess with this unless you know what you’re doing ****

hockey_scraper.utils.save_pages.check_file_exists(file_info)

Checks if the file exists. Also checks if the structure for holding scraped files exists; if not, it creates it.

Parameters:file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.
Returns:Boolean - True if it exists
hockey_scraper.utils.save_pages.create_base_file_path(file_info)

Creates the base file path for a given file

Parameters:file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.
Returns:path
hockey_scraper.utils.save_pages.create_dir_structure(dir_name)

Create the basic directory structure for docs_dir if not done yet. Creates the docs and csvs subdirs if they don't exist

Parameters:dir_name – Name of dir to create

Returns:None
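
A minimal sketch of that layout, assuming the 'docs' and 'csvs' subdirectories sit directly under docs_dir:

import os

def make_dir_structure(dir_name):
    for sub in ('docs', 'csvs'):
        os.makedirs(os.path.join(dir_name, sub), exist_ok=True)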

hockey_scraper.utils.save_pages.create_season_dirs(file_info)

Creates the infrastructure to hold all the scraped docs for a season

Parameters:file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.
Returns:None
hockey_scraper.utils.save_pages.get_page(file_info)

Get the file so we don’t need to re-scrape.

Try both compressed and non-compressed for backwards compatibility (files were formerly saved non-compressed)

Parameters:file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.
Returns:Response or None
hockey_scraper.utils.save_pages.is_compressed(file_info)

Check if the stored file is compressed, as we used to not save them compressed.

Parameters:file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.

Returns:Boolean

hockey_scraper.utils.save_pages.save_page(page, file_info)

Save the page we just scraped.

Note: It'll only get saved if the directory already exists!!! The path isn't validated or created for you, so make sure you get it right.

Parameters:
  • page – File scraped
  • file_info – Dictionary containing the info on the file. Includes the name, season, file type, and the dir we want to deposit any data in.
Returns:

None

Shared Functions

This file contains shared functions and general utilities used by the different scrapers in the package.

hockey_scraper.utils.shared.add_dir(user_dir)

Add directory to store scraped docs if valid. Or create in the home dir

NOTE: After this function runs, docs_dir is either None or a valid directory

Parameters:user_dir – Either True (create in the home directory) or a user-provided directory on their machine
Returns:None
hockey_scraper.utils.shared.check_data_format(data_format)

Checks that the data_format specified (if any) is either None, 'Csv', or 'pandas'. It exits the program with an error message if the input isn't valid.

Parameters:data_format – data_format provided
Returns:Boolean - True if good
hockey_scraper.utils.shared.check_valid_dates(from_date, to_date)

Check if it’s a valid date range

Parameters:
  • from_date – date should scrape from
  • to_date – date should scrape to
Returns:

None

hockey_scraper.utils.shared.convert_to_seconds(minutes)

Convert time elapsed in minutes format to seconds elapsed

Parameters:minutes – time elapsed
Returns:time elapsed in seconds
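
Assuming the elapsed time arrives as an 'MM:SS' string, a minimal sketch:

def to_seconds(minutes):
    # '12:35' -> 755 seconds elapsed
    mins, secs = minutes.split(':')
    return int(mins) * 60 + int(secs)
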
hockey_scraper.utils.shared.convert_tricode(tri)

Convert the tri-code if found in ‘tri_code_conversion.json’

Returns:Fixed tri-code or original

hockey_scraper.utils.shared.custom_formatwarning(msg, *args, **kwargs)

Override format for standard warnings

hockey_scraper.utils.shared.fix_name(name)

Check if a name falls under those that need fixing. If it does…fix it.

Parameters:name – name in pbp
Returns:Either the given parameter or the fixed name
hockey_scraper.utils.shared.get_file(file_info, force=False)

Get the specified file.

If a docs_dir is provided we check if it exists. If it does, we check whether it contains that page (and save it there if it doesn't). If the docs_dir doesn't exist we just scrape from the source and don't save.

Parameters:
  • file_info – Dictionary containing the info for the file. Contains the url, name, type, and season
  • force – Force a rescrape. Default is False
Returns:

page

hockey_scraper.utils.shared.get_logger(python_file)

Create a basic logger to a log file

Parameters:python_file – File that instantiates the logger instance
Returns:logger
hockey_scraper.utils.shared.get_season(date)

Get Season based on from_date

There is an exception for the 2019-2020 pandemic season. According to the below url:
Parameters:date – date
Returns:season -> ex: 2016 for 2016-2017 season
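
A rough sketch of the season rule, ignoring the pandemic exception noted above (the September cutoff is an assumption):

def season_from_date(date_str):
    # '2016-10-24' -> 2016, '2017-03-01' -> 2016
    year, month, _ = (int(x) for x in date_str.split('-'))
    return year if month >= 9 else year - 1
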
hockey_scraper.utils.shared.get_team(team)

Get the team

hockey_scraper.utils.shared.if_rescrape(user_rescrape)

Determines whether you want to re-scrape. If a non-boolean is provided, the program terminates.

Note: Only matters when you have a directory specified

Parameters:user_rescrape – Boolean
Returns:None
hockey_scraper.utils.shared.log_error(err, py_file)

Log error when Logging is specified

Parameters:
  • err – Error to log
  • python_file – File that instantiates the logger instance
Returns:

None

hockey_scraper.utils.shared.print_error(msg)

Implement own custom error using the warning module. Prints in red.

The reason I still use warnings for errors is so I can set them to be ignored if I want to (e.g. live_scrape line 200).

Parameters:msg – Str to print
Returns:None
hockey_scraper.utils.shared.print_warning(msg)

Implement own custom warning using the warning module. Prints in orange.

Parameters:msg – Str to print
Returns:None
hockey_scraper.utils.shared.scrape_page(url)

Scrape a given url

Parameters:url – url for page
Returns:response object
hockey_scraper.utils.shared.season_end_bound(year)

Determine the end bound of a given season. Changes depending on whether it's the pandemic season or not

Parameters:year – str of year for given date
Returns:Datetime obj of last date in season
hockey_scraper.utils.shared.season_start_bound(year)

Get start bound for a season.

Notes:
  • There is a bug in the schedule API for 2016 that pushes the start back to 09-30
  • Pandemic season started in January
Parameters:year – str of year for given date
Returns:str of first date in season
hockey_scraper.utils.shared.to_csv(base_file_name, df, league, file_type)

Write DataFrame to csv file

Parameters:
  • base_file_name – name of file
  • df – DataFrame
  • league – nhl or nwhl
  • file_type – type of file being deposited
Returns:

None