Parsing SC2 replays for later analysis
I've realized I owe you an explanation on how to parse your own SC2 replays for the series of posts on Bayesian SC2 replay data analysis. Let's go through it here!
We'll use ZephyrBlu's zephyrus-sc2-parser library, which you can install via pip install zephyrus-sc2-parser.
Parsing the replays
This process currently dumps a boatload of warnings and exceptions, so I'm choosing to wrap the parsing call in a try-except and simply ignore the exceptions, and to suppress the warnings with warnings.simplefilter("ignore"). Feel free to turn the suppression off on your end; but don't say I didn't warn you!
Note that this process takes a while, and we'll have to do some wrangling later on, so it makes more sense to parse all the replays first and have them all in memory for later. Keep in mind, though, that holding everything in memory might not scale to much larger datasets.
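If memory does become a concern, one way around it is to extract the summary row for each replay right after parsing it, and let the full parse get garbage-collected. Here's a minimal sketch of that pattern; parse_and_extract is a hypothetical stub standing in for zephyrus_sc2_parser.parse_replay plus the extraction loop we write later in this post.

```python
def parse_and_extract(replay_file):
    # stub: the real version would call zephyrus_sc2_parser.parse_replay
    # and build the per-game dict we construct further down
    return {"replay_file": replay_file, "win": True}

def extract_all(replay_files):
    """Parse and immediately reduce each replay, keeping only small dicts."""
    rows = []
    for replay_file in replay_files:
        try:
            rows.append(parse_and_extract(replay_file))
        except Exception as e:
            print(f"Failed for {replay_file}: {e}")
    return rows

rows = extract_all(["a.SC2Replay", "b.SC2Replay"])
```

This trades the ability to poke at the parsed replays interactively for a flat memory footprint.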
import pathlib
import warnings
import tqdm.auto as tqdm
import zephyrus_sc2_parser
REPLAY_DIRECTORY = "/home/dominik/Links/SC2Reps"
PLAYER_NAME = "Perfi"
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    replays = list(pathlib.Path(REPLAY_DIRECTORY).glob("*.SC2Replay"))
    parsed_replays = {}
    for replay_file in tqdm.tqdm(replays):
        try:
            replay = zephyrus_sc2_parser.parse_replay(replay_file, local=True)
        except Exception as e:
            print(f"Failed for {replay_file}: {e}")
            continue
        parsed_replays[replay_file] = replay
And I have absolutely no idea how to explain the Logging error. We aren't missing out on many games, though:
print(f"We successfully parsed {len(parsed_replays)} replays, which is {len(parsed_replays)/len(replays):.2%} of the total!")
That was the first step; now, we continue to...
Pull the interesting data
Note that this mostly handles 1v1 data; filtering out things such as co-op and team games might take a bit more work. I would probably recommend filtering them out at an earlier stage, by filename.
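A filename-based filter could look something like the sketch below. The filenames and the mode markers here are made up; check what your own replay files are actually named before relying on any particular pattern.

```python
# Hypothetical filenames: ladder replays are usually named after the map,
# while other modes often carry a recognizable marker.
replay_names = [
    "Romanticide LE (123).SC2Replay",
    "[Co-op] Void Thrashing (4).SC2Replay",
    "2000 Atmospheres LE (7).SC2Replay",
]

def looks_like_ladder_game(name, blocked_markers=("[Co-op]", "[Arcade]")):
    """Crude filename-based filter; adjust the markers to your own library."""
    return not any(marker in name for marker in blocked_markers)

ladder_replays = [n for n in replay_names if looks_like_ladder_game(n)]
```

Applying this to the glob results before parsing would also save you the time spent parsing games you'll discard anyway.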
# utility function to get our own player ID
def grab_player_id(players, name=PLAYER_NAME):
    for key, player in players.items():
        if player.name == name:
            break
    else:
        key = None
    return key
results = []

for replay_file, replay in parsed_replays.items():
    if all(item is None for item in replay):
        print(f"Failed to parse for {replay_file}")
        continue
    players, timeline, engagements, summary, meta = replay
    my_id = grab_player_id(players, PLAYER_NAME)
    enemy_id = 1 if (my_id == 2) else 2
    results.append(
        dict(
            replay_file=replay_file,
            time_played_at=meta["time_played_at"],
            win=meta["winner"] == my_id,
            race=players[my_id].race,
            enemy_race=players[enemy_id].race,
            mmr=summary["mmr"][my_id],
            enemy_mmr=summary["mmr"][enemy_id],
            enemy_nickname=players[enemy_id].name,
            map_name=meta["map"],
            duration=meta["game_length"],
        )
    )
print(f"We successfully pulled data out of {len(results)} replays, which is {len(results)/len(replays):.2%} of the total!")
What I'm showing you here is the end result, but if you wanted to add some other metrics, you might be interested in the answer to:
How do I pick the interesting data?
We'll use the entries from the last replay. Most of them are dictionaries, so it's pretty easy to get access to their contents:
meta
If you run this notebook locally, IPython has a nice widget to browse this data. If you're reading this on the website, you'll unfortunately probably see only <IPython.core.display.JSON object>:
from IPython.display import JSON
JSON(summary)
summary.keys()
As you can (possibly) see, there's plenty of interesting data that I might use sometime. Beyond what we're already pulling out:
- average resource collection rate
- the spending quotient, a (possibly flawed) measure of macro skill
- time spent supply blocked
- workers lost, killed and produced
- per-race statistics:
  - Orbital Command energy efficiency and idle time
  - likewise for Nexii (Nexuses?)
  - splash efficiency for Protoss
I probably wouldn't use Bayesian inference on all of them, though - it gets hard to come up with a sensible model that involves every one of them. Maybe a random forest model would be nice?
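For the curious, a random forest on these per-game features would be only a few lines with scikit-learn. This is just a hedged sketch: the feature values below are synthetic stand-ins, not real replay data, and the label is deliberately constructed from the MMR difference so the model has something to learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_games = 200
# synthetic stand-ins for the columns we extracted above
mmr_diff = rng.normal(0, 200, n_games)
duration = rng.uniform(300, 1800, n_games)
X = np.column_stack([mmr_diff, duration])
# toy label: a higher MMR difference makes a win more likely
y = (mmr_diff + rng.normal(0, 150, n_games)) > 0

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
```

Unlike a hand-built Bayesian model, this gives you no interpretable posterior, but it scales effortlessly to however many summary statistics you decide to throw at it.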
Either way, once we've found something interesting it's simple to access the fields:
summary['apm'][1]
It's a bit more difficult to pull data out of players, as the data there is stored in dedicated objects; we can still make do:
clean_data = {}
for player_id, player in players.items():
    d = player.__dict__.copy()
    # we have to drop some data that contains custom objects:
    for dropped_key in ["current_selection", "objects", "control_groups", "pac_list", "current_pac", "active_ability"]:
        d.pop(dropped_key)
    clean_data[player_id] = d
JSON(clean_data)
I'll showcase a few:
players[2].upgrades
players[2].supply_block
players[2].resources_collected
A bunch of these keys, such as unspent_resources, are time data, taken at discrete snapshots during the game. There's more time data, of course, in timeline:
JSON(timeline)
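Snapshot series like these are easiest to explore once loaded into a pandas DataFrame indexed by snapshot. The structure below (lists of per-snapshot values under resource-name keys) is an assumption standing in for something like players[2].resources_collected; inspect your own data first.

```python
import pandas as pd

# made-up stand-in for a per-snapshot resource series
resources_collected = {
    "minerals": [0, 340, 760, 1210],
    "gas": [0, 0, 152, 388],
}
resources = pd.DataFrame(resources_collected)
resources.index.name = "snapshot"
# .diff() turns cumulative totals into per-snapshot collection deltas
deltas = resources.diff()
```

From here, plotting or joining against other snapshot series is straightforward.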
And I haven't yet been able to figure this one out:
engagements
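When a structure is undocumented like this, a little introspection goes a long way. Here's a small helper that prints the types and dict keys of a nested structure, level by level; the sample value is made up, so run it on the real engagements object instead.

```python
def describe(obj, depth=0, max_depth=3):
    """Return type/shape descriptions of a nested structure, one per level."""
    indent = "  " * depth
    lines = []
    if isinstance(obj, dict):
        lines.append(f"{indent}dict with keys: {sorted(obj)}")
        if depth < max_depth:
            for value in obj.values():
                lines += describe(value, depth + 1, max_depth)
    elif isinstance(obj, (list, tuple)):
        lines.append(f"{indent}{type(obj).__name__} of length {len(obj)}")
        if obj and depth < max_depth:
            # assume list elements share a shape; describe the first one
            lines += describe(obj[0], depth + 1, max_depth)
    else:
        lines.append(f"{indent}{type(obj).__name__}")
    return lines

sample = [{"start": 120, "end": 145, "armies": {1: 2300, 2: 1900}}]
print("\n".join(describe(sample)))
```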
Saving our results to DataFrame, then to CSV
We'll also calculate the MMR difference at this step.
import pandas as pd
df = pd.DataFrame(results)
df['mmr_diff'] = df.mmr - df.enemy_mmr
df
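This is also a good point for quick sanity checks; for instance, win rate per matchup is one groupby away. The rows below are synthetic stand-ins shaped like our results, so this snippet runs on its own.

```python
import pandas as pd

games = pd.DataFrame({
    "enemy_race": ["Zerg", "Zerg", "Protoss", "Terran"],
    "win": [True, False, True, True],
})
# mean of a boolean column is exactly the win rate
winrates = games.groupby("enemy_race")["win"].mean()
```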
And we dump that to CSV, and we're done!
df.to_csv("/home/dominik/Writing/blog/files/replays.csv")
TL;DR version
Feel free to take this script and modify as you see fit!
import pathlib
import warnings
import tqdm.auto as tqdm
import zephyrus_sc2_parser
REPLAY_DIRECTORY = "/home/dominik/Links/SC2Reps"
PLAYER_NAME = "Perfi"
OUTPUT_CSV = "/home/dominik/Writing/blog/files/replays.csv"
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    replays = list(pathlib.Path(REPLAY_DIRECTORY).glob("*.SC2Replay"))
    parsed_replays = {}
    for replay_file in tqdm.tqdm(replays):
        try:
            replay = zephyrus_sc2_parser.parse_replay(replay_file, local=True)
        except Exception as e:
            print(f"Failed for {replay_file}: {e}")
            continue
        parsed_replays[replay_file] = replay

print(f"We successfully parsed {len(parsed_replays)} replays, which is {len(parsed_replays)/len(replays):.2%} of the total!")
# utility function to get our own player ID
def grab_player_id(players, name=PLAYER_NAME):
    for key, player in players.items():
        if player.name == name:
            break
    else:
        key = None
    return key
results = []

for replay_file, replay in parsed_replays.items():
    if all(item is None for item in replay):
        print(f"Failed to parse for {replay_file}")
        continue
    players, timeline, engagements, summary, meta = replay
    my_id = grab_player_id(players, PLAYER_NAME)
    enemy_id = 1 if (my_id == 2) else 2
    mmr = summary["mmr"][my_id]
    enemy_mmr = summary["mmr"][enemy_id]
    results.append(
        dict(
            replay_file=replay_file,
            time_played_at=meta["time_played_at"],
            win=meta["winner"] == my_id,
            mmr=mmr,
            enemy_mmr=enemy_mmr,
            mmr_diff=mmr - enemy_mmr,
            race=players[my_id].race,
            enemy_race=players[enemy_id].race,
            enemy_nickname=players[enemy_id].name,
            map_name=meta["map"],
            duration=meta["game_length"],
        )
    )
print(f"We successfully pulled data out of {len(results)} replays, which is {len(results)/len(replays):.2%} of the total!")
import pandas as pd
df = pd.DataFrame(results)
df['mmr_diff'] = df.mmr - df.enemy_mmr
df.to_csv(OUTPUT_CSV)
If you have questions about this sort of thing, I'll be happy to help - ask away! :)