Merging Excel Worksheets using Pandas
In my prior post, I loaded 2 playlists in an iTunes text export format into pandas dataframes to match into 1 single dataframe.
Here I repeat the last step of the prior post assuming that the 2 text files were first imported into excel.
For those who aren't familiar, this post is composed in the form of a Jupyter Notebook, which is an open document format that combines text, code, data, and graphics and is viewable through the web browser – if you have not used it before I encourage you to try it out!
You can download the notebook containing this post here, open it with Jupyter, and start asking your own questions of the data.
Setup notebook output options¶
Set notebook output formatting options:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
InteractiveShell.log_level = 'INFO'
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Import standard libraries¶
Some standard Python package imports:
%matplotlib inline
import pandas as pd
import numpy as np
import logging
import sys
from collections import OrderedDict
log = logging.getLogger()
log.handlers = []
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)
log.addHandler(ch)
pd.set_option('display.width',1000)
# pd.set_option('max_colwidth',200)
pd.reset_option('max_colwidth')
Import Itunes Playlist Data¶
We'll start by exporting the itunes playlist data (for howto export itunes playlists itunes). I chose the text export option rather than xml.
After exporting, I opened and saved the file in excel. I now have 2 playlist excel files as seen in the example below.
Note that I could directly import the text formatted file. I use excel here to simply demonstrate that excel based imports will also work.
We now load the 2 playlists into list_of_playlists.
all_files = ['data/Top80_sPop&Rock.xlsx','data/TopAlt&Electronic.xlsx']
list_of_playlists= []
for file_ in all_files:
xls_file = pd.ExcelFile(file_)
df = xls_file.parse(0) ## load first sheet - there should only be one
df.drop_duplicates(subset=['Artist', 'Name'], inplace=True)
log.info("## %s: df.shape=%s" % (file_, str(df.shape)))
list_of_playlists.append(df)
inner_df = reduce(lambda left,right: pd.merge(left,right,on=['Artist','Name'], how='inner'), list_of_playlists)
merged_df = reduce(lambda left,right: pd.merge(left,right,on=['Artist','Name'], how='outer'), list_of_playlists)
log.info("inner_df.shape=%s" % str(inner_df.shape))
log.info("merged_df.shape=%s" % str(merged_df.shape))
df1_cnt=list_of_playlists[0].shape[0]
df2_cnt=list_of_playlists[1].shape[0]
df_inner_cnt=inner_df.shape[0]
df_union_cnt=df1_cnt+df2_cnt-df_inner_cnt
df_merged_cnt=merged_df.shape[0]
log.info("Rows in dataset1=%s" % df1_cnt)
log.info("Rows in dataset2=%s" % df2_cnt)
log.info("Rows in intersection=%s" % df_inner_cnt)
log.info("Rows in union=%s" % df_union_cnt)
log.info("(Rows in union==Rows in merged) = %s" % (df_union_cnt==df_merged_cnt))
2018-04-24 15:35:13,785 - INFO - ## data/Top80_sPop&Rock.xlsx: df.shape=(634, 31) 2018-04-24 15:35:13,857 - INFO - ## data/TopAlt&Electronic.xlsx: df.shape=(185, 31) 2018-04-24 15:35:13,882 - INFO - inner_df.shape=(67, 60) 2018-04-24 15:35:13,885 - INFO - merged_df.shape=(752, 60) 2018-04-24 15:35:13,888 - INFO - Rows in dataset1=634 2018-04-24 15:35:13,891 - INFO - Rows in dataset2=185 2018-04-24 15:35:13,892 - INFO - Rows in intersection=67 2018-04-24 15:35:13,894 - INFO - Rows in union=752 2018-04-24 15:35:13,897 - INFO - (Rows in union==Rows in merged) = True
So the data set union count agrees with the merged dataframe count.