{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Merging iTunes Playlists using Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*This notebook originally appeared as a [post](http://leeblog.org/blog/2017/12/18/simulating-chutes-and-ladders/) on the blog [Python Automation Journal](http://leeblog.org).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "I recently wanted to see how many songs were common between 2 of my iTunes playlists.\n", "\n", "Here I'll take a look at some of the basic questions you can answer with this data.\n", "Later I hope to find the time to dig deeper and ask some more interesting and creative questions – stay tuned!\n", "\n", "\n", "\n", "For those who aren't familiar, this post is composed in the form of a [Jupyter Notebook](https://jupyter.org/), which is an open document format that combines text, code, data, and graphics and is viewable through the web browser – if you have not used it before I encourage you to try it out!\n", "\n", "You can download the notebook containing this post [here](http://leeblog.org/downloads/notebooks/MergeItunesPlaylists.ipynb), open it with Jupyter, and start asking your own questions of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup notebook output options\n", "\n", "Set notebook output formatting options:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"\n", "InteractiveShell.log_level = 'INFO'\n", "\n", "from IPython.core.display import display, HTML\n", "display(HTML(\"\"))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import standard libraries\n", "\n", "Some standard Python package imports:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import pandas as pd\n", "import numpy as np\n", "import logging\n", "import sys\n", "import codecs\n", "\n", "from collections import OrderedDict\n", "\n", "log = logging.getLogger()\n", "log.handlers = []\n", "ch = logging.StreamHandler()\n", "ch.setLevel(logging.DEBUG)\n", "formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')\n", "ch.setFormatter(formatter)\n", "log.setLevel(logging.INFO)\n", "# log.setLevel(logging.DEBUG)\n", "log.addHandler(ch)\n", "\n", "pd.set_option('display.width',1000)\n", "# pd.set_option('max_colwidth',200)\n", "pd.reset_option('max_colwidth')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Itunes Playlist Data\n", "\n", "We'll start by exporting the itunes playlist data (for howto export itunes playlists [itunes](https://support.apple.com/kb/PH19491?locale=en_US)).\n", "I chose the text export option rather than xml.\n", "\n", "We now load the 2 playlists into list_of_playlists.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:30,966 - INFO - ## data/Top80_sPop&Rock.txt: df.head()=\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposerAlbumGroupingWorkMovement NumberMovement CountMovement NameGenre...Volume AdjustmentKindEqualizerCommentsPlaysLast PlayedSkipsLast SkippedMy RatingLocation
0Just The Way You AreBilly JoelBilly JoelGreatest Hits Vol.INaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNNaN154.04/7/2018 9:49 AM2.012/27/2017 10:16 AM100C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...
1She's Got A WayBilly JoelBilly JoelGreatest Hits Vol.IINaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNNaN144.03/25/2018 2:02 PM3.01/20/2018 8:18 AM100C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...
2ThunderstruckBrian JohnsonNaNThe Very Best of AC/DCNaNNaNNaNNaNNaNHard Rock, 80s...NaNMPEG audio fileNaNNaN96.02/7/2018 11:01 AM7.01/8/2018 6:35 PM100C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Ver...
3Don't Stop Believin'JourneyJonathan Cain/Neal Schon/Steve PerryGreatest Hits (Limited Gold Edition)NaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNEncoded from Original CD album. Lame v3.98.2 ...86.02/9/2018 7:07 AM6.01/6/2018 12:45 PM100C:\\Users\\Lee\\itunes\\Music\\Journey\\Greatest Hit...
4Fade To GreyVisageNaNNew Wave Hits Of The '80s, Vol. 3NaNNaNNaNNaNNaNRock, 80s, Alternative...NaNMPEG audio fileNaNNaN62.04/9/2018 7:23 PMNaNNaN100C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...
\n", "

5 rows × 31 columns

\n", "
" ], "text/plain": [ " Name Artist Composer Album Grouping Work Movement Number Movement Count Movement Name Genre ... Volume Adjustment Kind Equalizer Comments Plays Last Played Skips Last Skipped My Rating Location\n", "0 Just The Way You Are Billy Joel Billy Joel Greatest Hits Vol.I NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN NaN 154.0 4/7/2018 9:49 AM 2.0 12/27/2017 10:16 AM 100 C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...\n", "1 She's Got A Way Billy Joel Billy Joel Greatest Hits Vol.II NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN NaN 144.0 3/25/2018 2:02 PM 3.0 1/20/2018 8:18 AM 100 C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...\n", "2 Thunderstruck Brian Johnson NaN The Very Best of AC/DC NaN NaN NaN NaN NaN Hard Rock, 80s ... NaN MPEG audio file NaN NaN 96.0 2/7/2018 11:01 AM 7.0 1/8/2018 6:35 PM 100 C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Ver...\n", "3 Don't Stop Believin' Journey Jonathan Cain/Neal Schon/Steve Perry Greatest Hits (Limited Gold Edition) NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN Encoded from Original CD album. Lame v3.98.2 ... 86.0 2/9/2018 7:07 AM 6.0 1/6/2018 12:45 PM 100 C:\\Users\\Lee\\itunes\\Music\\Journey\\Greatest Hit...\n", "4 Fade To Grey Visage NaN New Wave Hits Of The '80s, Vol. 3 NaN NaN NaN NaN NaN Rock, 80s, Alternative ... NaN MPEG audio file NaN NaN 62.0 4/9/2018 7:23 PM NaN NaN 100 C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...\n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:31,000 - INFO - ## data/Top80_sPop&Rock.txt: df.shape=(678, 31)\n", "2018-04-24 15:30:31,012 - INFO - ## data/TopAlt&Electronic.txt: df.head()=\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposerAlbumGroupingWorkMovement NumberMovement CountMovement NameGenre...Volume AdjustmentKindEqualizerCommentsPlaysLast PlayedSkipsLast SkippedMy RatingLocation
0The WalkerFitz & The TantrumsNaNMore Than Just a Dream (Deluxe Version)NaNNaNNaNNaNNaNAlternative...NaNMPEG audio fileNaNNaN176.012/5/2017 10:46 AM10.01/8/2018 7:45 PM100C:\\Users\\Lee\\itunes\\Music\\Fitz & The Tantrums\\...
1Jack The RipperMorrisseyNaNNaNNaNNaNNaNNaNNaNAlternative, Punk...NaNMPEG audio fileNaNNaN436.03/12/2018 7:59 PM13.09/28/2017 6:59 PM100C:\\Users\\Lee\\itunes\\Music\\Morrissey\\Unknown Al...
2HeathensTwenty One PilotsNaNSuicide Squad: The AlbumNaNNaNNaNNaNNaNIndie Rock, Alternative...NaNMPEG audio fileNaNNaN176.011/21/2017 2:29 PMNaNNaN100C:\\Users\\Lee\\itunes\\Music\\Twenty One Pilots\\Su...
3Fade To GreyVisageNaNNew Wave Hits Of The '80s, Vol. 3NaNNaNNaNNaNNaNRock, 80s, Alternative...NaNMPEG audio fileNaNNaN62.04/9/2018 7:23 PMNaNNaN100C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...
4Take A Chance On MeABBANaNThe Essential Collection (CD 1/2)NaNNaNNaNNaNNaNDisco/Dance/Electronica/Pop...NaNMPEG audio fileNaNNaN96.011/21/2017 2:37 PM23.010/30/2017 6:33 PM80C:\\Users\\Lee\\itunes\\Music\\ABBA\\The Essential C...
\n", "

5 rows × 31 columns

\n", "
" ], "text/plain": [ " Name Artist Composer Album Grouping Work Movement Number Movement Count Movement Name Genre ... Volume Adjustment Kind Equalizer Comments Plays Last Played Skips Last Skipped My Rating Location\n", "0 The Walker Fitz & The Tantrums NaN More Than Just a Dream (Deluxe Version) NaN NaN NaN NaN NaN Alternative ... NaN MPEG audio file NaN NaN 176.0 12/5/2017 10:46 AM 10.0 1/8/2018 7:45 PM 100 C:\\Users\\Lee\\itunes\\Music\\Fitz & The Tantrums\\...\n", "1 Jack The Ripper Morrissey NaN NaN NaN NaN NaN NaN NaN Alternative, Punk ... NaN MPEG audio file NaN NaN 436.0 3/12/2018 7:59 PM 13.0 9/28/2017 6:59 PM 100 C:\\Users\\Lee\\itunes\\Music\\Morrissey\\Unknown Al...\n", "2 Heathens Twenty One Pilots NaN Suicide Squad: The Album NaN NaN NaN NaN NaN Indie Rock, Alternative ... NaN MPEG audio file NaN NaN 176.0 11/21/2017 2:29 PM NaN NaN 100 C:\\Users\\Lee\\itunes\\Music\\Twenty One Pilots\\Su...\n", "3 Fade To Grey Visage NaN New Wave Hits Of The '80s, Vol. 3 NaN NaN NaN NaN NaN Rock, 80s, Alternative ... NaN MPEG audio file NaN NaN 62.0 4/9/2018 7:23 PM NaN NaN 100 C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...\n", "4 Take A Chance On Me ABBA NaN The Essential Collection (CD 1/2) NaN NaN NaN NaN NaN Disco/Dance/Electronica/Pop ... NaN MPEG audio file NaN NaN 96.0 11/21/2017 2:37 PM 23.0 10/30/2017 6:33 PM 80 C:\\Users\\Lee\\itunes\\Music\\ABBA\\The Essential C...\n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:31,048 - INFO - ## data/TopAlt&Electronic.txt: df.shape=(186, 31)\n" ] } ], "source": [ "\n", "all_files = ['data/Top80_sPop&Rock.txt','data/TopAlt&Electronic.txt']\n", "list_of_playlists= []\n", "for file_ in all_files:\n", " doc = codecs.open(file_,'rU','UTF-16') \n", " df = pd.read_csv(doc,index_col=None, header=0,sep='\\t')\n", " log.info(\"## %s: df.head()=\" % file_)\n", " df.head()\n", " log.info(\"## %s: df.shape=%s\" % (file_, str(df.shape)))\n", " list_of_playlists.append(df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vertically Stack the Playlist Data\n", "\n", "For our first example - I will load the playlists in list_of_playlists into dataframes and stack vertically.\n", "\n", "When we concatenate DataFrames, we need to specify the axis. axis=0 tells pandas to stack the second DataFrame under the first one. It will automatically detect whether the column names are the same and will stack accordingly. To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets. \n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:36,223 - INFO - ## vertical_stack.head()=\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposerAlbumGroupingWorkMovement NumberMovement CountMovement NameGenre...Volume AdjustmentKindEqualizerCommentsPlaysLast PlayedSkipsLast SkippedMy RatingLocation
0Just The Way You AreBilly JoelBilly JoelGreatest Hits Vol.INaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNNaN154.04/7/2018 9:49 AM2.012/27/2017 10:16 AM100C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...
1She's Got A WayBilly JoelBilly JoelGreatest Hits Vol.IINaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNNaN144.03/25/2018 2:02 PM3.01/20/2018 8:18 AM100C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...
2ThunderstruckBrian JohnsonNaNThe Very Best of AC/DCNaNNaNNaNNaNNaNHard Rock, 80s...NaNMPEG audio fileNaNNaN96.02/7/2018 11:01 AM7.01/8/2018 6:35 PM100C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Ver...
3Don't Stop Believin'JourneyJonathan Cain/Neal Schon/Steve PerryGreatest Hits (Limited Gold Edition)NaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNEncoded from Original CD album. Lame v3.98.2 ...86.02/9/2018 7:07 AM6.01/6/2018 12:45 PM100C:\\Users\\Lee\\itunes\\Music\\Journey\\Greatest Hit...
4Fade To GreyVisageNaNNew Wave Hits Of The '80s, Vol. 3NaNNaNNaNNaNNaNRock, 80s, Alternative...NaNMPEG audio fileNaNNaN62.04/9/2018 7:23 PMNaNNaN100C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...
\n", "

5 rows × 31 columns

\n", "
" ], "text/plain": [ " Name Artist Composer Album Grouping Work Movement Number Movement Count Movement Name Genre ... Volume Adjustment Kind Equalizer Comments Plays Last Played Skips Last Skipped My Rating Location\n", "0 Just The Way You Are Billy Joel Billy Joel Greatest Hits Vol.I NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN NaN 154.0 4/7/2018 9:49 AM 2.0 12/27/2017 10:16 AM 100 C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...\n", "1 She's Got A Way Billy Joel Billy Joel Greatest Hits Vol.II NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN NaN 144.0 3/25/2018 2:02 PM 3.0 1/20/2018 8:18 AM 100 C:\\Users\\Lee\\itunes\\Music\\Billy Joel\\Greatest ...\n", "2 Thunderstruck Brian Johnson NaN The Very Best of AC/DC NaN NaN NaN NaN NaN Hard Rock, 80s ... NaN MPEG audio file NaN NaN 96.0 2/7/2018 11:01 AM 7.0 1/8/2018 6:35 PM 100 C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Ver...\n", "3 Don't Stop Believin' Journey Jonathan Cain/Neal Schon/Steve Perry Greatest Hits (Limited Gold Edition) NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN Encoded from Original CD album. Lame v3.98.2 ... 86.0 2/9/2018 7:07 AM 6.0 1/6/2018 12:45 PM 100 C:\\Users\\Lee\\itunes\\Music\\Journey\\Greatest Hit...\n", "4 Fade To Grey Visage NaN New Wave Hits Of The '80s, Vol. 3 NaN NaN NaN NaN NaN Rock, 80s, Alternative ... NaN MPEG audio file NaN NaN 62.0 4/9/2018 7:23 PM NaN NaN 100 C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...\n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:36,266 - INFO - vertical_stack.shape=(864, 31)\n" ] } ], "source": [ "\n", "# Stack the DataFrames on top of each other\n", "vertical_stack = pd.concat(list_of_playlists, axis=0)\n", "\n", "log.info(\"## vertical_stack.head()=\")\n", "vertical_stack.head()\n", "log.info(\"vertical_stack.shape=%s\" % str(vertical_stack.shape))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge into one dataframe based on composite keys.\n", "\n", "We can also merge the list dataframe into one dataframe based on a common unique key - or multiple keys. \n", "Note that I refer to the term \"composite key\" when there multiple keys used to identify a record.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:39,713 - INFO - ## merged_df.head()=\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposer_xAlbum_xGrouping_xWork_xMovement Number_xMovement Count_xMovement Name_xGenre_x...Volume Adjustment_yKind_yEqualizer_yComments_yPlays_yLast Played_ySkips_yLast Skipped_yMy Rating_yLocation_y
0Fade To GreyVisageNaNNew Wave Hits Of The '80s, Vol. 3NaNNaNNaNNaNNaNRock, 80s, Alternative...NaNMPEG audio fileNaNNaN62.04/9/2018 7:23 PMNaNNaN100C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...
1I Could Be HappyAltered ImagesNaNNew Wave Hits Of The '80s, Vol. 04NaNNaNNaNNaNNaNPop, New Wave, 80s, Alternative...NaNMPEG audio fileNaNNaN66.04/11/2018 5:44 PM1.01/6/2018 8:21 PM80C:\\Users\\Lee\\itunes\\Music\\Altered Images\\New W...
2Dance This Mess AroundThe B-52'sCindy Wilson/Fred Schneider/Kate Pierson/Keith...B-52'sNaNNaNNaNNaNNaNAlternative, Punk, 80s...NaNMPEG audio fileNaNBlack1000151.04/10/2018 3:09 PM6.03/7/2018 11:58 AM80C:\\Users\\Lee\\itunes\\Music\\The B-52's\\B-52's\\03...
3Fight For Your RightBeastie BoysNaNThe Best 80s SongsNaN.NaNNaNNaN80s, Pop, Alternative...NaNMPEG audio fileNaNNaN53.04/9/2018 8:18 PM1.01/2/2018 1:37 AM80C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...
4Just What I NeededThe CarsRic OcasekComplete Greatest HitsNaNNaNNaNNaNNaNPop, Electronic, 80s...NaNMPEG audio fileNaNNaN112.01/8/2018 12:44 PM4.01/6/2018 10:59 PM80C:\\Users\\Lee\\itunes\\Music\\The Cars\\Complete Gr...
\n", "

5 rows × 60 columns

\n", "
" ], "text/plain": [ " Name Artist Composer_x Album_x Grouping_x Work_x Movement Number_x Movement Count_x Movement Name_x Genre_x ... Volume Adjustment_y Kind_y Equalizer_y Comments_y Plays_y Last Played_y Skips_y Last Skipped_y My Rating_y Location_y\n", "0 Fade To Grey Visage NaN New Wave Hits Of The '80s, Vol. 3 NaN NaN NaN NaN NaN Rock, 80s, Alternative ... NaN MPEG audio file NaN NaN 62.0 4/9/2018 7:23 PM NaN NaN 100 C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...\n", "1 I Could Be Happy Altered Images NaN New Wave Hits Of The '80s, Vol. 04 NaN NaN NaN NaN NaN Pop, New Wave, 80s, Alternative ... NaN MPEG audio file NaN NaN 66.0 4/11/2018 5:44 PM 1.0 1/6/2018 8:21 PM 80 C:\\Users\\Lee\\itunes\\Music\\Altered Images\\New W...\n", "2 Dance This Mess Around The B-52's Cindy Wilson/Fred Schneider/Kate Pierson/Keith... B-52's NaN NaN NaN NaN NaN Alternative, Punk, 80s ... NaN MPEG audio file NaN Black1000 151.0 4/10/2018 3:09 PM 6.0 3/7/2018 11:58 AM 80 C:\\Users\\Lee\\itunes\\Music\\The B-52's\\B-52's\\03...\n", "3 Fight For Your Right Beastie Boys NaN The Best 80s Songs NaN . NaN NaN NaN 80s, Pop, Alternative ... NaN MPEG audio file NaN NaN 53.0 4/9/2018 8:18 PM 1.0 1/2/2018 1:37 AM 80 C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...\n", "4 Just What I Needed The Cars Ric Ocasek Complete Greatest Hits NaN NaN NaN NaN NaN Pop, Electronic, 80s ... NaN MPEG audio file NaN NaN 112.0 1/8/2018 12:44 PM 4.0 1/6/2018 10:59 PM 80 C:\\Users\\Lee\\itunes\\Music\\The Cars\\Complete Gr...\n", "\n", "[5 rows x 60 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:39,750 - INFO - merged_df.shape=(72, 60)\n" ] } ], "source": [ "df_playlist1=list_of_playlists[0]\n", "df_playlist2=list_of_playlists[1]\n", "\n", "merged_df = pd.merge(df_playlist1, df_playlist2, on=['Artist','Name'])\n", "\n", "log.info(\"## merged_df.head()=\")\n", "merged_df.head()\n", "log.info(\"merged_df.shape=%s\" % str(merged_df.shape))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the Pandas merge operation acts with an “inner” merge. An inner merge, (or inner join) keeps only the common values in both the left and right dataframes for the result.\n", "\n", "Note that the # of columns minus the specified common keys has doubled.\n", "\n", "The merged dataframe preserves the original data columns from each of the datasets.\n", "\n", "The merged results has an appended suffix for each set of columns, \"_x\" and \"_y\" in the prior example. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inner, Left, and right merge types\n", "\n", "There are three different types of merges available in Pandas. These merge types are common across most database and data-orientated languages (SQL, R, SAS) and are typically referred to as “joins”. If you don’t know them, learn them now.\n", "\n", "1. Inner Merge / Inner join – The default Pandas behaviour, only keep rows where the merge “on” value exists in both the left and right dataframes.\n", "2. Left Merge / Left outer join – (aka left merge or left join) Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.\n", "3. Right Merge / Right outer join – (aka right merge or right join) Keep every row in the right dataframe. Where there are missing values of the “on” variable in the left column, add empty / NaN values in the result.\n", "4. Outer Merge / Full outer join – A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere.\n", "\n", "The merge type to use is specified using the “how” parameter in the merge command, taking values “left”, “right”, “inner” (default), or “outer”.\n", "\n", "Venn diagrams are commonly used to exemplify the different merge and join types. See this example from [Stack overflow](http://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join):\n", "\n", "[img: Venn Diagram for merge types]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reduce into one dataframe based on the composite key\n", "\n", "We can also use reduce to merge the list dataframe into one.\n", "\n", "This gets the same results as the merge in the last example." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:44,378 - INFO - ## merged_df.head()=\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposer_xAlbum_xGrouping_xWork_xMovement Number_xMovement Count_xMovement Name_xGenre_x...Volume Adjustment_yKind_yEqualizer_yComments_yPlays_yLast Played_ySkips_yLast Skipped_yMy Rating_yLocation_y
0Fade To GreyVisageNaNNew Wave Hits Of The '80s, Vol. 3NaNNaNNaNNaNNaNRock, 80s, Alternative...NaNMPEG audio fileNaNNaN62.04/9/2018 7:23 PMNaNNaN100C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...
1I Could Be HappyAltered ImagesNaNNew Wave Hits Of The '80s, Vol. 04NaNNaNNaNNaNNaNPop, New Wave, 80s, Alternative...NaNMPEG audio fileNaNNaN66.04/11/2018 5:44 PM1.01/6/2018 8:21 PM80C:\\Users\\Lee\\itunes\\Music\\Altered Images\\New W...
2Dance This Mess AroundThe B-52'sCindy Wilson/Fred Schneider/Kate Pierson/Keith...B-52'sNaNNaNNaNNaNNaNAlternative, Punk, 80s...NaNMPEG audio fileNaNBlack1000151.04/10/2018 3:09 PM6.03/7/2018 11:58 AM80C:\\Users\\Lee\\itunes\\Music\\The B-52's\\B-52's\\03...
3Fight For Your RightBeastie BoysNaNThe Best 80s SongsNaN.NaNNaNNaN80s, Pop, Alternative...NaNMPEG audio fileNaNNaN53.04/9/2018 8:18 PM1.01/2/2018 1:37 AM80C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...
4Just What I NeededThe CarsRic OcasekComplete Greatest HitsNaNNaNNaNNaNNaNPop, Electronic, 80s...NaNMPEG audio fileNaNNaN112.01/8/2018 12:44 PM4.01/6/2018 10:59 PM80C:\\Users\\Lee\\itunes\\Music\\The Cars\\Complete Gr...
\n", "

5 rows × 60 columns

\n", "
" ], "text/plain": [ " Name Artist Composer_x Album_x Grouping_x Work_x Movement Number_x Movement Count_x Movement Name_x Genre_x ... Volume Adjustment_y Kind_y Equalizer_y Comments_y Plays_y Last Played_y Skips_y Last Skipped_y My Rating_y Location_y\n", "0 Fade To Grey Visage NaN New Wave Hits Of The '80s, Vol. 3 NaN NaN NaN NaN NaN Rock, 80s, Alternative ... NaN MPEG audio file NaN NaN 62.0 4/9/2018 7:23 PM NaN NaN 100 C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...\n", "1 I Could Be Happy Altered Images NaN New Wave Hits Of The '80s, Vol. 04 NaN NaN NaN NaN NaN Pop, New Wave, 80s, Alternative ... NaN MPEG audio file NaN NaN 66.0 4/11/2018 5:44 PM 1.0 1/6/2018 8:21 PM 80 C:\\Users\\Lee\\itunes\\Music\\Altered Images\\New W...\n", "2 Dance This Mess Around The B-52's Cindy Wilson/Fred Schneider/Kate Pierson/Keith... B-52's NaN NaN NaN NaN NaN Alternative, Punk, 80s ... NaN MPEG audio file NaN Black1000 151.0 4/10/2018 3:09 PM 6.0 3/7/2018 11:58 AM 80 C:\\Users\\Lee\\itunes\\Music\\The B-52's\\B-52's\\03...\n", "3 Fight For Your Right Beastie Boys NaN The Best 80s Songs NaN . NaN NaN NaN 80s, Pop, Alternative ... NaN MPEG audio file NaN NaN 53.0 4/9/2018 8:18 PM 1.0 1/2/2018 1:37 AM 80 C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...\n", "4 Just What I Needed The Cars Ric Ocasek Complete Greatest Hits NaN NaN NaN NaN NaN Pop, Electronic, 80s ... NaN MPEG audio file NaN NaN 112.0 1/8/2018 12:44 PM 4.0 1/6/2018 10:59 PM 80 C:\\Users\\Lee\\itunes\\Music\\The Cars\\Complete Gr...\n", "\n", "[5 rows x 60 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:44,417 - INFO - merged_df.shape=(72, 60)\n" ] } ], "source": [ "\n", "import functools\n", "merged_df = functools.reduce(lambda left,right: pd.merge(left,right,on=['Artist','Name']), list_of_playlists)\n", "\n", "log.info(\"## merged_df.head()=\")\n", "merged_df.head()\n", "log.info(\"merged_df.shape=%s\" % str(merged_df.shape))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example of outer merge / full outer join\n", "\n", "Finally, we will perform an outer merge using Pandas, also referred to as a “full outer join” or just “outer join”. An outer join can be seen as a combination of left and right joins, or the opposite of an inner join. In outer joins, every row from the left and right dataframes is retained in the result, with NaNs where there are no matched join variables.\n", "\n", "As such, we would expect the results to have the same number of rows as there are distinct values of “use_id” between user_device and user_usage, i.e. every join value from the left dataframe will be in the result along with every value from the right dataframe, and they’ll be linked where possible.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:48,407 - INFO - ## merged_df.head()=\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposer_xAlbum_xGrouping_xWork_xMovement Number_xMovement Count_xMovement Name_xGenre_x...Volume Adjustment_yKind_yEqualizer_yComments_yPlays_yLast Played_ySkips_yLast Skipped_yMy Rating_yLocation_y
0Just The Way You AreBilly JoelBilly JoelGreatest Hits Vol.INaNNaNNaNNaNNaNPop, 80s...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1She's Got A WayBilly JoelBilly JoelGreatest Hits Vol.IINaNNaNNaNNaNNaNPop, 80s...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2ThunderstruckBrian JohnsonNaNThe Very Best of AC/DCNaNNaNNaNNaNNaNHard Rock, 80s...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3Don't Stop Believin'JourneyJonathan Cain/Neal Schon/Steve PerryGreatest Hits (Limited Gold Edition)NaNNaNNaNNaNNaNPop, 80s...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
4Fade To GreyVisageNaNNew Wave Hits Of The '80s, Vol. 3NaNNaNNaNNaNNaNRock, 80s, Alternative...NaNMPEG audio fileNaNNaN62.04/9/2018 7:23 PMNaNNaN100.0C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...
\n", "

5 rows × 60 columns

\n", "
" ], "text/plain": [ " Name Artist Composer_x Album_x Grouping_x Work_x Movement Number_x Movement Count_x Movement Name_x Genre_x ... Volume Adjustment_y Kind_y Equalizer_y Comments_y Plays_y Last Played_y Skips_y Last Skipped_y My Rating_y Location_y\n", "0 Just The Way You Are Billy Joel Billy Joel Greatest Hits Vol.I NaN NaN NaN NaN NaN Pop, 80s ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN\n", "1 She's Got A Way Billy Joel Billy Joel Greatest Hits Vol.II NaN NaN NaN NaN NaN Pop, 80s ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN\n", "2 Thunderstruck Brian Johnson NaN The Very Best of AC/DC NaN NaN NaN NaN NaN Hard Rock, 80s ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN\n", "3 Don't Stop Believin' Journey Jonathan Cain/Neal Schon/Steve Perry Greatest Hits (Limited Gold Edition) NaN NaN NaN NaN NaN Pop, 80s ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN\n", "4 Fade To Grey Visage NaN New Wave Hits Of The '80s, Vol. 3 NaN NaN NaN NaN NaN Rock, 80s, Alternative ... NaN MPEG audio file NaN NaN 62.0 4/9/2018 7:23 PM NaN NaN 100.0 C:\\Users\\Lee\\itunes\\Music\\Visage\\New Wave Hits...\n", "\n", "[5 rows x 60 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:48,447 - INFO - merged_df.shape=(797, 60)\n" ] } ], "source": [ "df_playlist1=list_of_playlists[0]\n", "df_playlist2=list_of_playlists[1]\n", "\n", "merged_df = pd.merge(df_playlist1, \n", " df_playlist2, \n", " on=['Artist','Name'],\n", " how='outer')\n", "\n", "log.info(\"## merged_df.head()=\")\n", "merged_df.head()\n", "log.info(\"merged_df.shape=%s\" % str(merged_df.shape))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We noted in our prior example that the vertically stacked dataframe had a rowcount of 864 and we noted that the inner data set found from the initial merge count is 72 rows.\n", "\n", "In theory the rowcount of the outer join should be the total count minus the intersection - so 864 - 72 = 792.\n", "\n", "However, the count is off by 5. \n", "\n", "Something is off. \n", "\n", "My instincts are that the data sets are not truely unique based on the composite key that we used. This would lead to mismatched counts between the union count and the merged data set count.\n", "\n", "We can perform a simple check to validate if our instincts are correct.\n", "We now find out how many duplicate records exist based on the composite key we used in the first playlist dataset.\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:52,357 - INFO - df_dupes1.shape=(76, 31)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameArtistComposerAlbumGroupingWorkMovement NumberMovement CountMovement NameGenre...Volume AdjustmentKindEqualizerCommentsPlaysLast PlayedSkipsLast SkippedMy RatingLocation
122HalloweenDead KennedysNaNJust Can't Get Enough: New WaNaNNaNNaNNaNNaNAlternative, 80s...NaNMPEG audio fileNaNNaNNaNNaN4.03/19/2018 8:46 AM60C:\\Users\\Lee\\itunes\\Music\\Dead Kennedys\\Just C...
310HalloweenDead KennedysD.H. Peligro/Dead Kennedys/East Bay Ray/Jello ...Milking The Sacred CowNaNNaNNaNNaNNaNPunk Rock, 80s...NaNMPEG audio fileNaNNaNNaNNaN1.01/11/2018 11:59 PM40C:\\Users\\Lee\\itunes\\Music\\Dead Kennedys\\Milkin...
20Enjoy The SilenceDepeche ModeNaNBest Of Depeche Mode, Vol. 1NaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNTrack 440.04/10/2018 7:19 PM1.01/14/2018 10:37 AM80C:\\Users\\Lee\\itunes\\Music\\Depeche Mode\\Best Of...
127Enjoy The SilenceDepeche ModeMartin L. GoreViolatorNaNNaNNaNNaNNaNPop, 80s, Alternative...NaNMPEG audio fileNaNSire / Reprise 9 26081-227.04/9/2018 7:49 PM2.02/23/2018 1:33 PM60C:\\Users\\Lee\\itunes\\Music\\Depeche Mode\\Violato...
29The Final CountdownEuropeNaN101 Sporting AnthemsNaNNaNNaNNaNNaNHard Rock, 80s...NaNMPEG audio fileNaNNaN86.04/29/2016 3:29 PM5.09/28/2015 7:58 PM80C:\\Users\\Lee\\itunes\\Music\\Various Artists\\101 ...
358The Final CountdownEuropeNaNThe Best 80s SongsNaN.NaNNaNNaN80s, Pop...NaNMPEG audio fileNaNNaN3.01/2/2018 5:47 PM2.01/6/2018 9:49 PM40C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...
36Down UnderMen At WorkNaNThe Best 80s SongsNaN.NaNNaNNaN80s, Pop...NaNMPEG audio fileNaNNaN2.01/2/2018 7:24 PM1.01/2/2018 1:44 AM80C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...
37Down UnderMen At WorkColin Hay, Roy StrykertLike, Omigod! The 80s Pop Culture Box (Totally...NaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNNaN49.04/11/2018 6:47 PM3.01/17/2018 6:53 PM80C:\\Users\\Lee\\itunes\\Music\\Men At Work\\Like, Om...
45Cum On Feel The NoizeQuiet RiotNaN101 Sporting AnthemsNaNNaNNaNNaNNaNHard Rock, 80s...NaNMPEG audio fileNaNNaN79.04/29/2016 4:44 PM5.01/8/2018 3:01 PM80C:\\Users\\Lee\\itunes\\Music\\Various Artists\\101 ...
485Cum On Feel The NoizeQuiet RiotNoddy Holder, Jim LeaLike, Omigod! The 80s Pop Culture Box (Totally...NaNNaNNaNNaNNaNPop, 80s...NaNMPEG audio fileNaNNaN3.03/30/2018 10:27 AM2.01/9/2018 6:04 PM40C:\\Users\\Lee\\itunes\\Music\\Compilations\\Like, O...
\n", "

10 rows × 31 columns

\n", "
" ], "text/plain": [ " Name Artist Composer Album Grouping Work Movement Number Movement Count Movement Name Genre ... Volume Adjustment Kind Equalizer Comments Plays Last Played Skips Last Skipped My Rating Location\n", "122 Halloween Dead Kennedys NaN Just Can't Get Enough: New Wa NaN NaN NaN NaN NaN Alternative, 80s ... NaN MPEG audio file NaN NaN NaN NaN 4.0 3/19/2018 8:46 AM 60 C:\\Users\\Lee\\itunes\\Music\\Dead Kennedys\\Just C...\n", "310 Halloween Dead Kennedys D.H. Peligro/Dead Kennedys/East Bay Ray/Jello ... Milking The Sacred Cow NaN NaN NaN NaN NaN Punk Rock, 80s ... NaN MPEG audio file NaN NaN NaN NaN 1.0 1/11/2018 11:59 PM 40 C:\\Users\\Lee\\itunes\\Music\\Dead Kennedys\\Milkin...\n", "20 Enjoy The Silence Depeche Mode NaN Best Of Depeche Mode, Vol. 1 NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN Track 4 40.0 4/10/2018 7:19 PM 1.0 1/14/2018 10:37 AM 80 C:\\Users\\Lee\\itunes\\Music\\Depeche Mode\\Best Of...\n", "127 Enjoy The Silence Depeche Mode Martin L. Gore Violator NaN NaN NaN NaN NaN Pop, 80s, Alternative ... NaN MPEG audio file NaN Sire / Reprise 9 26081-2 27.0 4/9/2018 7:49 PM 2.0 2/23/2018 1:33 PM 60 C:\\Users\\Lee\\itunes\\Music\\Depeche Mode\\Violato...\n", "29 The Final Countdown Europe NaN 101 Sporting Anthems NaN NaN NaN NaN NaN Hard Rock, 80s ... NaN MPEG audio file NaN NaN 86.0 4/29/2016 3:29 PM 5.0 9/28/2015 7:58 PM 80 C:\\Users\\Lee\\itunes\\Music\\Various Artists\\101 ...\n", "358 The Final Countdown Europe NaN The Best 80s Songs NaN . NaN NaN NaN 80s, Pop ... NaN MPEG audio file NaN NaN 3.0 1/2/2018 5:47 PM 2.0 1/6/2018 9:49 PM 40 C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...\n", "36 Down Under Men At Work NaN The Best 80s Songs NaN . NaN NaN NaN 80s, Pop ... NaN MPEG audio file NaN NaN 2.0 1/2/2018 7:24 PM 1.0 1/2/2018 1:44 AM 80 C:\\Users\\Lee\\itunes\\Music\\Compilations\\The Bes...\n", "37 Down Under Men At Work Colin Hay, Roy Strykert Like, Omigod! The 80s Pop Culture Box (Totally... NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN NaN 49.0 4/11/2018 6:47 PM 3.0 1/17/2018 6:53 PM 80 C:\\Users\\Lee\\itunes\\Music\\Men At Work\\Like, Om...\n", "45 Cum On Feel The Noize Quiet Riot NaN 101 Sporting Anthems NaN NaN NaN NaN NaN Hard Rock, 80s ... NaN MPEG audio file NaN NaN 79.0 4/29/2016 4:44 PM 5.0 1/8/2018 3:01 PM 80 C:\\Users\\Lee\\itunes\\Music\\Various Artists\\101 ...\n", "485 Cum On Feel The Noize Quiet Riot Noddy Holder, Jim Lea Like, Omigod! The 80s Pop Culture Box (Totally... NaN NaN NaN NaN NaN Pop, 80s ... NaN MPEG audio file NaN NaN 3.0 3/30/2018 10:27 AM 2.0 1/9/2018 6:04 PM 40 C:\\Users\\Lee\\itunes\\Music\\Compilations\\Like, O...\n", "\n", "[10 rows x 31 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "df_dupes1 = df_playlist1[df_playlist1.duplicated(['Artist','Name'], keep=False)]\n", "log.info(\"df_dupes1.shape=%s\" % str(df_dupes1.shape))\n", "\n", "df_dupes1.sort_values(['Artist','Name']).head(10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There 76 duplicate entries in the set based on the composite key we used.\n", "\n", "Looking at the data, it seems it contains information for the actual digital files stored on disk.\n", "\n", "For our purpose of comparing the logical music tracks between the 2 playlists, we can safely remove that physical file related information to form a new logical representation of the playlist based on a subset of the data.\n", "\n", "For now, we will simply strip out the duplicate records from the playlist based on original composite key and keep the first record in each case.\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:30:56,282 - INFO - df_playlist1_uniq.shape=(634, 31)\n" ] } ], "source": [ "df_playlist1_uniq=df_playlist1.drop_duplicates(subset=['Artist', 'Name'])\n", "\n", "log.info(\"df_playlist1_uniq.shape=%s\" % str(df_playlist1_uniq.shape))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enhance the dataset load to exclude duplicates\n", "\n", "We now update the original load logic to exclude duplicates and rerun the merge to see if the set union set count agrees with the merged dataset count.\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2018-04-24 15:31:58,614 - INFO - ## data/Top80_sPop&Rock.txt: df.shape=(634, 31)\n", "2018-04-24 15:31:58,625 - INFO - ## data/TopAlt&Electronic.txt: df.shape=(185, 31)\n", "2018-04-24 15:31:58,650 - INFO - inner_df.shape=(67, 60)\n", "2018-04-24 15:31:58,654 - INFO - merged_df.shape=(752, 60)\n", "2018-04-24 15:31:58,655 - INFO - Rows in dataset1=634\n", "2018-04-24 15:31:58,657 - INFO - Rows in dataset2=185\n", "2018-04-24 15:31:58,658 - INFO - Rows in intersection=67\n", "2018-04-24 15:31:58,661 - INFO - Rows in union=752\n", "2018-04-24 15:31:58,664 - INFO - (Rows in union==Rows in merged) = True\n" ] } ], "source": [ "\n", "all_files = ['data/Top80_sPop&Rock.txt','data/TopAlt&Electronic.txt']\n", "list_of_playlists= []\n", "for file_ in all_files:\n", " doc = codecs.open(file_,'rU','UTF-16') \n", " df = pd.read_csv(doc,index_col=None, header=0,sep='\\t')\n", " \n", " df.drop_duplicates(subset=['Artist', 'Name'], inplace=True)\n", " log.info(\"## %s: df.shape=%s\" % (file_, str(df.shape)))\n", "\n", " list_of_playlists.append(df)\n", "\n", "inner_df = reduce(lambda left,right: pd.merge(left,right,on=['Artist','Name'], how='inner'), list_of_playlists)\n", "merged_df = reduce(lambda left,right: pd.merge(left,right,on=['Artist','Name'], how='outer'), list_of_playlists)\n", "\n", "log.info(\"inner_df.shape=%s\" % str(inner_df.shape))\n", "log.info(\"merged_df.shape=%s\" % str(merged_df.shape))\n", "\n", "df1_cnt=list_of_playlists[0].shape[0]\n", "df2_cnt=list_of_playlists[1].shape[0]\n", "df_inner_cnt=inner_df.shape[0]\n", "df_union_cnt=df1_cnt+df2_cnt-df_inner_cnt\n", "df_merged_cnt=merged_df.shape[0]\n", "\n", "log.info(\"Rows in dataset1=%s\" % df1_cnt)\n", "log.info(\"Rows in dataset2=%s\" % df2_cnt)\n", "log.info(\"Rows in intersection=%s\" % df_inner_cnt)\n", "log.info(\"Rows in union=%s\" % df_union_cnt)\n", "log.info(\"(Rows in union==Rows in merged) = %s\" % (df_union_cnt==df_merged_cnt))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the data set union count agrees with the merged dataframe count.\n", "\n", "For an example using pandas to load multiple excel datasets/sheets into a single dataframe, see [here](http://leeblog.org/blog/2018/04/24/merging-excel-sheets-using-pandas/).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }