Eric Fields; December 20, 2020; CMSC320; University of Maryland
Apart from being graded for this, I am personally very interested in the results as an amateur songwriter. Songwriting is often formulaic in some respects, particularly in terms of overall structure. In theory a song can have infinitely many structures, even when it is built from just 2 substructures like Verse and Chorus (V-C, C-V, C-C-C, etc.). In practice, only a relatively small proportion of these structures are observed, even with as many as 6 or 7 general structural components to choose from. It is therefore in the best interest of a beginner songwriter to select song structures that are common within their desired genre, while still using several different structures to add variety to their musical repertoire. If you don't take my word for it, do a simple Google search of "importance of song structure." Here is an example of one that may pop up:
Even once a general song structure has been identified, it is important to also work within the norms of that particular genre and structure with respect to how many words and lines each part has. If a song's verse has 100 lines before it reaches a 4 line chorus, it will likely not be successful. It may also be the case that there are differences between having something like 3 or 4 lines in a verse (I'll tell you now that there are, which is a product of music typically liking even numbered things). For that reason, it is also important for beginner songwriters to have an idea of how many lines their songs should have in each component.
For the non-songwriter, the interest in this work may be in the final section, where there is a comparative analysis between the three genres with respect to various statistics describing song structure features, such as number of words/lines or average length of words. You may find that the genres are different in ways that you might expect, but also different in some ways you might not expect (which do you think will have higher mean word length, rock or hip hop?). Building on these differences, the code also provides an example of machine learning to predict structure classifications (pop verse, rock bridge, hip hop chorus, etc.) based on statistics that describe that structure; the classifier has pretty high success for particular structures and very low success for others.
import requests
from bs4 import BeautifulSoup
import os
import os.path
from os import path
import re
import string
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pprint
from scipy.stats import t
from scipy import stats
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn import svm
The first section of the tutorial will acquire Genius song lyrics data from selected artists in different genres. Genius returns the songs in order of popularity on their website, which may be different from how the songs performed on Billboard charts. The number of times a song gets searched on Genius is likely highly correlated with general song popularity, which is likely highly correlated with Billboard rankings along with personal fanbase preferences. Up to 30 songs from each artist are collected, so it is very likely that the artist's big hits will be within the top 30 results on Genius along with some fan favorites. With that being said, Genius' own popularity rankings are likely a fairly good representation of a particular artist's notable repertoire.
The website, where the top songs of any artist can be searched, is linked here: https://genius.com/
The following 4 functions are slightly modified versions of the functions found at the below website that gives an introduction to scraping genius web lyrics. The modifications I made change how they store the songs (i.e. storing song structure and appending titles). These functions will be used to store the song lyrics for selected artists in a local directory for further analysis. https://medium.com/analytics-vidhya/how-to-scrape-song-lyrics-a-gentle-python-tutorial-5b1d4ab351d2
# Get artist object from Genius API
# The API token is read from the GENIUS_API_TOKEN environment variable rather than being hardcoded in the notebook
GENIUS_API_TOKEN = os.environ.get('GENIUS_API_TOKEN', '')
# This function returns a response that is parseable into json for a given artist
def request_artist_info(artist_name, page):
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + GENIUS_API_TOKEN}
    search_url = base_url + '/search?per_page=10&page=' + str(page)
    # The search term is sent as a query-string parameter
    params = {'q': artist_name}
    response = requests.get(search_url, params=params, headers=headers)
    return response
# Get Genius.com song URLs from artist object. It gets the number of songs in song_cap if there are that many
# available. Note this also includes songs where the artist is only featured
def request_song_url(artist_name, song_cap):
    page = 1
    songs = []
    while True:
        response = request_artist_info(artist_name, page)
        json = response.json()
        hits = json['response']['hits']
        # Stop once a page comes back empty, otherwise the loop would never end
        # for an artist with fewer than song_cap songs
        if not hits:
            break
        # Collect the song objects on this page that match the artist
        song_info = []
        for hit in hits:
            if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
                song_info.append(hit)
        # Collect song URLs from song objects, up to song_cap in total
        for song in song_info:
            if (len(songs) < song_cap):
                url = song['result']['url']
                songs.append(url)
        if (len(songs) == song_cap):
            break
        else:
            page += 1
    print('Found {} songs by {}'.format(len(songs), artist_name))
    return songs
# Scrape lyrics from a Genius.com song URL for a particular song
def scrape_song_lyrics(url):
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    # The title line comes from the URL path (artist and song name separated by '-')
    title = '<TITLE>' + url.split('/')[3].replace('-lyrics','\n')
    lyrics = title + html.find('div', class_='lyrics').get_text()
    # Drop empty lines
    lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
    return lyrics
# Writes all of the lyrics found for an artist into a file. The lyrics directory must be premade, but the files do not
def write_lyrics_to_file(artist_name, song_count):
    file_path = 'lyrics/' + artist_name.lower().replace(" ", "") + '.txt'
    urls = request_song_url(artist_name, song_count)
    with open(file_path, 'wb') as f:
        for url in urls:
            # Appends <END> at the end of every song for later parsing
            lyrics = scrape_song_lyrics(url) + '<END>'
            f.write(lyrics.encode("utf8"))
    with open(file_path, 'rb') as f:
        num_lines = sum(1 for line in f)
    # Report the number of songs actually found, which can be fewer than song_count
    print('Wrote {} lines to file from {} songs'.format(num_lines, len(urls)))
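To make the filtering step in request_song_url more concrete without calling the API, here is a minimal sketch that applies the same primary-artist check and song cap to a hand-written stand-in for one page of search results. The mock_page dictionary, its URLs, and the collect_urls helper are made up for illustration; only the response/hits/result/url and primary_artist/name layout is taken from the code above.

```python
# Hypothetical stand-in for one page of Genius search results.
# Only the fields that request_song_url reads are included; the URLs are made up.
mock_page = {
    'response': {
        'hits': [
            {'result': {'url': 'https://genius.com/Drake-example-song-one-lyrics',
                        'primary_artist': {'name': 'Drake'}}},
            {'result': {'url': 'https://genius.com/Other-artist-example-song-lyrics',
                        'primary_artist': {'name': 'Other Artist'}}},
            {'result': {'url': 'https://genius.com/Drake-example-song-two-lyrics',
                        'primary_artist': {'name': 'Drake'}}},
        ]
    }
}

def collect_urls(page_json, artist_name, song_cap, songs=None):
    """Append URLs whose primary artist matches artist_name, up to song_cap."""
    songs = [] if songs is None else songs
    for hit in page_json['response']['hits']:
        primary = hit['result']['primary_artist']['name']
        if artist_name.lower() in primary.lower() and len(songs) < song_cap:
            songs.append(hit['result']['url'])
    return songs

urls = collect_urls(mock_page, 'Drake', song_cap=30)
capped = collect_urls(mock_page, 'Drake', song_cap=1)
print(urls)
print(capped)
```

The check is a case-insensitive substring match, so a primary artist credited jointly with someone else would also pass. That is one reason the later parsing step checks the artist again using the URL.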
Below are the artists that I will be using in my analysis. These artists were subjectively selected by me using a combination of personal preference and perceived popularity. There are 7 artists in each category, and analysis is typically performed combining the male and female categories of the same genre.
artists = {'fem_pop': ['Ariana Grande', 'Katy Perry', 'Lady Gaga', 'Taylor Swift', 'Britney Spears',
                       'Kesha', 'Kelly Clarkson'],
           'male_pop': ['Justin Bieber', 'Shawn Mendes', 'Ed Sheeran', 'One Direction', 'OneRepublic',
                        'Maroon 5', 'Justin Timberlake'],
           'male_hiphop': ['Drake', 'Kanye West', 'Lil Wayne', 'DaBaby', 'J. Cole', 'Kendrick Lamar', 'Eminem'],
           'fem_hiphop': ['Nicki Minaj', 'Doja Cat', 'Azealia Banks', 'cupcakKe', 'Megan Thee Stallion', 'Cardi B',
                          'Iggy Azalea'],
           'male_rock': ['The Beatles', 'Queen', 'Pink Floyd', 'Nirvana', 'Red Hot Chili Peppers',
                         'Pearl Jam', 'Catfish and the Bottlemen']}
The create_path function will be used to retrieve the path name for a given artist's lyrics file. For example, if you want the path of Nicki Minaj, the function will return "lyrics/nickiminaj.txt" which can then be read and worked with as a file.
The piece of code after the function is used to scrape the lyrics for all of the artists in the dictionary of artists made in the cell before. It uses the predefined functions for scraping the Genius URLs and scrapes at most 30 songs per artist from Genius. This entire process can take around 10 minutes when done from scratch, but if there already exists a .txt file corresponding to the artist's name (as in you already have their lyrics), it will not scrape again and will instead print a statement letting you know how many lines are in that file.
def create_path(name):
    return 'lyrics/' + name.lower().replace(" ", "") + '.txt'
for names in artists.values():
    for name in names:
        name_path = create_path(name)
        if path.exists(name_path):
            # The lyrics were already scraped, so only report the size of the existing file
            with open(name_path, encoding='utf8') as f:
                count = len(f.readlines())
            print(name, 'has', count, 'lines in the .txt file')
        else:
            write_lyrics_to_file(name, 30)
This section defines 3 classes, and many methods on these classes, that will be used for storing an artist's songs and each song's general structure. Examples of structure objects are things like "Verse" or "Chorus", and each song is made up of a variety of different substructures, giving each song an overall general structure. An example of a general structure for a song would be something like "Verse Chorus Verse Chorus Outro." The classes have functionality to calculate a number of statistics about a particular structure, such as the number of words, number of lines, number of repeated words/lines, and other metrics. One thing to note is that not all of the methods or stored data are used in this tutorial (such as whether or not a verse is a feature), but they could be used for future analysis.
For more information about specific song structures, visit this genius link: https://genius.com/Genius-song-parts-annotated
The Artist class is made to store an artist's name and their songs. Objects of this class are instantiated using the artist's name and category. It is important that the name is spelled correctly, or the artist's songs will not get added: the parse_text method does not add a song to an Artist object if the artist in question was not the main artist (i.e. was featured). It determines whether the artist was the main artist or a featured artist by matching the first word of the artist's name against the start of the Genius URL path after the .com, which is stored in the first line (and includes the title) of each individual song (songs are delimited by '\<TITLE>' and '\<END>').
class Artist:
    def __init__(self, name, category):
        self.name = name
        self.songs = []
        self.category = category
    # This method will store all of the songs of an artist in a list
    # This method should be called with the raw text that is stored in the files directly from genius scraping
    def parse_text(self, text):
        for song_text in text.split('<TITLE>')[1:]: # the first one is always an empty string
            # This checks to see if the artist was actually just featured on the song, in
            # which case it is excluded from their data
            length = len(self.name)
            title_artist = song_text[0:length].lower()
            # First word of the artist's name. The strip at the end is for J. Cole because
            # the URL did not include the '.'
            first_name = self.name.split(' ')[0].lower().strip('.')
            # The first if statement makes sure the artist did not simply feature on the song, which is determined
            # by the first name on the URL, which is stored in the first line of each song's txt within the file
            if title_artist[0:len(first_name)] == first_name:
                song = Song(self.name)
                song.parse_song(song_text.replace('<END>','\n'))
                # makes sure the song has structural components
                if not song.get_gen_struct() == '':
                    self.songs.append(song)
    def num_songs(self):
        return len(self.songs)
    # This returns all of the lines of every song as a list
    def get_lines(self):
        lines = []
        for song in self.songs:
            for struct in song.structure:
                for line in struct.lines:
                    lines.append(line)
        return lines
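As a rough illustration of the main-artist check in parse_text, the sketch below runs a simplified version of it on a made-up file. The sample_text contents and the helper names first_name_key and songs_by_main_artist are hypothetical; the '<TITLE>' and '<END>' markers and the idea of comparing the first word of the artist's name with the start of the URL-derived title line come from the code above.

```python
# Made-up file contents in the layout the scraping step writes:
# '<TITLE>' + URL-derived title on the first line, lyrics after it, '<END>' at the end.
sample_text = (
    '<TITLE>J-cole-example-song\n[Verse 1]\nline one\nline two\n<END>'
    '<TITLE>Other-artist-example-song\n[Verse 1: J. Cole]\nline one\n<END>'
)

def first_name_key(artist_name):
    """First word of the artist's name, lower-cased, without a leading or trailing '.'."""
    return artist_name.split(' ')[0].lower().strip('.')

def songs_by_main_artist(text, artist_name):
    """Return the title lines of songs whose URL-derived title starts with the artist."""
    key = first_name_key(artist_name)
    kept = []
    for song_text in text.split('<TITLE>')[1:]:   # element 0 is the empty prefix
        title_line = song_text.split('\n')[0].lower()
        if title_line.startswith(key):
            kept.append(title_line)
    return kept

kept_titles = songs_by_main_artist(sample_text, 'J. Cole')
print(kept_titles)
```

The second song is dropped because its title line starts with another artist's name, even though J. Cole appears in one of its verse headers.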
The Song class creates Song objects that store the name of the artist, the title of the song, and the overall structure of the song. The structure is a list of Struct objects, each of which denotes the type of structure and stores all of the lines of that part of the song.
# this pattern will match anything that has 1 or more characters within square brackets
# which includes all of the structural components of the songs like [Chorus] or [Bridge]
struct_regex = re.compile(r'\[(.+)\]')
class Song:
    def __init__(self, artist_name):
        self.artist = artist_name
        self.title = ''
        self.structure = []
    # This method will store all of the song's structure objects in a list. Structure objects include the lyrics for
    # that structure
    # This method should be called with the text of a single song with the title from the URL as the first line
    def parse_song(self, song_text):
        # All structures genius returns are stored in brackets like [Chorus]
        body = re.split(r'\[', song_text)
        # Title includes artist name as well and is separated by '-' instead of spaces (it comes from the URL)
        self.title = body[0].split('\n')[0]
        # raw_struct are the lines beginning with something like Chorus]\nFirst lines\nNext line\nso on
        for raw_struct in body[1:]:
            clean_struct = Struct()
            clean_struct.parse_struct(raw_struct, self.artist)
            if not len(clean_struct.lines) == 0:
                self.structure.append(clean_struct)
    # Does not include the following structures: Other, Outro, Refrain. Only want general structure
    # returns a string separated by spaces
    def get_gen_struct(self):
        type_list = []
        for struct in self.structure:
            s_type = struct.type
            if not (s_type == 'Other' or s_type == 'Refrain' or s_type == 'Outro'):
                type_list.append(struct.type)
        return " ".join(type_list)
    # returns the lines of a song as a list of strings
    def get_lines(self):
        lines = []
        for struct in self.structure:
            for line in struct.lines:
                lines.append(line)
        return lines
    # returns only the lines that are not repeated
    def get_unique_lines(self):
        lines = self.get_lines()
        return list(np.unique(lines))
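The way a song's text turns into a general structure string can be sketched on a tiny made-up song. This is a simplified stand-in for parse_song and get_gen_struct rather than the classes themselves: the song_text contents and the classify helper are hypothetical, while the split on '[' and the omission of "Other", "Refrain", and "Outro" mirror the code above.

```python
# A made-up song in the stored layout: title line first, then bracketed headers.
song_text = ('artist-example-song\n'
             '[Verse 1]\nfirst line\nsecond line\n'
             '[Pre-Chorus]\nthird line\n'
             '[Chorus]\nfourth line\n'
             '[Outro]\nfifth line\n')

known_types = ['Intro', 'Verse', 'Chorus', 'Bridge', 'Pre-Chorus', 'Post-Chorus',
               'Outro', 'Refrain', 'Instrumental', 'Solo', 'Other']

def classify(header):
    """Return the last known type found in the header, or 'Other' if none match."""
    s_type = 'Other'
    for candidate in known_types:
        if candidate in header:
            s_type = candidate   # later entries win, so 'Pre-Chorus' beats 'Chorus'
    return s_type

body = song_text.split('[')
title = body[0].split('\n')[0]
headers = [chunk.split('\n')[0] for chunk in body[1:]]   # e.g. 'Verse 1]'
types = [classify(header) for header in headers]
general_structure = ' '.join(t for t in types if t not in ('Other', 'Refrain', 'Outro'))
print(title)
print(general_structure)
```

Because 'Chorus' is a substring of 'Pre-Chorus' and 'Post-Chorus', the order of the type list matters: the more specific names have to come after 'Chorus' for them to be chosen.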
This class creates Struct objects that store the type of song structure component, such as verse or chorus, and also store whether or not that component was written by a featured artist rather than the original artist of the song. This class additionally stores all of the lines for its corresponding part in the song and has methods for calculating a number of statistics pertaining to the structure.
struct_types = ['Intro','Verse','Chorus','Bridge','Pre-Chorus','Post-Chorus','Outro',
                'Refrain','Instrumental','Solo','Other']
class Struct:
    def __init__(self):
        self.type = 'Other'
        self.feature = False
        self.lines = []
    # This should be called with the text that follows a '[' in a song: the rest of the bracketed header on the
    # first line, followed by the lines of that part of the song
    def parse_struct(self, struct_text, artist_name):
        split_text = struct_text.split('\n')
        # split_text[0] = something like 'Chorus 1: (...)]'
        type_text = split_text[0]
        # This corrects spelling mistake of "Chrous" that was found in several files
        if 'Chrous' in type_text:
            type_text = type_text.replace('Chrous','Chorus')
        # Genius updated its terminology at one point to make "hook" -> "chorus"
        if 'Hook' in type_text:
            type_text = type_text.replace('Hook','Chorus')
        # checks to see if the bracketed structure header contains one of the predefined structure types.
        # The last matching type in struct_types is kept (so 'Pre-Chorus' wins over 'Chorus').
        # If none match, the type is defaulted to "Other"
        for s_type in struct_types:
            if s_type in type_text:
                self.type = s_type
        # Colons in the structure name indicate a feature, but can also indicate a duet. In instances where it is
        # a duet, the structure is not stored as a feature
        if ':' in type_text:
            # will make Verse 1: Ariana Grande & Britney spears -> ' Ariana Grande & ...'
            cut_line = type_text[type_text.find(':')+1:]
            # This determines if the artist's name appears after the colon, which could mean it is a duet or
            # something where the artist is still a main vocalist
            if artist_name in cut_line:
                # False when the text after the colon is just the artist's name (plus the leading space and
                # closing bracket); True when there is additional text, such as other names
                self.feature = len(cut_line) > len(artist_name) + 2
            else:
                self.feature = True
        self.lines = split_text[1:-1] # skips the type and the last new line
    # This returns words cleaned to remove any punctuation and capitalization to be used for determining uniqueness
    def get_clean_words(self):
        words = []
        for line in self.lines:
            for word in line.split():
                clean_word = word.translate(str.maketrans('', '', string.punctuation)).lower()
                words.append(clean_word)
        return words
    # number of lines in the structure
    def get_num_lines(self):
        return len(self.lines)
    # number of words in the structure
    def get_num_words(self):
        word_count = 0
        for line in self.lines:
            word_count += len(line.split())
        return word_count
    # number of repeated words in the structure
    def get_num_repeat_words(self):
        words = self.get_clean_words()
        return len(words) - len(list(np.unique(words)))
    # number of repeated lines in the structure
    def get_num_repeat_lines(self):
        return self.get_num_lines() - len(list(np.unique(self.lines)))
    # returns the average words per line in the structure
    def get_wpl(self):
        return self.get_num_words()/self.get_num_lines()
    # returns the standard deviation of # words across lines
    def get_std_wpl(self):
        words = []
        for line in self.lines:
            words.append(len(line.split()))
        return np.std(words)
    # returns the mean length of words in the structure
    def get_mean_word_len(self):
        words = self.get_clean_words()
        w_lengths = []
        for word in words:
            w_lengths.append(len(word))
        return np.mean(w_lengths)
    # returns the standard deviation of word length across the structure
    def get_std_word_len(self):
        words = self.get_clean_words()
        w_lengths = []
        for word in words:
            w_lengths.append(len(word))
        return np.std(w_lengths)
    # returns a list of statistics describing the structure which is later used for machine learning classification
    def get_full_analysis(self):
        return [self.get_num_words(),
                self.get_num_lines(),
                self.get_num_repeat_words(),
                self.get_num_repeat_lines(),
                self.get_std_wpl(),
                self.get_mean_word_len(),
                self.get_std_word_len()]
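To see what the statistics in get_full_analysis look like on concrete input, here is a small worked sketch on a made-up three-line chorus. It uses only the standard library instead of NumPy, and the toy_lines data and variable names are illustrative. statistics.pstdev is the population standard deviation, which is also what np.std computes by default.

```python
import string
from statistics import mean, pstdev

# A made-up chorus used only for illustration.
toy_lines = ['La la la', 'La la la', 'Sing it loud!']

# Strip punctuation and lower-case each word, as get_clean_words does.
table = str.maketrans('', '', string.punctuation)
clean_words = [w.translate(table).lower() for line in toy_lines for w in line.split()]

num_words = len(clean_words)
num_lines = len(toy_lines)
repeat_words = num_words - len(set(clean_words))   # occurrences beyond the first
repeat_lines = num_lines - len(set(toy_lines))
words_per_line = num_words / num_lines
std_words_per_line = pstdev([len(line.split()) for line in toy_lines])
mean_word_len = mean([len(w) for w in clean_words])

print(num_words, num_lines, repeat_words, repeat_lines)
print(words_per_line, std_words_per_line, mean_word_len)
```

Note that a "repeat" here counts every occurrence after the first, so six copies of "la" contribute five repeated words.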
This section will look into the most common general song structures across the different genres (pop, hip hop, rock). It does this individually for each genre and visualizes both the top structures and a breakdown of the distribution of words and lines within each structure. The exact same analysis is performed for each of the three genres with predefined functions, so for the sake of the tutorial, you may just want to look at the analysis of the genre whose structural components you are interested in and then move to the 'Comparative Analysis' section. The individual analysis would be useful for songwriters who want to see how well their structures fit into the norms within the genre.
For information about why song structure is important for songwriting, see the link below: https://www.masterclass.com/articles/songwriting-101-learn-common-song-structures
A personal recommendation to beginner songwriters would be to make sure you begin writing songs that stick within the most common conventions of the genre of your choice, which are determined below. The link above provides some insight into why choosing a random structure is likely not in your best interest.
In the code, these functions are used to determine common general structures within genres. Note that general structures were previously defined in the class section to omit structures that were stored as "Other", "Outro", or "Refrain".
# this returns a list of all general structures from all artists within the list.
def get_all_structs(all_artists):
    all_structs = []
    for artist in all_artists:
        for song in artist.songs:
            all_structs.append(song.get_gen_struct())
    return all_structs
# Returns a dictionary storing all unique general structures mapped to how many times they appeared within the
# songs of the list of artists passed in
def get_struct_count(all_artists):
    total_structs = get_all_structs(all_artists)
    uniq_structs = np.unique(total_structs)
    struct_count = {}
    for struct in uniq_structs:
        struct_count[struct] = 0
    for struct in total_structs:
        struct_count[struct] += 1
    return struct_count
# Returns a dictionary of the structures that appeared at least threshold times within all of the songs of all artists
# passed in
def get_top_thresh_structs(all_artists,threshold):
    s_counts = get_struct_count(all_artists)
    top_thresh_structs = {}
    for key,val in s_counts.items():
        if val >= threshold:
            top_thresh_structs[key] = val
    return top_thresh_structs
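The count-then-threshold idea behind these functions can also be written with collections.Counter. The sketch below is an alternative illustration and not the code used in this tutorial; the example_structs list is made up.

```python
from collections import Counter

# Made-up general structures, one string per song.
example_structs = [
    'Verse Chorus Verse Chorus',
    'Verse Chorus Verse Chorus',
    'Verse Chorus Verse Chorus Bridge Chorus',
    'Intro Verse Chorus',
    'Verse Chorus Verse Chorus',
]

# Count how often each general structure appears.
struct_counts = Counter(example_structs)

# Keep only the structures that appear at least `threshold` times.
threshold = 2
top_structs = {s: n for s, n in struct_counts.items() if n >= threshold}
print(top_structs)
```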
This function generates a pie chart of all the general structures that appeared at least 'threshold' times across all songs of all artists passed in. The pie chart uses percentages that are relative to each other within the threshold, not to all general structures. This is used to see the top structures which someone can select from for songwriting. The function also prints out the structures and their frequencies.
def create_top_thresh_structs_pie(all_artists, threshold):
    t_structs = get_top_thresh_structs(all_artists, threshold)
    print('The structures that were used at least', threshold,'times were the following:')
    pprint.pprint(t_structs)
    fig1, ax1 = plt.subplots()
    # Convert the dictionary views to lists before handing them to matplotlib
    ax1.pie(list(t_structs.values()), labels=list(t_structs.keys()), autopct='%1.1f%%',
            shadow=True, startangle=90)
    ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
The following functions return dictionaries whose keys are general structure types and whose values are lists of the desired statistic calculated for all target structures in all songs of all artists passed in. These are used later to investigate differences between structural components on average with respect to words per line (wpl) or lines per structure (lps).
# returns dictionary with all target structures and a list of words per line in all of those structures found within
# songs within the artists passed in
def get_wpl(artists, target_structs):
    wpl = {}
    for s_type in target_structs:
        wpl[s_type] = []
    for artist in artists:
        for song in artist.songs:
            for struct in song.structure:
                s_type = struct.type
                if s_type in target_structs:
                    wpl[struct.type].append(struct.get_wpl())
    return wpl
# returns dictionary with all target structures and a list of lines per structure in all of those structures found
# within songs within the artists passed in
def get_lps(artists, target_structs):
    lps = {}
    for s_type in target_structs:
        lps[s_type] = []
    for artist in artists:
        for song in artist.songs:
            for struct in song.structure:
                s_type = struct.type
                if s_type in target_structs:
                    lps[struct.type].append(len(struct.lines))
    return lps
These two functions are used to visualize differences between statistics of structures and to determine whether or not the mean value of those statistics differs between structures in a statistically significant way.
# Takes in dictionary mapping structures to a list of values of a particular statistic and generates a boxplot with
# each structure's distribution shown
def create_boxplots(struct_dict):
    fig, ax = plt.subplots()
    ax.boxplot(list(struct_dict.values()), showfliers = False)
    ax.set_xticklabels(struct_dict.keys())
# This function prints out a description of the results of analysis
# takes in a dictionary's mapping and a list of target_structures to analyze and does statistical mean t-tests for all
# pairwise combinations of the structures in the dictionary. The significance level used for rejecting or failing to
# reject the null hypothesis of identical population means is 0.05
def mean_ttest(struct_dict, target_structs):
    any_fails = False
    for i in range(len(target_structs)):
        for j in range(i+1, len(target_structs)):
            struct1 = target_structs[i]
            struct2 = target_structs[j]
            pval = stats.ttest_ind(struct_dict[struct1],struct_dict[struct2]).pvalue
            if pval >= 0.05:
                print(struct1,'vs',struct2,'pval=',pval,'which is at or above 0.05 ->',
                      'fail to reject the null hypothesis\n')
                any_fails = True
    if any_fails:
        print('For all other combinations, the pvalue was less than 0.05, which is evidence to reject',
              'the null hypothesis, meaning the difference in mean value between them was statistically significant')
    else:
        print('For all combinations, the pvalue was less than 0.05, which means',
              'the difference between means for all categories was statistically significant.')
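For readers curious about what stats.ttest_ind compares, the sketch below computes the pooled-variance two-sample t statistic by hand with the standard library. By default ttest_ind assumes equal population variances, which is the pooled form shown here. The helper name pooled_t_statistic and the two small samples are made up for illustration, and the p-value is left out because it needs the t distribution's CDF.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = variance(sample_a), variance(sample_b)   # sample variances (n - 1)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, dof

# Made-up lines-per-structure samples for two structure types.
chorus_lines = [4, 4, 8, 4]
verse_lines = [8, 12, 16, 12]
t_stat, dof = pooled_t_statistic(chorus_lines, verse_lines)
print(t_stat, dof)
```

A large absolute t statistic relative to the degrees of freedom corresponds to a small p-value, which is the quantity mean_ttest compares against 0.05.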
This code is similar to the create_boxplots function above, but creates a histogram for each individual structure rather than boxplots for all of the structures on a single plot. This is useful for seeing how specific numbers of lines/words per song structure are more common than others (just using a sample mean or median would not tell you this information). This could be very consequential because the sample mean number of lines in pop verses could be 7 even though no songs actually have 7 lines per verse, only 6 or 8.
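def create_histplots(struct_dict):
    for key, val in struct_dict.items():
        # Show values from 1 up to the largest observed value, capped at 18
        upper = int(min(np.max(val), 18))
        # One bin per integer: edges at 0.5, 1.5, ..., upper + 0.5 so each bar is centered on a whole number
        bins = np.arange(0.5, upper + 1.5)
        plt.hist(val, bins = bins)
        plt.xticks(range(1, upper + 1))
        plt.title(key)
        plt.show()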
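The point about the mean hiding the actual distribution can be checked with a few made-up numbers. In this sketch the verse_line_counts list is invented, and a text bar chart stands in for the histogram.

```python
from collections import Counter
from statistics import mean

# Made-up verse lengths: half the verses have 6 lines, half have 8.
verse_line_counts = [6, 8, 6, 8, 6, 8]

average = mean(verse_line_counts)
counts = Counter(verse_line_counts)

print('mean lines per verse:', average)
# One row per possible line count, with one '#' per verse of that length
for n in range(min(verse_line_counts), max(verse_line_counts) + 1):
    print(n, 'lines |', '#' * counts[n])
```

The mean lands on a line count that never occurs in the data, which is why the per-structure histograms are worth looking at alongside the boxplots.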
The following code creates artist objects and parses their lyrics into songs/structures for all previously selected artists
artist_types = {'fem_pop':[],'male_pop':[],'male_hiphop':[],'fem_hiphop':[],'male_rock':[]}
for category in artists.keys():
    names = artists[category]
    for name in names:
        # Named lyrics_path so that it does not shadow the path module imported from os
        lyrics_path = create_path(name)
        # The with statement closes the file automatically
        with open(lyrics_path, 'r', encoding='utf8') as file:
            text = file.read()
        artist = Artist(name,category)
        artist.parse_text(text)
        artist_types[category].append(artist)
This information would be useful to someone who wants to try writing songs in one of the selected genres. The general approach I would recommend to someone interested in this would be to first look at the pie chart to see what type of structure they would like to try using (with no prior experience, you may as well choose the most popular). From there, try writing some lyrics, but make sure they stay within the ranges of lines per structure and words per line shown in the boxplots. To refine the lyrics, take a look at the histograms that show the distribution of the number of lines in each structure and try to choose a common number of lines. For example, pop choruses tend to have either 4 or 8 lines as shown in the histogram. This may mean altering the lyrics to get a more desirable structure, but this is likely a good thing. As some famous painter once said, a true artist should be able to draw grapes so realistic that birds try to eat them before diving off into the land of abstract. I don't know who said that, but I think it's a good thing to keep in mind.
The following code performs analysis on pop artist structures to determine which is the most common using the previously described pie chart function as well as looking into the distribution of number of words/lines in each general structure type across pop songs. See the writing under the previous "Individual Analysis" section for how to go about using this information for songwriting
pop = artist_types['male_pop'] + artist_types['fem_pop']
threshold = 8
create_top_thresh_structs_pie(pop,threshold)
title = str('Structures Used At least '+str(threshold)+' Times in Pop (% freq rel to others within threshold)')
plt.title(title,pad= 25.5)
plt.show()
target_structs = ['Intro','Chorus','Verse','Bridge','Pre-Chorus','Post-Chorus']
# words per line
pop_wpl = get_wpl(pop,target_structs)
# lines per structure
pop_lps = get_lps(pop,target_structs)
create_boxplots(pop_wpl)
plt.title('Distribution of Words Per Line by Structure in Pop Songs')
plt.show()
mean_ttest(pop_wpl, target_structs)
create_boxplots(pop_lps)
plt.title('Distribution of Lines Per Structure in Pop Songs')
plt.show()
mean_ttest(pop_lps, target_structs)
The graphs below show that pop choruses, verses, and bridges commonly have 4 or 8 lines, while the other structures (except for intro) most commonly have 4 lines. For initial songwriting, it would be advised to choose line counts from these numbers.
create_histplots(pop_lps)
The following code performs analysis on hip hop artist structures to determine which is the most common using the previously described pie chart function, and looks into the distribution of the number of words/lines in each general structure type across hip hop songs. See the writing under the previous "Individual Analysis" section for how to go about using this information for songwriting.
hiphop = artist_types['fem_hiphop'] + artist_types['male_hiphop']
threshold = 8
create_top_thresh_structs_pie(hiphop,threshold)
title = str('Structures Used At least '+str(threshold)+' Times in Hip Hop (% rel to others in threshold)')
plt.title(title,pad= 25.5)
plt.show()
target_structs = ['Intro','Chorus','Verse','Bridge','Pre-Chorus','Post-Chorus']
# words per line
hiphop_wpl = get_wpl(hiphop,target_structs)
# lines per structure
hiphop_lps = get_lps(hiphop,target_structs)
create_boxplots(hiphop_wpl)
plt.title('Distribution of Words Per Line by Structure in Hip Hop Songs')
plt.show()
mean_ttest(hiphop_wpl, target_structs)
create_boxplots(hiphop_lps)
plt.title('Distribution of Lines Per Structure in Hip Hop Songs')
plt.show()
mean_ttest(hiphop_lps, target_structs)
The graphs below show that hip hop choruses, bridges, pre-choruses, and post-choruses typically have 4 or 8 lines, while hip hop verses typically have 8, 12, or 16 lines.
For initial songwriting, it would be advised to choose line counts from these numbers.
create_histplots(hiphop_lps)
The following code performs analysis on male rock artist structures to determine which is the most common using the previously described pie chart function, and looks into the distribution of the number of words/lines in each general structure type across rock songs. See the writing under the previous "Individual Analysis" section for how to go about using this information for songwriting.
male_rock = artist_types['male_rock']
threshold = 5
create_top_thresh_structs_pie(male_rock,threshold)
title = str('Structures Used At least '+str(threshold)+' Times in Male Rock (% rel to others in threshold)')
plt.title(title,pad= 25.5)
plt.show()
target_structs = ['Intro','Chorus','Verse','Bridge','Pre-Chorus','Post-Chorus']
# words per line
male_rock_wpl = get_wpl(male_rock,target_structs)
# lines per structure
male_rock_lps = get_lps(male_rock,target_structs)
create_boxplots(male_rock_wpl)
plt.title('Distribution of Words Per Line by Structure in Male Rock Songs')
plt.show()
mean_ttest(male_rock_wpl, target_structs)
create_boxplots(male_rock_lps)
plt.title('Distribution of Lines Per Structure in Male Rock Songs')
plt.show()
mean_ttest(male_rock_lps, target_structs)
The graphs below show that most rock structures have 4 lines in them (less commonly 8 in a verse). For initial songwriting, it would be advised to choose line counts from these numbers.
create_histplots(male_rock_lps)
The following code performs analysis to see how different structures in the different genres compare to each other. This information will likely not be particularly useful for songwriting, because it is a better idea to model a structure on an individual genre than on a combination of different genres. As such, this information is more for understanding differences between genres with respect to lines and words in different structures.
This pie chart uses all artists in every genre and plots a pie chart of structures used at least 20 times. It seems that certain structures are conserved across genres to some extent
all_artists = []
for key in artist_types.keys():
all_artists = all_artists + artist_types[key]
threshold = 20
create_top_thresh_structs_pie(all_artists,threshold)
title = str('Structures Used At least '+str(threshold)+' Times in Any Genre (% freq rel to others within threshold)')
plt.title(title,pad= 25.5)
plt.show()
The following computes a dataframe that stores all of the word and line statistics that the Struct methods calculate. This dataframe will be used to visualize how the different genres compare with respect to statistics pertaining to words and lines.
target_structs = ['Intro','Chorus','Verse','Bridge','Pre-Chorus','Post-Chorus']
struct_data = []
for artist in all_artists:
    for song in artist.songs:
        for struct in song.structure:
            if struct.type in target_structs:
                # makes classification something like hiphop_Verse
                s_classification = artist.category[artist.category.find('_')+1:] + '_' + struct.type
                struct_data.append(struct.get_full_analysis() + [s_classification])
struct_df = pd.DataFrame(struct_data, columns = ['words','lines','repeat words','repeat lines','std(words per line)',
                                                 'mean(word length)','std(word length)','classification'])
struct_df
The following code splits the dataframe into a verse dataframe and a chorus dataframe, and then builds a single dataframe holding the mean value of every statistic for each classification.
verse_structs = ['hiphop_Verse', 'rock_Verse', 'pop_Verse']
chorus_structs = ['hiphop_Chorus', 'pop_Chorus', 'rock_Chorus']
# df that only stores verse structure information
verse_df = struct_df.loc[struct_df['classification'].isin(verse_structs)]
# df that only stores chorus structure information
chorus_df = struct_df.loc[struct_df['classification'].isin(chorus_structs)]
# mean of every statistic for each genre/structure classification
avg_grouped_df = struct_df.groupby('classification').mean().reset_index()
avg_verse_df = avg_grouped_df[avg_grouped_df['classification'].isin(verse_structs)]
avg_chorus_df = avg_grouped_df[avg_grouped_df['classification'].isin(chorus_structs)]
avg_grouped_df
The following plots analyze different statistics about the words and lines in songs across the genres, with labels showing where the different structures within a category appear on the chart. They can be used qualitatively to identify differences between genres. These differences are then analyzed quantitatively using t-tests with a null hypothesis of identical population means.
fig, ax = plt.subplots()
grouped = verse_df[['words','lines','classification']].groupby('classification')
colors = {'hiphop_Verse' : 'red','pop_Verse' : 'blue','rock_Verse' : 'green'}
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='words', y='lines', label=key, color=colors[key])
plt.legend()
plt.title('Lines vs Words for All Verse Types')
plt.show()
The above plot shows that there are significant differences in lines and words between hip hop verses and pop/rock verses. The graph below quantifies this, showing that on average hip hop verses have about 160 words, whereas pop and rock verses have about 50 and 45 respectively. The analysis also concludes that the differences between all three genres are statistically significant, suggesting that the ordering for average number of words in a verse is hip hop >> pop > rock.
values_dict = {'hiphop_Verse':[],'pop_Verse':[],'rock_Verse':[]}
for key in values_dict.keys():
    values_dict[key] = np.array(verse_df.loc[verse_df['classification'] == key]['words'])
verse_means = [np.mean(values_dict['hiphop_Verse']), np.mean(values_dict['pop_Verse']),
               np.mean(values_dict['rock_Verse'])]
plt.bar(values_dict.keys(), verse_means, color = list(colors.values()))
plt.title('Mean Verse Number of Words')
plt.show()
mean_ttest(values_dict, list(values_dict.keys()))
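To make the pairwise comparison concrete, here is a minimal sketch of how t-tests between every pair of groups could be run with `scipy.stats.ttest_ind`. It is not the notebook's `mean_ttest` helper; `pairwise_welch_ttests`, the choice of Welch's unequal-variance test, and the synthetic `demo` data are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_welch_ttests(samples, alpha=0.05):
    """Run Welch's t-test on every pair of groups in `samples`.

    `samples` maps a group label to a 1-D array of observations.
    Returns a list of (label_a, label_b, t_statistic, p_value, significant).
    Hypothetical helper, not the notebook's mean_ttest.
    """
    results = []
    for a, b in combinations(samples, 2):
        # equal_var=False selects Welch's test, which does not assume equal variances
        t_stat, p_value = stats.ttest_ind(samples[a], samples[b], equal_var=False)
        results.append((a, b, float(t_stat), float(p_value), bool(p_value < alpha)))
    return results


# Synthetic word counts, for illustration only (not the notebook's data)
rng = np.random.default_rng(0)
demo = {
    'hiphop_Verse': rng.normal(160, 40, size=200),
    'pop_Verse': rng.normal(50, 15, size=200),
    'rock_Verse': rng.normal(45, 15, size=200),
}
for a, b, t_stat, p_value, significant in pairwise_welch_ttests(demo):
    print(f'{a} vs {b}: t = {t_stat:.2f}, p = {p_value:.3g}, significant = {significant}')
```

Because three comparisons are made at once, a stricter threshold (for example a Bonferroni-corrected `alpha / 3`) may be worth considering before calling a small difference significant.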
fig, ax = plt.subplots()
grouped = verse_df[['mean(word length)','std(word length)','classification']].groupby('classification')
colors = {'hiphop_Verse' : 'red','pop_Verse' : 'blue','rock_Verse' : 'green'}
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='mean(word length)', y='std(word length)', label=key, color=colors[key])
plt.legend()
plt.title('Std Word Length vs Mean Word Length for All Verse Types')
plt.show()
The above plot qualitatively indicates that verses in the three genres are relatively similar with respect to word length, with perhaps a few rock songs having a higher mean word length and a few rap songs having a higher standard deviation of word length (very few in both cases). The graphs below suggest that the only statistically significant difference in word length is that rock verses have longer words on average, but the difference between the averages is <0.1, so this distinction is not very meaningful in practice.
values_dict = {'hiphop_Verse':[],'pop_Verse':[],'rock_Verse':[]}
for key in values_dict.keys():
    values_dict[key] = np.array(verse_df.loc[verse_df['classification'] == key]['mean(word length)'])
verse_means = [np.mean(values_dict['hiphop_Verse']), np.mean(values_dict['pop_Verse']),
               np.mean(values_dict['rock_Verse'])]
plt.bar(values_dict.keys(), verse_means,color = list(colors.values()))
plt.title('Mean Verse Word Length')
plt.show()
mean_ttest(values_dict, list(values_dict.keys()))
fig, ax = plt.subplots()
grouped = chorus_df[['words','lines','classification']].groupby('classification')
colors = {'hiphop_Chorus' : 'red','pop_Chorus' : 'blue','rock_Chorus' : 'green'}
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='words', y='lines', label=key, color=colors[key])
plt.legend()
plt.title('Lines vs Words for All Chorus Types')
plt.show()
The above plot suggests that there are notable differences in lines and words between hip hop, pop, and rock choruses, with hip hop and pop choruses having on average more lines/words than rock choruses. This is similar to the finding in the individual analysis, where rock choruses mostly had only 4 lines whereas pop and hip hop choruses typically had 4 or 8. The graph below quantifies this, showing that the average hip hop chorus has twice as many words as the average rock chorus, and the average pop chorus has a little less than twice as many. It also shows that hip hop choruses on average have more words than pop choruses. All of these differences were statistically significant at a significance level of 0.05.
values_dict = {'hiphop_Chorus':[],'pop_Chorus':[],'rock_Chorus':[]}
for key in values_dict.keys():
    values_dict[key] = np.array(chorus_df.loc[chorus_df['classification'] == key]['words'])
chorus_means = [np.mean(values_dict['hiphop_Chorus']), np.mean(values_dict['pop_Chorus']),
                np.mean(values_dict['rock_Chorus'])]
plt.bar(values_dict.keys(), chorus_means, color = list(colors.values()))
plt.title('Mean Chorus Number of Words')
plt.show()
mean_ttest(values_dict, list(values_dict.keys()))
fig, ax = plt.subplots()
grouped = chorus_df[['mean(word length)','std(word length)','classification']].groupby('classification')
colors = {'hiphop_Chorus' : 'red','pop_Chorus' : 'blue','rock_Chorus' : 'green'}
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='mean(word length)', y='std(word length)', label=key, color=colors[key])
# plt.xticks(np.arange(min(train_mus),max(train_mus)+1,1))
plt.legend()
plt.title('Std Word Length vs Mean Word Length for All Chorus Types')
plt.show()
The above plot qualitatively indicates that choruses in the three genres are relatively similar with respect to word length, with perhaps a few rock songs having a higher mean word length and a few rap songs having a higher standard deviation of word length (very few in both cases). The graphs below suggest that the only statistically significant difference in word length is that rock has longer words on average, but the difference between the averages is <0.1, so this distinction is not very meaningful in practice.
values_dict = {'hiphop_Chorus':[],'pop_Chorus':[],'rock_Chorus':[]}
for key in values_dict.keys():
    values_dict[key] = np.array(chorus_df.loc[chorus_df['classification'] == key]['mean(word length)'])
chorus_means = [np.mean(values_dict['hiphop_Chorus']), np.mean(values_dict['pop_Chorus']),
                np.mean(values_dict['rock_Chorus'])]
plt.bar(values_dict.keys(), chorus_means, color = list(colors.values()))
plt.title('Mean Chorus Word Length')
plt.show()
mean_ttest(values_dict, list(values_dict.keys()))
The following code uses an SVM from the sklearn package to classify the different structures based on the features in the dataframe that stores statistics for all general structures (shown again below). The classifications are mapped to numbers for the SVM, and 5-fold cross validation is performed to investigate performance. Following cross validation, the success rate at predicting each structure type is visualized.
struct_df
The following code encodes the classifications as ints and prints out the encoding dictionary. It also stores the data without the classification column as variable X and the corresponding classifications as variable y.
encode_dict = {}
for i, classification in enumerate(np.unique(struct_df['classification'])):
    encode_dict[classification] = i+1
decode_dict = {val: key for key, val in encode_dict.items()}
pprint.pprint(encode_dict)
X = struct_df.drop('classification', axis=1).to_numpy()
y = struct_df['classification'].apply(lambda c: encode_dict[c])
X.shape
The following code uses an sklearn SVM with a linear kernel and the regularization hyperparameter, C, set to 1. It performs 5-fold cross validation and shuffles the data so that no fold consists of, for example, all hip hop structures. It also keeps track of the accuracy score for each validation iteration, along with which structures were correctly predicted and which were not.
svm_clf = svm.SVC(kernel='linear', C=1)
# prepare cross validation
kfold = KFold(5, shuffle=True)
# stores the overall accuracy score for each validation
scores = []
# stores, per structure type, whether each test sample was correctly classified (0 = no, 1 = yes)
struct_analysis = {key: [] for key in decode_dict.keys()}
# kfold.split yields the train and test indices for each of the 5 folds
for train_index, test_index in kfold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # .iloc selects by position, matching the positional indices that kfold.split returns
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # svm prediction
    svm_clf.fit(X_train, y_train)
    svm_y_pred = svm_clf.predict(X_test)
    expected = np.array(y_test)
    predicted = np.array(svm_y_pred)
    # This appends 1 if the expected structure was predicted, 0 if it was not
    for i in range(len(expected)):
        if expected[i] == predicted[i]:
            struct_analysis[expected[i]].append(1)
        else:
            struct_analysis[expected[i]].append(0)
    # appends accuracy score
    scores.append(accuracy_score(y_test, svm_y_pred))
plt.boxplot(scores)
plt.title('Accuracy score of 5-fold Cross Validation with SVM')
plt.show()
print('For the SVM classifier, the mean accuracy score was:', np.mean(scores),
      'and the standard error of the mean was:', stats.sem(scores))
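As a point of comparison, the sketch below shows one possible variation on the cross validation above, not the setup used in this notebook. It assumes two changes: `StratifiedKFold` keeps the class proportions roughly equal in every fold, and a `StandardScaler` puts features with very different ranges (word counts versus mean word length) on a common scale before the linear SVM sees them. `X_demo` and `y_demo` are synthetic stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y: 3 classes, 2 features on very different scales
rng = np.random.default_rng(0)
n_per_class = 60
X_demo = np.vstack([
    np.column_stack([rng.normal(160, 30, n_per_class), rng.normal(4.0, 0.2, n_per_class)]),
    np.column_stack([rng.normal(50, 15, n_per_class), rng.normal(4.1, 0.2, n_per_class)]),
    np.column_stack([rng.normal(45, 15, n_per_class), rng.normal(4.6, 0.2, n_per_class)]),
])
y_demo = np.repeat([1, 2, 3], n_per_class)

# The scaler is fit inside each training fold, so no test-fold information leaks in
model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
print('fold accuracies:', np.round(demo_scores, 3))
print('mean accuracy:', demo_scores.mean())
```

Fixing `random_state` makes the fold assignment repeatable, which helps when comparing accuracy across runs.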
The SVM did an okay job at classifying the different structures, but one could imagine that structures that are more similar to each other will be harder for it to predict correctly. The following code visualizes the success rate the SVM had at predicting each structure for each genre.
# converts each list of 0/1 outcomes into a success rate
for key in struct_analysis.keys():
    results = struct_analysis[key]
    struct_analysis[key] = np.sum(results)/len(results)
struct_analysis = {decode_dict[key]: val for key, val in struct_analysis.items()}
pop = {}
hiphop = {}
rock = {}
for key in struct_analysis.keys():
    # strips the genre prefix, e.g. hiphop_Verse -> Verse
    struct_type = key[key.find('_')+1:]
    if 'hiphop' in key:
        hiphop[struct_type] = struct_analysis[key]
    elif 'pop' in key:
        pop[struct_type] = struct_analysis[key]
    elif 'rock' in key:
        rock[struct_type] = struct_analysis[key]
# plots all of the data in 3 separate bar graphs with the same y-axis for better comparisons
# (np.arange excludes its stop value, so 1.1 is used to include the 1.0 tick)
plt.bar(pop.keys(), pop.values())
plt.title('SVM Classifier Pop Success Rate Cross Validation')
plt.yticks(np.arange(0, 1.1, 0.1))
plt.show()
plt.bar(hiphop.keys(), hiphop.values())
plt.title('SVM Classifier Hip Hop Success Rate Cross Validation')
plt.yticks(np.arange(0, 1.1, 0.1))
plt.show()
plt.bar(rock.keys(), rock.values())
plt.title('SVM Classifier Rock Success Rate Cross Validation')
plt.yticks(np.arange(0, 1.1, 0.1))
plt.show()
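The per-structure success rate computed above is the fraction of each true class that was predicted correctly, which is the same quantity as per-class recall. As a minimal sketch on toy labels (not the notebook's predictions), sklearn's `recall_score` gives that number directly, and `confusion_matrix` additionally shows which class a misclassified structure was mistaken for.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels for illustration only
y_true = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred = np.array([1, 1, 2, 2, 2, 1, 3, 3, 2])
labels = [1, 2, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
# average=None returns one recall value per class instead of a single average
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
print(cm)
print(dict(zip(labels, per_class_recall)))
```

Reading across a row of the matrix shows where a poorly classified structure ends up, for example whether bridges are mostly being labeled as verses or as choruses.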
The bar graphs above suggest that the SVM does not classify rock structures well, which is likely affected by the lower number of rock artists in the sample (I did not know as many rock artists, and I did not choose artists I did not know). Even with hip hop and pop, however, the SVM was not able to distinguish bridges and post-choruses. This is not that surprising, because these components are structurally similar (from personal experience) to other structural components and differ mainly in where they are placed within the song, which is not one of the sample features.
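Since placement within the song is not among the features, one way it could be captured is as a relative position between 0 and 1. The following is only a sketch of that idea and was not part of the analysis above; `relative_positions` and the example `layout` are hypothetical.

```python
def relative_positions(structure_types):
    """Map each structure in a song to its relative position in [0, 1].

    0.0 is the first structure in the song and 1.0 is the last.
    A song with a single structure gets position 0.0.
    Hypothetical helper, not part of the notebook's structure classes.
    """
    n = len(structure_types)
    if n <= 1:
        return [0.0] * n
    return [i / (n - 1) for i in range(n)]


# Hypothetical song layout, for illustration only
layout = ['Intro', 'Verse', 'Chorus', 'Verse', 'Chorus', 'Bridge', 'Chorus']
for struct_type, position in zip(layout, relative_positions(layout)):
    print(f'{struct_type:<8} {position:.2f}')
```

A value like this could be appended as an extra column next to the word and line statistics, though whether it would actually improve bridge or post-chorus classification would need to be tested.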
The code was able to successfully identify the most popular song structures in the different genres as well as investigate the line and word composition of those structures. This information will be useful to songwriters who seek to structure their songs according to genre norms, which is recommended for beginning songwriters.
The analysis also found noticeable structural differences across genres, with hip hop verses and choruses having more words on average. There were other, smaller statistically significant differences, but the average word length was shown to be fairly similar across genres. I was personally expecting hip hop to have the highest average word length, but rock actually had a statistically significantly higher average word length than hip hop and pop, even though the difference was very small.
Finally, the code was able to predict the structural component and genre of input structures based on a variety of statistics describing each structure. Some structures were classified better than others, with hip hop and pop structures being classified the most successfully.