Statistics – Matt Randall

Methods for Missing Data

Matt Randall — Fri, 21 Feb 2020 22:37:21 +0000

When collecting data, it is common for some entries to be absent. This can have a significant impact on any attempt to gain useful information from the data, so methods have been developed to make it possible to draw useful insights from data of this kind. The simplest approach is to discard any record which contains a missing entry; however, this can leave a sample so small that it is not useful for obtaining reliable information. In addition, there may be reasons why certain groups of people do not want to supply certain information, so this approach can result in those groups being ignored.

In order to use data with absent entries, the absent entries are often filled in; this is called imputing them. A variety of methods can be used for this, some of which are explained below.

Unconditional mean imputation: The simplest method is to take the average value of a variable which has missing entries, and use this as the value for all those which are missing. Whilst convenient, this can lead to distortions in the data, for example by making the variable appear less spread out than it really is.
Conditional mean imputation: Unconditional mean imputation can be improved upon by identifying a variable which seems to have a connection with the one with missing values and grouping the records according to this variable. The average value within each group for the variable with missing values is then calculated and used to fill in the missing values in their respective group. Distortions in the data are still present here.
Regression imputation: This method involves identifying a variable which has a connection to the one with missing values, and effectively plotting them and calculating a line of best fit for their relationship. This line is then used to predict missing values. Distortions in the data are still present here.
Stochastic regression imputation: This method involves performing regression imputation as mentioned above, but moving every imputed value by a random amount. This is intended to reflect the randomness in the data and prevent the previously mentioned distortion.
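
The four methods above can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions of my own: missing entries are marked with `None`, there is a single fully observed predictor `x`, and the random amount added in the stochastic version is drawn from a normal distribution whose spread matches the residuals of the fitted line. The function names are made up for this example and do not come from any particular library.

```python
import random
import statistics


def mean_impute(values):
    """Unconditional mean imputation: fill every gap with the overall mean."""
    fill = statistics.mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def conditional_mean_impute(values, groups):
    """Conditional mean imputation: fill each gap with the mean of its group.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    means = {g: statistics.mean(vs) for g, vs in observed.items()}
    return [means[g] if v is None else v for v, g in zip(values, groups)]


def fit_line(x, y):
    """Least-squares line through the observed (x, y) pairs.

    Returns intercept, slope and residual standard deviation.
    Needs at least three observed pairs.
    """
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    x_bar = statistics.mean([a for a, _ in pairs])
    y_bar = statistics.mean([b for _, b in pairs])
    sxx = sum((a - x_bar) ** 2 for a, _ in pairs)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in pairs)
    resid_sd = (rss / (len(pairs) - 2)) ** 0.5
    return intercept, slope, resid_sd


def regression_impute(x, y, rng=None):
    """Regression imputation; pass an rng to get the stochastic version."""
    intercept, slope, resid_sd = fit_line(x, y)
    completed = []
    for a, b in zip(x, y):
        if b is None:
            b = intercept + slope * a          # point on the fitted line
            if rng is not None:
                b += rng.gauss(0.0, resid_sd)  # random shift (assumed normal)
        completed.append(b)
    return completed


if __name__ == "__main__":
    # Toy data invented for illustration only.
    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 3.9, None, 8.2, None, 11.8]
    groups = ["a", "a", "a", "b", "b", "b"]
    print(mean_impute(y))
    print(conditional_mean_impute(y, groups))
    print(regression_impute(x, y))
    print(regression_impute(x, y, rng=random.Random(1)))
```

With the first three methods every imputed value sits exactly on a mean or on the fitted line, which is one way of seeing where the distortion comes from; the stochastic version scatters the imputed values around the line instead.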

In order to reflect that there is some uncertainty in imputation, when using a method with some randomness to it (such as stochastic regression imputation) it can be useful to perform the imputation multiple times to obtain multiple completed data sets. These data sets are then studied separately, and the averages of their results are found, together with an estimate of how uncertain these averages are. This is called multiple imputation and can be useful because it gives an idea of how accurate the method used is.
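
As a rough sketch of this idea, the code below imputes the same toy variable several times with stochastic regression imputation, estimates the mean from each completed data set, and then combines the results. The combination step uses the usual pooling rule for multiple imputation (average the estimates, and add the between-imputation variance to the average within-imputation variance), which is my choice here and is not spelled out in the text above. It is also a simplification: a full treatment would redraw the regression line itself for each imputation. All names are hypothetical.

```python
import random
import statistics


def stochastic_regression_impute(x, y, rng):
    """Fill gaps in y (marked None) from a fitted line plus normal noise."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    x_bar = statistics.mean([a for a, _ in pairs])
    y_bar = statistics.mean([b for _, b in pairs])
    sxx = sum((a - x_bar) ** 2 for a, _ in pairs)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in pairs)
    resid_sd = (rss / (len(pairs) - 2)) ** 0.5
    return [
        intercept + slope * a + rng.gauss(0.0, resid_sd) if b is None else b
        for a, b in zip(x, y)
    ]


def multiple_imputation_mean(x, y, m=20, seed=0):
    """Estimate the mean of y from m imputed data sets (m must be >= 2).

    Returns the pooled estimate and its pooled standard error.
    """
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = stochastic_regression_impute(x, y, rng)
        estimates.append(statistics.mean(completed))
        # Squared standard error of the mean within this completed data set.
        variances.append(statistics.variance(completed) / len(completed))
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5


if __name__ == "__main__":
    # Toy data invented for illustration only.
    x = list(range(1, 11))
    y = [2.1, 3.9, None, 8.2, None, 11.8, 14.1, None, 18.2, 19.9]
    estimate, std_error = multiple_imputation_mean(x, y)
    print(estimate, std_error)
```

The between-imputation term is what carries the extra uncertainty due to the missing values: if every imputed data set gave the same answer it would be zero.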

When studying data, the selection of which variables to study is important. There are well established methods for this; however, with missing data things are not quite as straightforward. Methods for dealing with this range from simply performing the standard method on the imputed data to altering the chances of variables being selected based on how many of their values are missing.

Overall this is a wide area with a range of methods associated with it, only a few of which have been mentioned here. It is important to keep researching in this area in order to make collected data as useful as possible.

Statistics and Music: Studying Beethoven

Matt Randall — Tue, 04 Feb 2020 13:10:28 +0000

While I was on the internet earlier today, I stumbled across a short but interesting video involving a novel use of statistics. Here, statistical methods have been used to determine what makes Beethoven’s string quartets sound distinctively like his compositions. I thought I would share it on here; you may view the video below.

SlipKnot Setlists: Predictions vs Results

Matt Randall — Mon, 20 Jan 2020 14:05:12 +0000

Recently I took a brief look at a set of data containing information about the songs played by SlipKnot at their past performances, and used this to predict the set they would play when I went to see them on the 16th of January, which can be found here. The actual set played can be seen to the left below, while the figure to the right below shows the likelihood of each of the twenty most likely songs being performed, with the predicted set list being the first seventeen songs in this figure (“All Out Life” to “The Blister Exists”).

Set Played:

Unsainted
Disasterpiece
Eeyore
Nero Forte
Before I Forget
New Abortion
Psychosocial
Solway Firth
Vermilion
Birth of the Cruel
Wait and Bleed
Eyeless
All Out Life
Duality
(sic)
People=Shit
Surfacing

Of the seventeen songs predicted, twelve were performed at the concert. The songs performed which were not predicted include “Birth of the Cruel” and “Nero Forte”, which, as mentioned in the previous post, were likely to be played despite having only been performed once before, due to being new additions to the band’s live shows. In addition to this, “Vermilion” was performed, which is a fairly frequently played song that my analysis predicted to be the 22nd most likely and which is the 16th most played song overall. Other songs performed which did not appear likely to be played based on the data were “Eeyore”, a hidden track from the self-titled debut, and “New Abortion” from “Iowa”, both of which are songs noted for their heaviness and intensity.

Of the songs predicted which were not performed, two were from the fifth album “.5: The Gray Chapter” (“Custer” and “The Devil In I”), which is notable as there was no representation from this album at all in the performance. Other absent songs which seemed likely were “The Blister Exists”, one of my personal favourites from “Vol. 3: (The Subliminal Verses)” and “The Heretic Anthem”, a fan favourite from the second album. The biggest surprise absence however was “Spit It Out”, a song from the debut album which is considered by many to be an essential part of the band’s live show, see (warning, explicit language).
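
The comparison in the two paragraphs above amounts to a simple set calculation, sketched below. The played set is copied from the list above. The full predicted list is not reproduced in this post, so here it is reconstructed from the text: the played set, minus the five songs named as performed but not predicted, plus the five songs named as predicted but not performed. Variable names are my own.

```python
# Set played on the night, as listed above.
played = [
    "Unsainted", "Disasterpiece", "Eeyore", "Nero Forte", "Before I Forget",
    "New Abortion", "Psychosocial", "Solway Firth", "Vermilion",
    "Birth of the Cruel", "Wait and Bleed", "Eyeless", "All Out Life",
    "Duality", "(sic)", "People=Shit", "Surfacing",
]

# Songs named in the text as performed but not predicted.
played_not_predicted = {
    "Birth of the Cruel", "Nero Forte", "Vermilion", "Eeyore", "New Abortion",
}

# Songs named in the text as predicted but not performed.
predicted_not_played = {
    "Custer", "The Devil In I", "The Blister Exists",
    "The Heretic Anthem", "Spit It Out",
}

# Reconstructed prediction (an assumption based on the text, see above).
predicted = (set(played) - played_not_predicted) | predicted_not_played

hits = predicted & set(played)
hit_rate = len(hits) / len(predicted)

print(len(predicted), len(hits))   # 17 predicted songs, 12 of them played
print(round(hit_rate, 2))          # 12 / 17, roughly 0.71
```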


Looking for the most frequently played songs seems to be a relatively accurate way of predicting which songs will be played, and may be appropriate at the beginning of a tour which is not in support of a new album. It is worth noting however that the setlist performed was identical to the one performed at the previous show in Dublin, so predicting an identical set to the most recent gig may be the most accurate method for predicting songs for a concert in a tour which has already commenced.


SlipKnot Setlists

Matt Randall — Thu, 16 Jan 2020 11:53:45 +0000

Today, I get the opportunity to go to Manchester to attend a concert. This is something I am very excited about, and as it is all I can think about, I decided to write a blog post about it. The website keeps up-to-date information on setlists played by bands at gigs; this was used to find information on how many times each song has been performed by SlipKnot, with data having been retrieved on the 15th of January. For the purpose of this study, instrumental solo spots have been removed, and non-album singles have been attributed to the album whose release date is closest to their own. Songs from demos released before the band were signed to a record label were also removed, as data from this era was not well recorded and has been considered incomplete.

From looking at the data, it can easily be seen which are the most popular songs to perform, displayed in the plot to the left. It is not surprising that the four most played songs are all from the band’s self-titled debut album, as these songs have been around for a longer period of time and many of them are considered live staples. Meanwhile, some of the more well known songs from later albums, such as “The Devil in I”, “Psychosocial” and “Dead Memories”, can be seen further down in the top twenty, mixed in with songs from the first three albums.

When looking at the total number of times songs from each album have been performed, a bias towards earlier material in this data becomes even more obvious. Hence simply taking the most played songs and trying to use these to predict a set list may not be representative, especially considering that the current tour is in support of the most recent album “We Are Not Your Kind” from 2019, so it would be expected that a lot of this newer material will be performed. In particular, two songs from the new album, “Birth of the Cruel” and “Nero Forte”, had only been performed once at the time of retrieving this data, and that was the evening before the data was retrieved, so these songs are considerably more likely to be played than the data suggests.

In order to gain more representative data for what a contemporary SlipKnot setlist would look like, the number of plays for each song was divided by the number of years since its release, giving the average number of performances per year and so making the comparison fairer. The results of this can be seen below. The most performed songs per year are “Unsainted” and “All Out Life”, which are a song from the most recent album and a standalone single released in the lead up to it, neither of which appeared in the top twenty most played songs. Following these are three songs from the first album, showing that there is certainly some bias in the setlist choices towards old fan favourites. The sixth to twentieth spots are now occupied by a relatively even mix of songs from all albums. When looking at the total number of times a song has been played from each album using the modified data, there still seems to be a clear preference towards songs from the self-titled debut album, and also a clear dislike towards performing songs from the fourth album “All Hope Is Gone”. Given that past SlipKnot setlists tend to consist of approximately seventeen songs, taking the top seventeen songs (from “All Out Life” to “The Blister Exists”) from the lower left hand figure may give some idea of a likely setlist. This predicted setlist will be compared to the set actually performed in my next blog post, to be posted after the concert.
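
The adjustment described above can be sketched as follows. The song names, play counts and dates here are made up purely for illustration and are not the real data. The post does not say how songs less than a year old were treated, so the `min_years` floor is an assumption of this sketch, as are the function names.

```python
from datetime import date


def plays_per_year(total_plays, release_date, as_of, min_years=1.0):
    """Average number of performances per year since release.

    min_years is a hypothetical floor so that very recent songs are not
    divided by a tiny fraction of a year.
    """
    years = (as_of - release_date).days / 365.25
    return total_plays / max(years, min_years)


def rank_songs(catalogue, as_of):
    """Return song names ordered from most to fewest plays per year.

    catalogue maps a song name to a (total_plays, release_date) pair.
    """
    rates = {
        name: plays_per_year(plays, released, as_of)
        for name, (plays, released) in catalogue.items()
    }
    return sorted(rates, key=rates.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical catalogue: an old live staple and a recent single.
    catalogue = {
        "old staple": (1000, date(2000, 1, 15)),
        "recent single": (60, date(2019, 7, 15)),
    }
    print(rank_songs(catalogue, as_of=date(2020, 1, 15)))
```

On raw counts the older song would win easily, but once the counts are divided by time since release the recent single comes out ahead, which is the effect described above for “Unsainted” and “All Out Life”.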

STOR-i Annual Conference

Matt Randall — Tue, 14 Jan 2020 16:32:28 +0000

On the 9th and 10th of January I got the opportunity to attend my first conference: the STOR-i annual conference. This served as an excellent opportunity to explore a range of contemporary research topics and start thinking about research areas which I may be interested in investigating further. Topics presented included: optimisation of air traffic scheduling (Alex Jacquillat, MIT), Bayesian optimisation (Henry Moss, STOR-i, Lancaster University), multi-armed bandits (Ciara Pike-Burke, Universitat Pompeu Fabra), time series (Richard Davis, Columbia University), machine learning (Tom Flowerdew, Featurespace) and credit risk assessment (Veronica Viniciotti, Brunel University).

Of particular interest to me was the presentation by STOR-i alumna Ciara Pike-Burke, now of Universitat Pompeu Fabra in Barcelona, titled “Multi-armed bandit problems with history dependent rewards”. Multi-armed bandits are an area of much research interest at Lancaster University. They get their name from an analogy with a traditional one-armed bandit gambling machine (pictured to the right). A one-armed bandit is a game which gives random rewards to the player based purely on some unknown probability distribution; it gets its name from the large metal arm which had to be pulled on earlier machines in order to start the game. The concept of a multi-armed bandit is similar, except that rather than there being a single arm to pull, there are multiple arms, each with a different unknown distribution of possible outcomes, representing a number of different options that may be taken in a given scenario with unknown reward. Hence multi-armed bandit problems tend to deal with choosing which arm to pull so as to maximise reward, given that the only way to gain information about the reward distribution of each arm is by pulling it.
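
To make the trade-off concrete, here is a minimal sketch of one common baseline strategy, usually called epsilon-greedy: most of the time pull the arm that currently looks best, and occasionally pull a random arm to keep learning about the others. This is not the method from the talk, only a standard textbook approach, and the arm probabilities, function name and parameters are all made up for the example. Each arm pays a reward of 1 with a fixed probability and 0 otherwise.

```python
import random


def epsilon_greedy(arm_probs, n_pulls, epsilon=0.1, seed=0):
    """Play a bandit with win-or-lose arms using epsilon-greedy.

    arm_probs holds the true win probability of each arm; the player
    never sees these directly, only the rewards from pulling.
    Returns pull counts, estimated mean rewards and total reward.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total_reward = 0
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = 1 if rng.random() < arm_probs[arm] else 0
        counts[arm] += 1
        # Running average of the rewards seen from this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward
    return counts, means, total_reward


if __name__ == "__main__":
    counts, means, total = epsilon_greedy([0.2, 0.5, 0.8], n_pulls=5000)
    print(counts)
    print([round(m, 2) for m in means])
    print(total)
```

Note that this sketch assumes each arm's reward distribution stays fixed no matter what has happened before.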

Normally, multi-armed bandit problems assume that the reward of each action is independent of all past rewards; however, this is not always the case. The example given in the presentation to illustrate this was web advertising. Here an advertiser may suggest a product to a user, and there is some chance that the user will buy the product. For certain products, however, if a user has already purchased that product (that is, the advertiser has “won” the game and been rewarded), the same user may be unlikely to purchase another one; for example, a user who has bought a new oven is unlikely to want another. Thus, the reward distribution for suggesting that this particular user buys an oven (pulling the “suggest oven” arm) has changed. On the other hand, it may be that the reward distribution for suggesting cookery equipment has changed, and hence this is the “arm” which the advertiser should be pulling instead. The research being done by Ciara involves developing statistical methods to determine which arms have history dependent rewards and which do not.
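
The oven example can be mimicked with a toy arm whose chance of paying out drops once it has paid out. This is a deliberately crude model invented for illustration, not the model used in Ciara's research; the class name and the two probabilities are hypothetical.

```python
import random


class HistoryDependentArm:
    """An arm whose win probability changes after its first reward.

    p_before applies until the user has "bought the oven"; p_after
    applies to every later pull of the same arm.
    """

    def __init__(self, p_before, p_after, rng):
        self.p_before = p_before
        self.p_after = p_after
        self.rng = rng
        self.purchased = False

    def pull(self):
        p = self.p_after if self.purchased else self.p_before
        reward = 1 if self.rng.random() < p else 0
        if reward:
            self.purchased = True  # the history now affects future rewards
        return reward


if __name__ == "__main__":
    oven = HistoryDependentArm(p_before=0.3, p_after=0.01,
                               rng=random.Random(0))
    rewards = [oven.pull() for _ in range(50)]
    print(sum(rewards), oven.purchased)
```

A strategy which assumes fixed reward distributions would keep averaging the early and late behaviour of such an arm together, which is why it matters to identify which arms behave this way.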