The datasets for the #66daysofdata challenge
The core of the #66daysofdata with KNIME project draws on three Spotify datasets freely available on Kaggle (sign in to download them). Because the Kaggle descriptions say little about the individual columns, here is a brief overview.
The tracks.csv dataset contains about 600k tracks from the period 1900-2021 and is described by 20 columns:
id | track unique ID |
name | track name |
duration_ms | duration of song in milliseconds |
explicit | describes the content type of a track. Explicit content is represented by 1 and non-explicit content by 0 |
artists | artist name |
id_artists | artist unique ID (collection) |
release_date | date when track was released |
danceability | describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable) |
energy | represents a perceptual measure of intensity and activity. Values range from 0.0 (least energetic) to 1.0 (most energetic) |
key | the estimated overall key of the track e.g., 0 = C, 1 = C♯/D♭, 2 = D, etc. |
loudness | the overall loudness of a track in decibels |
mode | indicates the modality (major or minor) of a track. Major is represented by 1 and minor is 0 |
speechiness | detects the presence of spoken words in a track. Values range from 0.0 (least speechy) to 1.0 (most speechy) |
acousticness | a measure of whether the track is acoustic. Values range from 0.0 (least acoustic) to 1.0 (most acoustic) |
instrumentalness | predicts whether a track contains no vocals. Values range from 0.0 (least instrumental) to 1.0 (most instrumental) |
liveness | detects the presence of an audience in the recording. Values range from 0.0 (least live) to 1.0 (most live) |
valence | describes the musical positiveness/negativeness conveyed by a track. Values range from 0.0 (least positive) to 1.0 (most positive) |
tempo | the overall estimated tempo of a track in beats per minute (BPM) |
time_signature | the estimated time signature, i.e. how many beats are in each bar |
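Several of these columns are numeric codes rather than readable labels. As a minimal sketch of how they could be decoded, the Python snippet below turns the key and mode columns into a label and converts duration_ms to minutes. The helper names `describe_key` and `duration_minutes` are hypothetical, the pitch-class list simply continues the 0 = C, 1 = C♯/D♭, 2 = D pattern from the table, and treating any out-of-range key as "unknown" is an assumption.

```python
# Pitch-class names, continuing the pattern from the overview table
# (0 = C, 1 = C♯/D♭, 2 = D, ...). The full list is an assumption.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def describe_key(key: int, mode: int) -> str:
    """Turn the numeric key and mode columns into a label such as 'D minor'."""
    if not 0 <= key < len(PITCH_CLASSES):
        # Assumption: any value outside 0-11 means no key is available.
        return "unknown"
    modality = "major" if mode == 1 else "minor"  # 1 = major, 0 = minor
    return f"{PITCH_CLASSES[key]} {modality}"


def duration_minutes(duration_ms: int) -> float:
    """Convert the duration_ms column from milliseconds to minutes."""
    return duration_ms / 60_000


print(describe_key(2, 0))
print(duration_minutes(210_000))
```

Inside a KNIME workflow the same mapping could be done with the platform's own nodes; the snippet only illustrates what the codes mean.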
The artist-uris.csv dataset contains data on roughly 81k artists and is described by 2 columns (header names are not provided):
[id_artists] | artist unique ID |
[artists] | artist name |
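Because this file has no header row, the column names have to be supplied when reading it. The sketch below shows one way to do that in Python with the standard csv module. The two sample rows are invented placeholders, not real data, `read_artist_uris` is a hypothetical helper, and the column order (ID first, name second) is assumed from the overview above.

```python
import csv
import io

# Invented two-row stand-in for artist-uris.csv (note: no header row).
SAMPLE = "id_0001,Example Artist A\nid_0002,Example Artist B\n"

# Names borrowed from the overview; the column order is an assumption.
FIELDNAMES = ["id_artists", "artists"]


def read_artist_uris(handle):
    """Read a headerless two-column CSV into a list of dicts."""
    # Passing fieldnames tells DictReader that the first row is data,
    # not a header.
    reader = csv.DictReader(handle, fieldnames=FIELDNAMES)
    return list(reader)


rows = read_artist_uris(io.StringIO(SAMPLE))
print(rows[0]["artists"])
```

For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, for example `open("artist-uris.csv", newline="", encoding="utf-8")`, assuming the file is comma-separated and UTF-8 encoded.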
The artist.csv dataset is very similar to the tracks.csv dataset but also includes a popularity metric for the artists.
popularity | the popularity of an artist. Values range from 0 (least popular) to 100 (most popular) |
P.S. What is the #66DaysOfData Challenge?
The idea is to spend around 5-10 minutes on a specific data science project each day for 66 days and share your progress on your favorite social media platform with #66daysofdata. Ken Jee is the original instigator of #66daysofdata. Why 66 days? Because that's the average time it takes to get practiced at something: in this case, data science with KNIME. Find the full roadmap here.