#66DaysOfData Resources Datasets

August 1, 2021 — by Roberto Cadili

The datasets for the #66daysofdata challenge

The core of the #66daysofdata with KNIME project draws on three Spotify datasets that are freely available on Kaggle (sign in to download them). Because the Kaggle descriptions say little about the individual columns, here is a brief overview.

The tracks.csv dataset contains about 600k tracks from the period 1900-2021 and is described by 20 columns:

id track unique ID
name track name
duration_ms duration of song in milliseconds
explicit describes the content type of a track. Explicit content is represented by 1 and non-explicit content by 0
artists artist name
id_artists artist unique ID (collection)
release_date date when track was released
danceability describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable)
energy represents a perceptual measure of intensity and activity. Values range from 0.0 (least energetic) to 1.0 (most energetic)
key the estimated overall key of the track e.g., 0 = C, 1 = C♯/D♭, 2 = D, etc.
loudness the overall loudness of a track in decibels
mode indicates the modality (major or minor) of a track. Major is represented by 1 and minor is 0
speechiness detects the presence of spoken words in a track. Values range from 0.0 (least speechy) to 1.0 (most speechy)
acousticness a measure of whether the track is acoustic. Values range from 0.0 (least acoustic) to 1.0 (most acoustic)
instrumentalness predicts whether a track contains no vocals. Values range from 0.0 (least instrumental) to 1.0 (most instrumental)
liveness detects the presence of an audience in the recording. Values range from 0.0 (least live) to 1.0 (most live)
valence describes the musical positiveness/negativeness conveyed by a track. Values range from 0.0 (least positive) to 1.0 (most positive)
tempo the overall estimated tempo of a track in beats per minute (BPM)
time_signature tells how the music is to be counted
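
For readers who want to inspect the encoded columns outside KNIME, here is a minimal Python sketch of how a few of them can be decoded. It is only an illustration: the two sample rows are made up, `decode_track` and `PITCH_CLASSES` are hypothetical names, and only a handful of the columns are shown. The key names follow standard pitch-class notation (0 = C, 1 = C♯/D♭, 2 = D, and so on up to 11 = B).

```python
import csv
import io

# Made-up sample standing in for a few columns of tracks.csv.
SAMPLE_TRACKS = (
    "id,name,duration_ms,explicit,key,mode\n"
    "t1,Example Song A,210000,0,0,1\n"
    "t2,Example Song B,185500,1,1,0\n"
)

# Standard pitch-class notation, indexed by the value in the key column.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def decode_track(row):
    """Turn the raw string fields of one CSV row into readable values."""
    key = int(row["key"])
    return {
        "name": row["name"],
        # duration_ms is in milliseconds; convert to seconds.
        "duration_s": int(row["duration_ms"]) / 1000,
        # explicit: 1 = explicit content, 0 = non-explicit content.
        "explicit": row["explicit"] == "1",
        # Values outside 0-11 are mapped to None instead of a key name.
        "key_name": PITCH_CLASSES[key] if 0 <= key <= 11 else None,
        # mode: 1 = major, 0 = minor.
        "mode_name": "major" if row["mode"] == "1" else "minor",
    }


tracks = [decode_track(r) for r in csv.DictReader(io.StringIO(SAMPLE_TRACKS))]
for track in tracks:
    print(track)
```

To run this against the real file, replace `io.StringIO(SAMPLE_TRACKS)` with an opened file handle for the downloaded tracks.csv.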

The artist-uris.csv dataset contains data on roughly 81k artists and is described by 2 columns (header names are not provided):

[id_artists] artist unique ID
[artists] artist name
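
Because this file has no header row, the column names have to be supplied when it is read. The sketch below shows one way to do that with Python's standard `csv` module. It assumes the column order listed above (ID first, then name), and the sample rows and the `COLUMNS` and `artist_names` names are made up for illustration.

```python
import csv
import io

# Made-up sample standing in for artist-uris.csv (note: no header row).
SAMPLE_ARTIST_URIS = (
    "a1,Example Artist One\n"
    "a2,Example Artist Two\n"
)

# Assumed column order, following the overview above.
COLUMNS = ["id_artists", "artists"]

# Passing fieldnames tells DictReader to treat the first line as data,
# not as a header.
reader = csv.DictReader(io.StringIO(SAMPLE_ARTIST_URIS), fieldnames=COLUMNS)

# Build a lookup table from artist ID to artist name.
artist_names = {row["id_artists"]: row["artists"] for row in reader}
print(artist_names)
```

Without the explicit `fieldnames`, the first artist in the file would be consumed as the header and silently dropped from the data.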

The artist.csv dataset is very similar to the tracks.csv dataset but also includes a popularity metric for the artists.

popularity the popularity of an artist. Values range from 0 (least popular) to 100 (most popular)

P.S. What is the #66DaysOfData Challenge?

The idea is to spend around 5-10 minutes on a specific data science project each day for 66 days and share your progress on your favorite social media platform with #66daysofdata. Ken Jee is the original instigator of #66daysofdata. Why 66 days? Because that's the average time it takes us to get practiced at doing something, in this case data science with KNIME. Find the full roadmap here.
