Tutorial 3: Adding data to the database
Korus stores acoustic annotations and metadata in a number of cross-referenced tables. The main tables are,
- deployment: metadata about hydrophone deployments
- file: metadata pertaining to audio files
- taxonomy: taxonomies used for annotating the acoustic data
- job: metadata about annotation jobs
- annotation: acoustic annotations
In this tutorial you will learn how to create a Korus database and add data to it.
Getting ready
We begin by importing the necessary modules, classes, functions, etc.
import os
import shutil
import json
import sqlite3
import pandas as pd
from datetime import datetime, timezone
from korus.database import SQLiteDatabase
from korus.audio import collect_audiofile_metadata
Connecting to the database
We begin by connecting to the database from Tutorial 1, where we created an annotation taxonomy.
# make a copy of the database from Tutorial 1
# (if the database file already exists, delete it first)
db_path = "mydb.sqlite"
if os.path.exists(db_path):
os.remove(db_path)
shutil.copyfile("tax_t1.sqlite", db_path)
# connect to the database
db = SQLiteDatabase(db_path)
Adding a hydrophone deployment
Having created the database, the next step is usually to add a hydrophone deployment. The Korus schema allows you to enter various relevant metadata pertaining to a hydrophone deployment. For example, the time period during which the hydrophone was operating, the geographical location (latitude and longitude), and the deployment depth, to name but a few.
To view all the metadata fields, you can do
print(db.deployment.info())
Name: deployment

Fields:
Name           Type      Description                                                     Required    Default Value    Allowed Values
-------------  --------  --------------------------------------------------------------  ----------  ---------------  ----------------
name           str       Deployment name                                                 Y           None
start_utc      datetime  Start time of deployment                                        N           None
end_utc        datetime  End time of deployment                                          N           None
latitude_deg   float     Deployment latitude in degrees                                  N           None
longitude_deg  float     Deployment longitude in degrees                                 N           None
depth_m        float     Deployment depth in meters                                      N           None
trajectory     list      Sequence of timestamped coordinates (timestamp,lat,lon,depth)  N           None
This lists the names and the types of all the columns in the deployment table.
While you should always try to enter data for as many metadata fields as possible, the majority of the fields are optional. In fact, the only metadata field required for submitting a hydrophone deployment to the database is the name field.
Thus, we can easily add a deployment to the database, providing only a descriptive name,
deployment_id = db.deployment.add({"name": "My First Hydrophone Deployment"})
print(deployment_id) #id assigned to the entry just added
0
Now, let's add a more comprehensive set of metadata to the deployment table.
# collect the data in a dictionary
data = {
"name": "Seaforth2020", # descriptive name for the deployment
"start_utc": datetime(2020, 6, 1, tzinfo=timezone.utc), # UTC start time of recording
"end_utc": datetime(2020, 10, 31, tzinfo=timezone.utc), # UTC end time of recording
"latitude_deg": 44.660784, # latitude, in degrees
"longitude_deg": -63.253176, # longitude, in degrees
"depth_m": 8.0, # depth, in meters
}
# submit it to the database
deployment_id = db.deployment.add(data)
Now, let's query the table to check that the data was correctly inserted. Here we obtain the names, UTC start times, and locations of all the entries in the deployment table.
df = db.deployment.get(fields=["name","start_utc","latitude_deg","longitude_deg"], as_pandas=True)
print(df.to_string())
                             name                 start_utc  latitude_deg  longitude_deg
0  My First Hydrophone Deployment                       NaT           NaN            NaN
1                    Seaforth2020 2020-06-01 00:00:00+00:00     44.660784     -63.253176
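As an aside, Korus tables can also be filtered. We will make proper use of this later in the tutorial, but as a quick illustration, here is a sketch of how one might look up the ID of a deployment by its name (this assumes that filter conditions can match directly on the name field, which we have not verified),
# look up the deployment ID by name (a sketch)
ids = db.deployment.reset_filter().filter({"name": "Seaforth2020"}).indices
print(ids) # expected: [1]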
Before proceeding, let's close the connection to the database.
db.backend.close()
Adding audio files
Having successfully added two hydrophone deployments, let us now add some audio files to the database. Note that we are only adding information about the audio files (metadata), not the acoustic data contained within the files. We begin by reconnecting to the database.
db = SQLiteDatabase(db_path)
Korus stores audio file metadata in a table named file. Let us inspect its columns,
print(db.file.info())
Name: file

Fields:
Name           Type      Description                                                                                        Required    Default Value    Allowed Values
-------------  --------  -------------------------------------------------------------------------------------------------  ----------  ---------------  ----------------
deployment_id  int       Deployment index                                                                                   Y           None
storage_id     int       Storage index                                                                                      Y           None
filename       str       Filename                                                                                           Y           None
relative_path  str       Directory path                                                                                     Y
sample_rate    int       Sampling rate in Hz                                                                                Y           None
num_samples    int       Number of samples                                                                                  Y           None
start_utc      datetime  Start time of recording (UTC)                                                                      N           None
end_utc        datetime  End time of recording (UTC) [inferred from the start time, number of samples, and sampling rate]   N           None
The first and second columns (deployment_id, storage_id) hold the ID of the hydrophone deployment that produced the audio file and the ID of the data-storage location, respectively. Note that these are required fields, meaning you will not be able to add an audio file without linking it to an (already existing) deployment and data-storage location in the database.
You can use a query like this to view the IDs and names of all the deployments in the database,
df = db.deployment.get(fields=["name"], return_indices=True, as_pandas=True)
print(df.to_string())
                              name
id
0   My First Hydrophone Deployment
1                     Seaforth2020
You are also required to specify the audio filename (filename) and the relative directory path (relative_path). The last two required fields are the sampling rate (sample_rate) and the number of samples (num_samples). The remaining fields are all optional.
Let us try to add a single audio file to the database, linking it to the first deployment and providing only the minimum required metadata.
# first, add the data-storage location
data = {
"name": "My personal laptop",
"path": "/home/mrmoose/acoustic-data/",
}
storage_id = db.storage.add(data)
# now, add a single audio file
data = {
"deployment_id": 0,
"storage_id": storage_id,
"filename": "a-10-second-long-recording.wav",
"relative_path": "20200602",
"sample_rate": 32000, #sampling rate in samples/s
"num_samples": 320000 #file size, no. samples
}
file_id = db.file.add(data)
print(file_id)
0
This works well for a single audio file, but if your dataset consists of many files, entering the metadata for each file manually quickly becomes tedious. In such cases, you may want to automate the metadata collection. In particular, you may want to automatically
- generate a list of all the audio files in a given directory
- parse timestamps from the filenames
- inspect the files to determine their sampling rate and size
Fortunately, Korus has a function that does exactly this. Let's see how it works.
# define a function for parsing the timestamps embedded in the audio filenames
# this function should take the filename as input (in the form of a string) and
# return the UTC start time of the file (in the form of a datetime object)
def timestamp_parser(x):
fmt = "%Y%m%dT%H%M%S.%f"
p = x.rfind(".")
s = x[p-20: p-1]
return datetime.strptime(s, fmt)
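# before scanning a whole directory, it can be useful to sanity-check the parser
# on a single filename; here we try it on one of the filenames that appears below
# (expected output: 2013-06-23 08:00:00.116000)
print(timestamp_parser("audio_20130623T080000.116Z.flac"))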
# now let's collect metadata on all the FLAC files in the ./data folder (including subfolders)
df = collect_audiofile_metadata("./data", "FLAC", timestamp_parser)
# view the contents of the returned dataframe
print(df.to_string())
  format                start_utc  num_samples  sample_rate                  end_utc                         filename relative_path
0   FLAC  2013-06-23 08:00:00.116      1800055         1000  2013-06-23 08:30:00.171  audio_20130623T080000.116Z.flac      20130623
1   FLAC  2013-06-23 08:30:00.117      1800055         1000  2013-06-23 09:00:00.172  audio_20130623T083000.117Z.flac      20130623
2   FLAC  2013-06-24 08:00:00.118      1800055         1000  2013-06-24 08:30:00.173  audio_20130624T080000.118Z.flac      20130624
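Note how the end_utc column was inferred from the other metadata: for example, 1800055 samples at 1000 samples/s amounts to 1800.055 s of audio, so the first file, starting at 08:00:00.116, ends at 08:30:00.171.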
The collect_audiofile_metadata function returns the metadata in the form of a pandas DataFrame. Having reviewed the metadata, we are now ready to submit it to the database. We do this one entry at a time,
for _,row in df.iterrows(): #loop over all rows in the dataframe
data = row.to_dict() #convert pd.Series to dict
# additional required metadata
data["storage_id"] = 0
data["deployment_id"] = 1
# drop 'format'
data.pop("format")
# insert in the 'file' table
db.file.add(data)
Let's check that things look alright by querying for the deployment ID, filename, and UTC start times of all the audio files in the database,
df = db.file.get(fields=["deployment_id", "filename", "start_utc"], as_pandas=True)
print(df.to_string())
   deployment_id                         filename                        start_utc
0              0   a-10-second-long-recording.wav                              NaT
1              1  audio_20130623T080000.116Z.flac 2013-06-23 08:00:00.116000+00:00
2              1  audio_20130623T083000.117Z.flac 2013-06-23 08:30:00.117000+00:00
3              1  audio_20130624T080000.118Z.flac 2013-06-24 08:00:00.118000+00:00
Happy with the changes made, we close the connection.
db.backend.close()
Adding an annotation job
As always, we begin by reconnecting to the database,
db = SQLiteDatabase(db_path)
Let us inspect all the columns of the job table,
print(db.job.info())
Name: job

Fields:
Name             Type      Description                                 Required    Default Value    Allowed Values
---------------  --------  ------------------------------------------  ----------  ---------------  ----------------
taxonomy_id      int       Taxonomy index                              N           None
annotator        str       Name of person who annotated the data       N           None
is_exhaustive    bool      Whether all primary sounds were annotated   N           None
target           list      Systematically annotated sounds             N           None
completion_date  datetime  Completion date of annotation effort       N           None
Before adding our first annotation job, let us find out how many versions of the taxonomy have been released,
df = db.taxonomy.get(fields=["version","comment"], return_indices=True, as_pandas=True)
print(df.to_string())
    version                                                                 comment
id
0       NaN                               added some sound sources and sound types
1       1.0                   this is the first version of the annotation taxonomy
2       2.0  merged FU and LU sound types for NARW into a single Upcall sound type
The first entry (id=0) is the current draft (which does not have a version number), while the second and third entries (id=1,2) are the first and second release, respectively. Only released versions of the taxonomy should be used for annotation jobs, as the draft may still change.
To be more concrete, let us assume that the acoustic analyst was tasked with annotating every occurrence of the (faint and loud) upcalls and gunshots made by the North Atlantic right whale, while only annotating anthropogenic sounds opportunistically. You would then specify
taxonomy_id = 1
target = [("NARW","FU"), ("NARW","LU"), ("NARW","GS")]
is_exhaustive = True
Let us now collect all the metadata in a dictionary and submit the entry to the database as follows,
# collect all the data in a dict
data = {
"taxonomy_id": taxonomy_id,
"annotator": "AB",
"target": target,
"is_exhaustive": is_exhaustive,
"completion_date": datetime(2023,3,1,tzinfo=timezone.utc),
}
# insert into the database
job_id = db.job.add(data)
print(job_id)
0
Before committing the changes and closing the connection to the database, let us add another annotation job. For this second job, we shall assume that the analyst was using version 2 of the taxonomy and only annotated particularly clear or nice examples of the NARW upcall. (Note that we don't need to specify the "target" sounds for non-exhaustive jobs.)
# collect all the data in a dict
data = {
"taxonomy_id": 2,
"annotator": "CD",
"is_exhaustive": False,
"completion_date": datetime(2023,4,7,tzinfo=timezone.utc),
}
# insert into the database
job_id = db.job.add(data)
print(job_id)
1
db.backend.close()
Linking files to jobs
We have now specified who made the annotations, when they were made, and (most importantly) how they were made. However, one important piece of information is still missing, namely, which audio files were annotated.
To retrieve the audio file IDs, we begin by querying the database for the relevant hydrophone deployment. Specifically, we search for deployments starting between 2019 and 2022,
# re-establish the connection to the database
db = SQLiteDatabase(db_path)
# search the database for deployments based on time period
cond = {"start_utc": (datetime(2019,1,1), datetime(2022,1,1))}
deployment_ids = db.deployment.reset_filter().filter(cond).indices
print(deployment_ids)
[1]
Our query returned one deployment matching the search criteria, with deployment ID = 1. Let us now retrieve the IDs of all the audio files associated with this deployment.
cond = {"deployment_id": 1}
file_ids = db.file.reset_filter().filter(cond).indices
print(file_ids)
[1, 2, 3]
Finally, let us link these three audio files to the first annotation job (id=0), i.e., the job performed by annotator AB in the previous section. Note that the file IDs are all we need for this.
# link files to job, one file at the time
for file_id in file_ids:
db.job.add_file(0, file_id)
Before we conclude this section, let us also link an audio file to the second annotation job,
db.job.add_file(1, 0)
Finally, we remember to close the connection to the database.
db.backend.close()
Adding annotations
We are now finally at the point where we can add annotations to the database. Although Korus contains functions for automatically ingesting annotation tables from CSV files or Excel spreadsheets, we shall first try to add annotations manually, as this will give us a better appreciation for the steps involved.
Let's take a look at the fields of the annotation table,
print(db.annotation.info())
Name: annotation

Fields:
Name                Type      Description                                                 Required    Default Value    Allowed Values
------------------  --------  ----------------------------------------------------------  ----------  ---------------  ----------------
deployment_id       int       Deployment index                                            Y           None
job_id              int       Job index                                                   Y           None
file_id             int       File index                                                  N           None
label_id            int       Label index for confident classification                   N           None
tentative_label_id  int       Label index for tentative classification                   N           None
ambiguous_label_id  list      Label indices for ambiguous classification                 N           None
excluded_label_id   list      Label indices for excluded classes                          N           None
multiple_label_id   list      Label indices for multiple (batch) classification          N           None
tag_id              list      Tag indices                                                 N           None
granularity_id      int       Granularity index                                           Y           1
negative            bool      Automatically generated negative                            Y           False
num_files           int       Number of audio files                                       Y           1
file_id_list        list      File indices                                                N           None
start_utc           datetime  UTC start time                                              N           None
duration_ms         int       Duration in milliseconds                                    N           None
start_ms            int       Start time in milliseconds from the beginning of the file  N           None
freq_min_hz         int       Lower frequency bound in Hz                                 N           None
freq_max_hz         int       Upper frequency bound in Hz                                 N           None
channel             int       Hydrophone channel                                          Y           0
valid               bool      Validation status                                           Y           True
comments            str       Additional observations                                     N           None

Aliases:
Field               Alias            Type   Description
------------------  ---------------  -----  ----------------------------------------------------
start_ms            start            float  Start time in seconds from the beginning of the file
duration_ms         duration         float  Duration in seconds
tag_id              tag              list   Tag name
granularity_id      granularity      str    Granularity level
label_id            label            tuple  Specify label tuples in place of label IDs
tentative_label_id  tentative_label  tuple  Specify label tuples in place of label IDs
ambiguous_label_id  ambiguous_label  list   Specify label tuples in place of label IDs
excluded_label_id   excluded_label   list   Specify label tuples in place of label IDs
multiple_label_id   multiple_label   list   Specify label tuples in place of label IDs
Note how the annotation table has several "Aliases" which can be used in place of the standard "Fields" if desired.
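For example, with aliases an annotation can be specified in a more human-readable form. The dictionary below is purely illustrative (all values are made up), showing how a label tuple, a granularity name, and start/duration values in seconds can stand in for the corresponding IDs and millisecond fields,
# an illustrative annotation, specified using aliases (all values are made up)
annot = {
    "job_id": 0,
    "file_id": 1,
    "label": ("NARW", "LU"),  # alias for label_id: a (sound source, sound type) tuple
    "granularity": "window",  # alias for granularity_id: granularity level by name
    "start": 1128.84,         # alias for start_ms: seconds from the start of the file
    "duration": 3.0,          # alias for duration_ms: duration in seconds
}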
In this example, we are given a CSV file with the two columns filename and call_time, specifying the names of the audio files and the within-file offsets (in seconds) of the acoustic signals identified by the acoustic analyst.
We shall further assume that the analyst was using version 1 of the annotation taxonomy, and that all the entries in the CSV file are instances of the 'Loud Upcall' (LU) attributed to North Atlantic right whales (NARW).
df = pd.read_csv("data/annotations.csv") #load data from CSV file into a Pandas DataFrame
print(df.to_string())
                           filename  call_time
0   audio_20130623T080000.116Z.flac   1128.840
1   audio_20130623T080000.116Z.flac   1153.526
2   audio_20130623T080000.116Z.flac   1196.778
3   audio_20130623T080000.116Z.flac   1227.642
4   audio_20130623T080000.116Z.flac   1358.181
5   audio_20130623T080000.116Z.flac   1437.482
6   audio_20130623T080000.116Z.flac   1489.288
7   audio_20130623T080000.116Z.flac   1511.670
8   audio_20130623T080000.116Z.flac   1530.595
9   audio_20130623T080000.116Z.flac   1536.580
10  audio_20130623T080000.116Z.flac   1714.372
11  audio_20130623T080000.116Z.flac   1768.251
12  audio_20130623T080000.116Z.flac   1777.835
13  audio_20130623T083000.117Z.flac     68.149
14  audio_20130623T083000.117Z.flac    688.507
15  audio_20130623T083000.117Z.flac    755.940
16  audio_20130623T083000.117Z.flac    770.440
17  audio_20130624T080000.118Z.flac     68.853
18  audio_20130624T080000.118Z.flac    105.927
19  audio_20130624T080000.118Z.flac   1057.015
20  audio_20130624T080000.118Z.flac   1067.282
21  audio_20130624T080000.118Z.flac   1290.563
22  audio_20130624T080000.118Z.flac   1378.955
23  audio_20130624T080000.118Z.flac   1428.648
24  audio_20130624T080000.118Z.flac   1663.622
25  audio_20130624T080000.118Z.flac   1676.682
In order to submit these annotations to the database, we will need to look up the file identifiers. We will also need to assign each annotation a start time and a duration, in place of the call_time, which corresponds roughly to the midpoint of the call. Since the durations are unknown, we will use a fixed-size window of 3.0 seconds, amply sufficient to capture an entire upcall, which typically lasts only about 1 second.
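For example, the first call time of 1128.840 s will translate into start_ms = (1128.840 - 1.5) x 1000 = 1127340 and duration_ms = 3000.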
Let's begin by looking up the file identifiers and adding them to the data frame.
# list the unique filenames that appear in the annotation table
filenames = df["filename"].unique().tolist()
print(filenames)
['audio_20130623T080000.116Z.flac', 'audio_20130623T083000.117Z.flac', 'audio_20130624T080000.118Z.flac']
# re-establish the connection to the database
db = SQLiteDatabase(db_path)
# get the file IDs
data = db.file.get(fields="filename", return_indices=True)
file_ids = {filename: file_id for file_id,filename in data if filename in filenames}
print(file_ids)
{'audio_20130623T080000.116Z.flac': 1, 'audio_20130623T083000.117Z.flac': 2, 'audio_20130624T080000.118Z.flac': 3}
# finally, let's add a column to our data frame with the file IDs
df["file_id"] = df["filename"].apply(lambda x: file_ids[x])
print(df)
                           filename  call_time  file_id
0   audio_20130623T080000.116Z.flac   1128.840        1
1   audio_20130623T080000.116Z.flac   1153.526        1
2   audio_20130623T080000.116Z.flac   1196.778        1
3   audio_20130623T080000.116Z.flac   1227.642        1
4   audio_20130623T080000.116Z.flac   1358.181        1
5   audio_20130623T080000.116Z.flac   1437.482        1
6   audio_20130623T080000.116Z.flac   1489.288        1
7   audio_20130623T080000.116Z.flac   1511.670        1
8   audio_20130623T080000.116Z.flac   1530.595        1
9   audio_20130623T080000.116Z.flac   1536.580        1
10  audio_20130623T080000.116Z.flac   1714.372        1
11  audio_20130623T080000.116Z.flac   1768.251        1
12  audio_20130623T080000.116Z.flac   1777.835        1
13  audio_20130623T083000.117Z.flac     68.149        2
14  audio_20130623T083000.117Z.flac    688.507        2
15  audio_20130623T083000.117Z.flac    755.940        2
16  audio_20130623T083000.117Z.flac    770.440        2
17  audio_20130624T080000.118Z.flac     68.853        3
18  audio_20130624T080000.118Z.flac    105.927        3
19  audio_20130624T080000.118Z.flac   1057.015        3
20  audio_20130624T080000.118Z.flac   1067.282        3
21  audio_20130624T080000.118Z.flac   1290.563        3
22  audio_20130624T080000.118Z.flac   1378.955        3
23  audio_20130624T080000.118Z.flac   1428.648        3
24  audio_20130624T080000.118Z.flac   1663.622        3
25  audio_20130624T080000.118Z.flac   1676.682        3
Now, let's assign start times and durations for every annotation:
# compute the annotation start time in milliseconds
df["start_ms"] = (df["call_time"] - 1.5) * 1e3
# set the duration to 3.0 seconds
df["duration_ms"] = 3000
# cast both as integers (the type used for storing these quantities in the database)
df["start_ms"] = df["start_ms"].astype("int")
df["duration_ms"] = df["duration_ms"].astype("int")
# drop the - now obsolete - filename and call_time columns
df = df.drop(columns=["filename","call_time"])
# specify the 'granularity' of each annotation
df["granularity"] = "window"
# specify the job id
df["job_id"] = 0
print(df)
    file_id  start_ms  duration_ms granularity  job_id
0         1   1127340         3000      window       0
1         1   1152026         3000      window       0
2         1   1195278         3000      window       0
3         1   1226142         3000      window       0
4         1   1356681         3000      window       0
5         1   1435982         3000      window       0
6         1   1487788         3000      window       0
7         1   1510170         3000      window       0
8         1   1529095         3000      window       0
9         1   1535080         3000      window       0
10        1   1712872         3000      window       0
11        1   1766751         3000      window       0
12        1   1776335         3000      window       0
13        2     66649         3000      window       0
14        2    687007         3000      window       0
15        2    754440         3000      window       0
16        2    768940         3000      window       0
17        3     67353         3000      window       0
18        3    104427         3000      window       0
19        3   1055515         3000      window       0
20        3   1065782         3000      window       0
21        3   1289063         3000      window       0
22        3   1377455         3000      window       0
23        3   1427148         3000      window       0
24        3   1662122         3000      window       0
25        3   1675182         3000      window       0
Finally, we must add a column with the appropriate annotation label. Here we use the label alias, which accepts (sound source, sound type) tuples in place of numeric label IDs,
df["label"] = [("NARW","LU") for _ in range(len(df))]
We are now ready to submit the annotations to the database.
annot_ids = db.annotation.add_batch(df)
print(annot_ids)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Another useful feature of Korus is the generate_negatives function, which automatically generates annotations for the 'quiet' periods, i.e., periods during which no 'primary sounds' were heard.
neg_ids = db.annotation.generate_negatives(job_id=0)
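To inspect the result, we can query the annotation table in the same way as the other tables. The sketch below assumes that the negative field can be included in the query like any other field,
# retrieve the automatically generated negatives (a sketch)
df_neg = db.annotation.get(fields=["file_id", "start_ms", "duration_ms", "negative"], as_pandas=True)
print(df_neg[df_neg["negative"]].to_string())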
Content that the annotations have been inserted into our database, we commit the changes and close the connection,
db.backend.close()
Customization
It is straightforward to add custom fields to the tables in the Korus database. For example,
db = SQLiteDatabase(db_path)
db.deployment.add_field(
name = "hydrophone_color",
type = str,
description = "Hydrophone color",
required = False,
)
print(db.deployment.info())
db.backend.close()
Name: deployment

Fields:
Name              Type      Description                                                     Required    Default Value    Allowed Values
----------------  --------  --------------------------------------------------------------  ----------  ---------------  ----------------
name              str       Deployment name                                                 Y           None
start_utc         datetime  Start time of deployment                                        N           None
end_utc           datetime  End time of deployment                                          N           None
latitude_deg      float     Deployment latitude in degrees                                  N           None
longitude_deg     float     Deployment longitude in degrees                                 N           None
depth_m           float     Deployment depth in meters                                      N           None
trajectory        list      Sequence of timestamped coordinates (timestamp,lat,lon,depth)  N           None
hydrophone_color  str       Hydrophone color                                                N           None
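Once added, the custom field can be populated like any other. For example (a sketch, with a made-up deployment),
# add a deployment that makes use of the new custom field
db = SQLiteDatabase(db_path)
db.deployment.add({"name": "My Colorful Deployment", "hydrophone_color": "yellow"})
db.backend.close()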