Tutorial 3
Adding data to the database
Korus stores acoustic annotations and metadata in an SQLite database. The data are organized in several cross-referenced tables. The main tables are,
- deployment: metadata pertaining to hydrophone deployments
- file: metadata pertaining to audio files
- taxonomy: taxonomies used for annotating the acoustic data
- job: metadata pertaining to annotation jobs
- annotation: acoustic annotations
In this tutorial you will learn how to create a Korus database and add data to it.
Getting ready
We begin by importing the necessary modules, classes, functions, etc.
import os
import json
import sqlite3
import pandas as pd
from datetime import datetime
import korus.db as kdb
from korus.util import collect_audiofile_metadata
from korus.tax import AcousticTaxonomy
Creating a database
Creating an empty SQLite database with the Korus schema is straightforward. Just specify the desired path of the database file and call the create_db function. (If a file already exists at the specified path, you will be asked if you want to overwrite it.)
path_db = "mydb.sqlite"
conn = kdb.create_db(path_db)
mydb.sqlite already exists, overwrite? [y/N]y
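To see exactly which tables the Korus schema created, you can query SQLite's built-in sqlite_master table. (The precise list of tables may vary between Korus versions.)
# list all tables in the newly created database
cursor = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
print([row[0] for row in cursor.fetchall()])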
The call to create_db returns an open database connection, conn, which we will use to add data to the database. Once we are done, we must remember to commit the changes with conn.commit() and finally close the connection with conn.close().
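If you prefer not to track the open connection manually, the standard library's contextlib.closing helper guarantees the connection is closed even if an error occurs. This is a generic sqlite3 pattern rather than a Korus feature, and the database path below is purely illustrative.
from contextlib import closing
# create a (hypothetical) database and ensure the connection is closed afterwards
with closing(kdb.create_db("scratch.sqlite")) as conn:
    # ... add data here, then commit ...
    conn.commit()
# the connection is closed automatically when the block exits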
Adding a hydrophone deployment
Having created the database, the next step is usually to add a hydrophone deployment. The Korus schema allows you to enter various relevant metadata pertaining to a hydrophone deployment. For example, the time period during which the hydrophone was operating, the geographical location (latitude and longitude), and the deployment depth, to name but a few.
To view all the metadata fields, you can do
cursor = conn.execute("SELECT name,type FROM PRAGMA_TABLE_INFO('deployment')")
columns = cursor.fetchall()
print(columns)
[('id', 'INTEGER'), ('name', 'TEXT'), ('owner', 'TEXT'), ('start_utc', 'TEXT'), ('end_utc', 'TEXT'), ('location', 'TEXT'), ('latitude_deg', 'REAL'), ('longitude_deg', 'REAL'), ('depth_m', 'REAL'), ('trajectory', 'JSON'), ('latitude_min_deg', 'REAL'), ('latitude_max_deg', 'REAL'), ('longitude_min_deg', 'REAL'), ('longitude_max_deg', 'REAL'), ('depth_min_m', 'REAL'), ('depth_max_m', 'REAL'), ('license', 'TEXT'), ('hydrophone', 'TEXT'), ('bits_per_sample', 'INTEGER'), ('sample_rate', 'INTEGER'), ('num_channels', 'INTEGER'), ('sensitivity', 'REAL'), ('comments', 'TEXT')]
This lists the names and types of all the columns in the deployment table.
While you should always try to enter data for as many metadata fields as possible, the majority of the fields are optional. In fact, the only metadata field required for submitting a hydrophone deployment to the database is the name field. (The id field is automatically populated.)
Thus, using standard SQLite syntax, we can easily add a deployment to the database, providing only a descriptive name,
cursor = conn.cursor()
cursor = cursor.execute("INSERT INTO deployment (name) VALUES ('My First Hydrophone Deployment')")
Now, let's add a more comprehensive set of metadata to the deployment table. This time, we will use Korus's insert_row function, which conveniently allows us to submit the data in the form of a dictionary.
# collect the data in a dictionary
data = {
"owner": "Mr. Moose", # owner of the data, can be a person or an organisation
"name": "Seaforth2020", # descriptive name for the deployment
"start_utc": "2020-06-01", # UTC start time of recording
"end_utc": "2020-10-31", # UTC end time of recording
"location": "Seaforth, NS, Canada", # descriptive name for the deployment location
"latitude_deg": 44.660784, # latitude, in degrees
"longitude_deg": -63.253176, # longitude, in degrees
"depth_m": 8.0, # depth, in meters
"license": "CCBY", # license/terms governing access to and use of data
"hydrophone": "VeryFineInstruments Inc., SuperListener101", # hydrophone make and model
"bits_per_sample": 24, # bits per sample
"sample_rate": 128000, # sampling rate in samples/s
"num_channels": 1, # no. channels
"sensitivity": -60, # calibrated sensitivity in ...
"comments": "popular surfing spot on the Canadian east coast" # additional comments
}
# submit it to the database
cursor = kdb.insert_row(conn, table_name="deployment", values=data)
Note that, although not used here, the deployment table also has columns for storing spatio-temporal trajectory coordinates, relevant for mobile hydrophones, e.g., when towed behind a vessel or mounted on a glider.
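For illustration, a trajectory could be serialized to JSON before insertion. The exact structure Korus expects for the trajectory column is not shown in this tutorial, so treat the field names below as hypothetical.
# hypothetical trajectory format: a list of time-stamped position fixes
trajectory = [
    {"time": "2020-06-01 00:00:00", "latitude_deg": 44.6608, "longitude_deg": -63.2532, "depth_m": 8.0},
    {"time": "2020-06-01 01:00:00", "latitude_deg": 44.6612, "longitude_deg": -63.2528, "depth_m": 8.5},
]
data["trajectory"] = json.dumps(trajectory)  # uses the json module imported at the top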
The call to insert_row returns a cursor, just like the execute function above. Using this cursor, we can query the table using standard SQLite syntax to check that the data was correctly inserted. Here we obtain the names, UTC start times, and locations of all the entries in the deployment table.
cursor = cursor.execute("SELECT name, start_utc, location FROM deployment")
print(cursor.fetchall())
[('My First Hydrophone Deployment', None, None), ('Seaforth2020', '2020-06-01', 'Seaforth, NS, Canada')]
Finally, make sure to commit the changes, and if you are done interacting with the database, close the connection to it.
conn.commit()
conn.close()
Adding audio files
Having successfully added two hydrophone deployments, let us now add some audio files to the database. Note that we are only adding information about the audio files (metadata), not the acoustic data contained within the files. We begin by reconnecting to the database.
conn = sqlite3.connect(path_db)
Korus stores audio file metadata in a table named file. Let us inspect its columns,
cursor = conn.execute("SELECT name,type FROM PRAGMA_TABLE_INFO('file')")
columns = cursor.fetchall()
print(columns)
[('id', 'INTEGER'), ('deployment_id', 'INTEGER'), ('filename', 'TEXT'), ('dir_path', 'TEXT'), ('location', 'TEXT'), ('sample_rate', 'INTEGER'), ('num_samples', 'INTEGER'), ('downsample', 'TEXT'), ('format', 'TEXT'), ('codec', 'TEXT'), ('start_utc', 'TEXT')]
The first column (id) is the file's unique ID. Every file added to the database is automatically assigned an ID, so this is not something we need to worry about. (SQLite uses integers as IDs, starting at 1.)
The second column (deployment_id) is the ID of the hydrophone deployment that the file belongs to. Note that this is a required field, meaning you will not be able to add an audio file without linking it to an (already existing) deployment in the database.
You can use a query like this to view the IDs and names of all the deployments in the database,
cursor = cursor.execute("SELECT id, name FROM deployment")
print(cursor.fetchall())
[(1, 'My First Hydrophone Deployment'), (2, 'Seaforth2020')]
You are also required to specify the audio filename (filename), the directory path (dir_path), and a descriptive name for where the data is physically stored (location). The last two required fields are the sampling rate (sample_rate) and the number of samples (num_samples). The remaining fields are all optional.
Let us try to add a single audio file to the database, linking it to the first deployment and providing only the minimum required metadata.
data = {
"deployment_id": 1,
"filename": "a-10-second-long-recording.wav",
"dir_path": "/home/mrmoose/acoustic-data/",
"location": "personal laptop",
"sample_rate": 32000, #sampling rate in samples/s
"num_samples": 320000 #file size, no. samples
}
cursor = kdb.insert_row(conn, table_name="file", values=data)
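As an aside, the ID that was automatically assigned to the new row can be read back from the returned cursor via the standard sqlite3 lastrowid attribute (this assumes insert_row executes the INSERT statement on the cursor it returns):
print(cursor.lastrowid)  # ID of the file just inserted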
This works well for a single audio file, but if your dataset consists of many files, entering the metadata manually quickly becomes tedious. In such cases, you may want to automate the metadata collection. In particular, you may want to automatically
- generate a list of all the audio files in a given directory
- parse timestamps from the filenames
- inspect the files to determine their sampling rate and size
Fortunately, Korus has a function that does exactly this. Let's see how it works.
# define a function for parsing the timestamps embedded in the audio filenames
# this function should take the filename as input (in the form of a string) and
# return the UTC start time of the file (in the form of a datetime object)
def timestamp_parser(x):
fmt = "%Y%m%dT%H%M%S.%f"
p = x.rfind(".")
s = x[p-20: p-1]
return datetime.strptime(s, fmt)
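# quick sanity check (illustrative): a typical filename parses to its UTC start time, e.g.,
# timestamp_parser("audio_20130623T080000.116Z.flac")
#   -> datetime.datetime(2013, 6, 23, 8, 0, 0, 116000)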
# now let's collect metadata on all the FLAC files in the ./data folder (including subfolders)
df = collect_audiofile_metadata("./data", "FLAC", timestamp_parser)
# view the contents of the returned dataframe
print(df.to_string())
  format               start_utc  num_samples  sample_rate                         filename  dir_path
0   FLAC 2013-06-23 08:00:00.116      1800055         1000  audio_20130623T080000.116Z.flac  20130623
1   FLAC 2013-06-23 08:30:00.117      1800055         1000  audio_20130623T083000.117Z.flac  20130623
2   FLAC 2013-06-24 08:00:00.118      1800055         1000  audio_20130624T080000.118Z.flac  20130624
The collect_audiofile_metadata function returns the metadata in the form of a pandas DataFrame. Having reviewed the metadata, we are now ready to submit the data to the database. We do this one entry at a time,
for _,row in df.iterrows(): #loop over all rows in the dataframe
data = row.to_dict() #convert pd.Series to dict
# additional required metadata
data["deployment_id"] = 2
data["location"] = "personal laptop"
# insert in the 'file' table
cursor = kdb.insert_row(conn, table_name="file", values=data)
Let's check that things look alright by querying for the deployment ID, filename, and UTC start times of all the audio files in the database,
cursor = cursor.execute("SELECT deployment_id, filename, start_utc FROM file")
print(cursor.fetchall())
[(1, 'a-10-second-long-recording.wav', None), (2, 'audio_20130623T080000.116Z.flac', '2013-06-23 08:00:00.116'), (2, 'audio_20130623T083000.117Z.flac', '2013-06-23 08:30:00.117'), (2, 'audio_20130624T080000.118Z.flac', '2013-06-24 08:00:00.118')]
Happy with the changes made, we commit them to the database and close the connection.
conn.commit()
conn.close()
Adding a taxonomy
As always, we begin by reconnecting to the database.
conn = sqlite3.connect(path_db)
Next, we import the annotation taxonomy that we created in the 1st tutorial,
kdb.import_taxonomy(conn, src="tax_t1.sqlite", name="my-first-taxonomy")
<sqlite3.Cursor at 0x7f97a760df80>
To verify that the taxonomy was imported into our database, we form the following SQLite query,
cursor = conn.execute("SELECT id,name,version FROM taxonomy")
print(cursor.fetchall())
[(1, 'my-first-taxonomy', '1'), (2, 'my-first-taxonomy', '2')]
Let us retrieve the 1st version of the taxonomy to inspect its node structure,
tax = kdb.get_taxonomy(conn, taxonomy_id=1)
tax.show(append_name=True) #sound sources
Unknown
├── Anthro [Anthropogenic]
│   └── Boat
│       ├── Engine
│       └── Prop [Propeller]
└── Bio [Biological]
    └── Whale
        ├── HW [Humpback whale]
        └── NARW [North Atlantic right whale]
tax.sound_types("NARW").show(append_name=True) #sound types for NARW
Unknown
├── GS [Gun shot]
└── TC [Tonal call]
    ├── FU [Faint Upcall]
    └── LU [Loud Upcall]
Finally, we remember to commit the changes and close the database connection.
conn.commit()
conn.close()
Adding an annotation job
As always, we begin by reconnecting to the database,
conn = sqlite3.connect(path_db)
The Korus schema can accommodate annotation jobs performed by humans as well as machines/algorithms. In this tutorial, however, we shall limit ourselves to human annotation jobs.
Let us inspect all the columns of the job table,
cursor = conn.execute("SELECT name,type FROM PRAGMA_TABLE_INFO('job')")
columns = cursor.fetchall()
print(columns)
[('id', 'INTEGER'), ('taxonomy_id', 'INTEGER'), ('model_id', 'INTEGER'), ('annotator', 'TEXT'), ('primary_sound', 'JSON'), ('background_sound', 'JSON'), ('is_exhaustive', 'INTEGER'), ('model_config', 'JSON'), ('start_utc', 'TEXT'), ('end_utc', 'TEXT'), ('by_human', 'INTEGER'), ('by_machine', 'INTEGER'), ('comments', 'TEXT')]
For human-made annotations, you are advised to fill in the following fields,
- taxonomy_id: The identifier of the taxonomy version used to annotate sounds
- primary_sound: Sounds (source, type) that were systematically annotated
- background_sound: Sounds that were annotated only opportunistically
- is_exhaustive: Whether all occurrences of primary sounds were annotated
- annotator: The name or initials of the person who made the annotations
- start_utc, end_utc: Time period during which the annotation job was performed
To be more concrete, let us assume that the acoustic analyst was tasked with annotating every occurrence of the (faint and loud) upcalls and gunshots made by the North Atlantic right whale, while only annotating anthropogenic sounds opportunistically. You would then specify
taxonomy_id = 1
primary_sound = [("NARW","FU"), ("NARW","LU"), ("NARW","GS")]
background_sound = ("Anthro","%") # % can be used as wildcard
is_exhaustive = True
Let us now collect all the metadata in a dictionary and submit the entry to the database as follows,
# collect all the data in a dict
data = {
"taxonomy_id": taxonomy_id,
"annotator": "AB",
"primary_sound": primary_sound,
"background_sound": background_sound,
"is_exhaustive": is_exhaustive,
"start_utc": "2022-10",
"end_utc": "2023-03",
"comments": "Vessel noise annotated opportunistically"
}
# insert into the database
kdb.insert_job(conn, values=data)
<sqlite3.Cursor at 0x7f97a75da730>
Before committing the changes and closing the connection to the database, let us add another annotation job. For this second job, we shall assume that the analyst was using version 2 of the taxonomy and only annotated particularly clear or nice examples of the NARW upcall.
# collect all the data in a dict
data = {
"taxonomy_id": 2,
"annotator": "CD",
"primary_sound": ("NARW","Upcall"),
"is_exhaustive": False,
"start_utc": "2023-04-01",
"end_utc": "2023-04-07",
"comments": "Only annotated clear and nice upcalls"
}
# insert into the database
kdb.insert_job(conn, values=data)
<sqlite3.Cursor at 0x7f97a75da8f0>
conn.commit()
conn.close()
Linking files to jobs
Having specified who made the annotations, when they were made, and (most importantly) how they were made, there is still one important piece of information missing, namely, which audio files were annotated.
To retrieve the audio file IDs, we begin by querying the database for the relevant hydrophone deployment. Specifically, we search for deployments where NS or Nova Scotia appears in the location name and with data collected between 2019 and 2021.
# re-establish the connection to the database
conn = sqlite3.connect(path_db)
# Form SQLite query to search the database for deployments based on location name and time period
# (parentheses are needed around the OR clause, since AND binds more tightly than OR in SQL)
query = """
SELECT
id
FROM
deployment
WHERE
(location LIKE '%NS%'
OR location LIKE '%Nova Scotia%')
AND start_utc >= '2019'
AND end_utc <= '2021'
"""
c = conn.cursor()
rows = c.execute(query).fetchall()
print(rows)
[(2,)]
Our query returned one deployment matching the search criteria, with deployment ID = 2. Let us now retrieve the filenames and IDs of all the audio files associated with this deployment.
deployment_id = rows[0][0]
query = f"""
SELECT
filename,id
FROM
file
WHERE
deployment_id = '{deployment_id}'
"""
rows = c.execute(query).fetchall()
print(rows)
[('audio_20130623T080000.116Z.flac', 2), ('audio_20130623T083000.117Z.flac', 3), ('audio_20130624T080000.118Z.flac', 4)]
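As an aside, sqlite3 also supports parameterized queries, which avoid string interpolation and the associated risk of SQL injection. The following sketch is equivalent to the query above:
# pass the deployment ID as a parameter instead of interpolating it into the query string
rows = c.execute("SELECT filename, id FROM file WHERE deployment_id = ?", (deployment_id,)).fetchall()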
Finally, let us link these three audio files to the first annotation job, i.e., the job performed by annotator AB in the previous section. Note that we only need the file IDs for this; the filenames were included in the query merely for illustration.
# link files to job, one file at a time
for row in rows:
file_id = row[1]
data = {
"job_id": 1,
"file_id": file_id,
"channel": 0 #recording channel (0,1,2,...)
}
kdb.insert_row(conn, table_name="file_job_relation", values=data)
Before we conclude this section, let us also link an audio file to the second annotation job,
data = {
"job_id": 2,
"file_id": 1,
"channel": 0
}
kdb.insert_row(conn, table_name="file_job_relation", values=data)
<sqlite3.Cursor at 0x7f97a75da9d0>
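Before committing, it is worth sanity-checking the links with a quick query. (The expected output below is an assumption, based on the file and job IDs assigned earlier.)
print(conn.execute("SELECT job_id, file_id FROM file_job_relation").fetchall())
# expected (assumption): [(1, 2), (1, 3), (1, 4), (2, 1)]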
# commit the changes and close the connection to the database
conn.commit()
conn.close()
Adding annotations
We are now finally at the point where we can add annotations to the database. Although Korus contains functions for automatically ingesting annotation tables from CSV files or Excel spreadsheets, we shall first try to add annotations manually, as this will give us a better appreciation for the steps involved.
It is recommended to use the function add_annotations for adding annotations to the database. This function accepts a dictionary or a pandas DataFrame as its input, with a specific set of keys or columns, as detailed in the Korus API Documentation.
In this example, we are given a CSV file with two columns, filename and call_time, specifying the names of the audio files and the within-file offsets (in seconds) of the acoustic signals identified by the acoustic analyst.
We shall further assume that the analyst was using version 1 of the annotation taxonomy, and that all the entries in the CSV file are instances of the 'Loud Upcall' (LU) attributed to North Atlantic right whales (NARW).
df = pd.read_csv("data/annotations.csv") #load data from CSV file into a Pandas DataFrame
print(df.to_string())
                           filename  call_time
0   audio_20130623T080000.116Z.flac   1128.840
1   audio_20130623T080000.116Z.flac   1153.526
2   audio_20130623T080000.116Z.flac   1196.778
3   audio_20130623T080000.116Z.flac   1227.642
4   audio_20130623T080000.116Z.flac   1358.181
5   audio_20130623T080000.116Z.flac   1437.482
6   audio_20130623T080000.116Z.flac   1489.288
7   audio_20130623T080000.116Z.flac   1511.670
8   audio_20130623T080000.116Z.flac   1530.595
9   audio_20130623T080000.116Z.flac   1536.580
10  audio_20130623T080000.116Z.flac   1714.372
11  audio_20130623T080000.116Z.flac   1768.251
12  audio_20130623T080000.116Z.flac   1777.835
13  audio_20130623T083000.117Z.flac     68.149
14  audio_20130623T083000.117Z.flac    688.507
15  audio_20130623T083000.117Z.flac    755.940
16  audio_20130623T083000.117Z.flac    770.440
17  audio_20130624T080000.118Z.flac     68.853
18  audio_20130624T080000.118Z.flac    105.927
19  audio_20130624T080000.118Z.flac   1057.015
20  audio_20130624T080000.118Z.flac   1067.282
21  audio_20130624T080000.118Z.flac   1290.563
22  audio_20130624T080000.118Z.flac   1378.955
23  audio_20130624T080000.118Z.flac   1428.648
24  audio_20130624T080000.118Z.flac   1663.622
25  audio_20130624T080000.118Z.flac   1676.682
In order to submit these annotations to the database, we will need to look up the file identifiers. We will also need to assign each annotation a start time and a duration, in place of the call_time, which corresponds roughly to the midpoint of the call. Since the duration is unknown, we will use a fixed-size window of 3.0 seconds, fully sufficient to capture an entire upcall, which is typically only about 1 second long.
Let's begin by looking up the file identifiers and adding them to the data frame.
# list the unique filenames that appear in the annotation table
filenames = df["filename"].unique().tolist()
print(filenames)
['audio_20130623T080000.116Z.flac', 'audio_20130623T083000.117Z.flac', 'audio_20130624T080000.118Z.flac']
# re-establish the connection to the database
conn = sqlite3.connect(path_db)
c = conn.cursor() # get cursor
# query the 'file' table for files with the correct filename and retrieve their ID
file_ids = dict()
for filename in filenames:
query = f"SELECT id FROM file WHERE filename = '{filename}'"
file_id = c.execute(query).fetchall()[0][0]
file_ids[filename] = file_id # create a dictionary mapping: filename -> file_id
print(file_ids)
{'audio_20130623T080000.116Z.flac': 2, 'audio_20130623T083000.117Z.flac': 3, 'audio_20130624T080000.118Z.flac': 4}
# finally, let's add a column to our data frame with the file IDs
df["file_id"] = df["filename"].apply(lambda x: file_ids[x])
print(df)
                           filename  call_time  file_id
0   audio_20130623T080000.116Z.flac   1128.840        2
1   audio_20130623T080000.116Z.flac   1153.526        2
2   audio_20130623T080000.116Z.flac   1196.778        2
3   audio_20130623T080000.116Z.flac   1227.642        2
4   audio_20130623T080000.116Z.flac   1358.181        2
5   audio_20130623T080000.116Z.flac   1437.482        2
6   audio_20130623T080000.116Z.flac   1489.288        2
7   audio_20130623T080000.116Z.flac   1511.670        2
8   audio_20130623T080000.116Z.flac   1530.595        2
9   audio_20130623T080000.116Z.flac   1536.580        2
10  audio_20130623T080000.116Z.flac   1714.372        2
11  audio_20130623T080000.116Z.flac   1768.251        2
12  audio_20130623T080000.116Z.flac   1777.835        2
13  audio_20130623T083000.117Z.flac     68.149        3
14  audio_20130623T083000.117Z.flac    688.507        3
15  audio_20130623T083000.117Z.flac    755.940        3
16  audio_20130623T083000.117Z.flac    770.440        3
17  audio_20130624T080000.118Z.flac     68.853        4
18  audio_20130624T080000.118Z.flac    105.927        4
19  audio_20130624T080000.118Z.flac   1057.015        4
20  audio_20130624T080000.118Z.flac   1067.282        4
21  audio_20130624T080000.118Z.flac   1290.563        4
22  audio_20130624T080000.118Z.flac   1378.955        4
23  audio_20130624T080000.118Z.flac   1428.648        4
24  audio_20130624T080000.118Z.flac   1663.622        4
25  audio_20130624T080000.118Z.flac   1676.682        4
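As an aside, the same mapping could be built with a single query and a pandas merge, which scales better when the annotation table references many files. A minimal sketch, writing the result to a separate data frame so as not to clobber the one above:
# fetch all filename -> id pairs in one query, then join them onto the annotation table
ids = pd.read_sql("SELECT filename, id AS file_id FROM file", conn)
df_alt = df.drop(columns=["file_id"]).merge(ids, on="filename", how="left")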
Now, let's assign start times and durations for every annotation:
# compute the annotation start time in milliseconds
df["start_ms"] = (df["call_time"] - 1.5) * 1e3
# set the duration to 3.0 seconds
df["duration_ms"] = 3000
# cast both as integers (the type used for storing these quantities in the database)
df["start_ms"] = df["start_ms"].astype("int")
df["duration_ms"] = df["duration_ms"].astype("int")
# drop the - now obsolete - filename and call_time columns
df = df.drop(columns=["filename","call_time"])
# specify the 'granularity' of each annotation
df["granularity"] = "window"
print(df)
    file_id  start_ms  duration_ms granularity
0         2   1127340         3000      window
1         2   1152026         3000      window
2         2   1195278         3000      window
3         2   1226142         3000      window
4         2   1356681         3000      window
5         2   1435982         3000      window
6         2   1487788         3000      window
7         2   1510170         3000      window
8         2   1529095         3000      window
9         2   1535080         3000      window
10        2   1712872         3000      window
11        2   1766751         3000      window
12        2   1776335         3000      window
13        3     66649         3000      window
14        3    687007         3000      window
15        3    754440         3000      window
16        3    768940         3000      window
17        4     67353         3000      window
18        4    104427         3000      window
19        4   1055515         3000      window
20        4   1065782         3000      window
21        4   1289063         3000      window
22        4   1377455         3000      window
23        4   1427148         3000      window
24        4   1662122         3000      window
25        4   1675182         3000      window
Finally, we must add columns with the appropriate labels for the sound source and sound type,
df["sound_source"] = "NARW"
df["sound_type"] = "LU"
We are now ready to submit the annotations to the database.
annot_ids = kdb.add_annotations(conn, annot_tbl=df, job_id=1)
Note that add_annotations returns the indices of the annotations just added.
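To double-check, we can count the annotations now linked to the job. (This assumes the annotation table references jobs through a job_id column; consult the Korus schema if in doubt.)
n = conn.execute("SELECT COUNT(*) FROM annotation WHERE job_id = 1").fetchone()[0]
print(n)  # expect 26, one per row in the annotation table above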
Another useful feature of Korus is the add_negatives function, which automatically generates annotations for the 'quiet' periods, i.e., periods during which no 'primary sounds' were heard.
neg_ids = kdb.add_negatives(conn, job_id=1)
Satisfied that the annotations have been inserted into our database, we commit the changes and close the connection,
conn.commit()
conn.close()
Customization
It is not recommended to remove existing columns or tables from the Korus database, as this may compromise core functionality of the Korus API. However, you are more than welcome to extend the database schema. To do so, you will have to interact directly with the database using SQLite syntax. For example, adding a column to an existing table is as easy as,
# open a connection to the database
conn = sqlite3.connect(path_db)
# add a column named 'hydrophone_color' to the deployment table with default value 'yellow'
conn.execute("ALTER TABLE deployment ADD COLUMN hydrophone_color TEXT DEFAULT 'yellow'")
# commit changes and close
conn.commit()
conn.close()
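The new column can then be read and updated like any other. For example, assuming the deployments added earlier in this tutorial,
# open a connection, recolor one hydrophone, and inspect the result
conn = sqlite3.connect(path_db)
conn.execute("UPDATE deployment SET hydrophone_color = 'red' WHERE name = 'Seaforth2020'")
print(conn.execute("SELECT name, hydrophone_color FROM deployment").fetchall())
conn.commit()
conn.close()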