Tutorial 3: Adding data to the database
Korus stores acoustic annotations and metadata in a number of cross-referenced tables. The main tables are,
- deployment: metadata about hydrophone deployments
- file: metadata pertaining to audio files
- taxonomy: taxonomies used for annotating the acoustic data
- job: metadata about annotation jobs
- annotation: acoustic annotations
In this tutorial you will learn how to create a Korus database and add data to it.
Getting ready
We begin by importing the necessary modules, classes, functions, etc.
import os
import shutil
import json
import sqlite3
import pandas as pd
from datetime import datetime, timezone
from korus.database import SQLiteDatabase
from korus.audio import collect_audiofile_metadata
Connecting to the database
We begin by connecting to the database from Tutorial 1, where we created an annotation taxonomy.
# make a copy of the database from Tutorial 1
# (if the database file already exists, delete it first)
db_path = "mydb.sqlite"
if os.path.exists(db_path):
os.remove(db_path)
shutil.copyfile("tax_t1.sqlite", db_path)
# connect to the database
db = SQLiteDatabase(db_path)
Adding a hydrophone deployment
Having created the database, the next step is usually to add a hydrophone deployment. The Korus schema allows you to enter various relevant metadata pertaining to a hydrophone deployment. For example, the time period during which the hydrophone was operating, the geographical location (latitude and longitude), and the deployment depth, to name but a few.
To view all the metadata fields, you can do
print(db.deployment.info())
Name: deployment

Fields:
Name           Type      Description                                                     Required    Default Value    Allowed Values
-------------  --------  --------------------------------------------------------------  ----------  ---------------  ----------------
name           str       Deployment name                                                 Y           None
start_utc      datetime  Start time of deployment                                        N           None
end_utc        datetime  End time of deployment                                          N           None
latitude_deg   float     Deployment latitude in degrees                                  N           None
longitude_deg  float     Deployment longitude in degrees                                 N           None
depth_m        float     Deployment depth in meters                                      N           None
trajectory     list      Sequence of timestamped coordinates (timestamp,lat,lon,depth)  N           None
This lists the names and the types of all the columns in the deployment table.
While you should always try to enter data for as many metadata fields as possible, the majority of the fields are optional. In fact, the only metadata field required for submitting a hydrophone deployment to the database is the name field.
Thus, we can easily add a deployment to the database, providing only a descriptive name,
deployment_id = db.deployment.add({"name": "My First Hydrophone Deployment"})
print(deployment_id) #id assigned to the entry just added
0
Now, let's add a more comprehensive set of metadata to the deployment table.
# collect the data in a dictionary
data = {
"name": "Seaforth2020", # descriptive name for the deployment
"start_utc": datetime(2020, 6, 1, tzinfo=timezone.utc), # UTC start time of recording
"end_utc": datetime(2020, 10, 31, tzinfo=timezone.utc), # UTC end time of recording
"latitude_deg": 44.660784, # latitude, in degrees
"longitude_deg": -63.253176, # longitude, in degrees
"depth_m": 8.0, # depth, in meters
}
# submit it to the database
deployment_id = db.deployment.add(data)
Now, let's query the table to check that the data was correctly inserted. Here we obtain the names, UTC start times, and locations of all the entries in the deployment table.
df = db.deployment.get(fields=["name","start_utc","latitude_deg","longitude_deg"], as_pandas=True)
print(df.to_string())
                             name                 start_utc  latitude_deg  longitude_deg
0  My First Hydrophone Deployment                       NaT           NaN            NaN
1                    Seaforth2020 2020-06-01 00:00:00+00:00     44.660784     -63.253176
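As an aside, Korus tables can also be filtered. We will make proper use of this later in the tutorial, but as a quick illustration, here is a sketch of how one might look up the ID of a deployment by its name (this assumes that filter conditions can match directly on the name field, which we have not verified),
# look up the deployment ID by name (a sketch)
ids = db.deployment.reset_filter().filter({"name": "Seaforth2020"}).indices
print(ids) # expected: [1]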
Before proceeding, let's close the connection to the database.
db.backend.close()
Adding audio files
Having successfully added two hydrophone deployments, let us now add some audio files to the database. Note that we are only adding information about the audio files (metadata), not the acoustic data contained within the files. We begin by reconnecting to the database.
db = SQLiteDatabase(db_path)
Korus stores audio file metadata in a table named file. Let us inspect its columns,
print(db.file.info())
Name: file

Fields:
Name           Type      Description                                                                                        Required    Default Value    Allowed Values
-------------  --------  -------------------------------------------------------------------------------------------------  ----------  ---------------  ----------------
deployment_id  int       Deployment index                                                                                   Y           None
storage_id     int       Storage index                                                                                      Y           None
filename       str       Filename                                                                                           Y           None
relative_path  str       Directory path                                                                                     Y
sample_rate    int       Sampling rate in Hz                                                                                Y           None
num_samples    int       Number of samples                                                                                  Y           None
start_utc      datetime  Start time of recording (UTC)                                                                      N           None
end_utc        datetime  End time of recording (UTC) [inferred from the start time, number of samples, and sampling rate]   N           None
The first and second columns (deployment_id, storage_id) hold the ID of the hydrophone deployment that produced the audio file and the ID of the data-storage location, respectively. Note that these are required fields, meaning you will not be able to add an audio file without linking it to an (already existing) deployment and data-storage location in the database.
You can use a query like this to view the IDs and names of all the deployments in the database,
df = db.deployment.get(fields=["name"], return_indices=True, as_pandas=True)
print(df.to_string())
                              name
id
0   My First Hydrophone Deployment
1                     Seaforth2020
You are also required to specify the audio filename (filename) and the relative directory path (relative_path). The last two required fields are the sampling rate (sample_rate) and the number of samples (num_samples). The remaining fields are all optional.
Let us try to add a single audio file to the database, linking it to the first deployment and providing only the minimum required metadata.
# first, add the data-storage location
data = {
"name": "My personal laptop",
"path": "/home/mrmoose/acoustic-data/",
}
storage_id = db.storage.add(data)
# now, add a single audio file
data = {
"deployment_id": 0,
"storage_id": storage_id,
"filename": "a-10-second-long-recording.wav",
"relative_path": "20200602",
"sample_rate": 32000, #sampling rate in samples/s
"num_samples": 320000 #file size, no. samples
}
file_id = db.file.add(data)
print(file_id)
0
This works well for a single audio file, but if your dataset consists of many files, entering the metadata for each file manually quickly becomes tedious. In such cases, you may want to automate the metadata collection. In particular, you may want to automatically
- generate a list of all the audio files in a given directory
- parse timestamps from the filenames
- inspect the files to determine their sampling rate and size
Fortunately, Korus has a function that does exactly this. Let's see how it works.
# define a function for parsing the timestamps embedded in the audio filenames
# this function should take the filename as input (in the form of a string) and
# return the UTC start time of the file (in the form of a datetime object)
def timestamp_parser(x):
fmt = "%Y%m%dT%H%M%S.%f"
p = x.rfind(".")
s = x[p-20: p-1]
return datetime.strptime(s, fmt)
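# before scanning a whole directory, it can be useful to sanity-check the parser
# on a single filename; here we try it on one of the filenames that appears below
# (expected output: 2013-06-23 08:00:00.116000)
print(timestamp_parser("audio_20130623T080000.116Z.flac"))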
# now let's collect metadata on all the FLAC files in the ./data folder (including subfolders)
df = collect_audiofile_metadata("./data", "FLAC", timestamp_parser)
# view the contents of the returned dataframe
print(df.to_string())
  format                start_utc  num_samples  sample_rate                  end_utc                         filename relative_path
0   FLAC  2013-06-23 08:00:00.116      1800055         1000  2013-06-23 08:30:00.171  audio_20130623T080000.116Z.flac      20130623
1   FLAC  2013-06-23 08:30:00.117      1800055         1000  2013-06-23 09:00:00.172  audio_20130623T083000.117Z.flac      20130623
2   FLAC  2013-06-24 08:00:00.118      1800055         1000  2013-06-24 08:30:00.173  audio_20130624T080000.118Z.flac      20130624
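Note how the end_utc column was inferred from the other metadata: for example, 1800055 samples at 1000 samples/s amounts to 1800.055 s of audio, so the first file, starting at 08:00:00.116, ends at 08:30:00.171.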
The collect_audiofile_metadata function returns the metadata in the form of a pandas DataFrame. Having reviewed the metadata, we are now ready to submit it to the database. We do this one entry at a time,
for _,row in df.iterrows(): #loop over all rows in the dataframe
data = row.to_dict() #convert pd.Series to dict
# additional required metadata
data["storage_id"] = 0
data["deployment_id"] = 1
# drop 'format'
data.pop("format")
# insert in the 'file' table
db.file.add(data)
Let's check that things look alright by querying for the deployment ID, filename, and UTC start times of all the audio files in the database,
df = db.file.get(fields=["deployment_id", "filename", "start_utc"], as_pandas=True)
print(df.to_string())
   deployment_id                         filename                        start_utc
0              0   a-10-second-long-recording.wav                              NaT
1              1  audio_20130623T080000.116Z.flac 2013-06-23 08:00:00.116000+00:00
2              1  audio_20130623T083000.117Z.flac 2013-06-23 08:30:00.117000+00:00
3              1  audio_20130624T080000.118Z.flac 2013-06-24 08:00:00.118000+00:00
Happy with the changes made, we close the connection.
db.backend.close()
Adding an annotation job
As always, we begin by reconnecting to the database,
db = SQLiteDatabase(db_path)
Let us inspect all the columns of the job table,
print(db.job.info())
Name: job

Fields:
Name             Type      Description                                 Required    Default Value    Allowed Values
---------------  --------  ------------------------------------------  ----------  ---------------  ----------------
taxonomy_id      int       Taxonomy index                              N           None
annotator        str       Name of person who annotated the data       N           None
is_exhaustive    bool      Whether all primary sounds were annotated   N           None
target           list      Systematically annotated sounds             N           None
completion_date  datetime  Completion date of annotation effort       N           None
Before adding our first annotation job, let us find out how many versions of the taxonomy have been released,
df = db.taxonomy.get(fields=["version","comment"], return_indices=True, as_pandas=True)
print(df.to_string())
    version                                                                 comment
id
0       NaN                               added some sound sources and sound types
1       1.0                   this is the first version of the annotation taxonomy
2       2.0  merged FU and LU sound types for NARW into a single Upcall sound type
The first entry (id=0) is the current draft (which does not have a version number), while the second and third entries (id=1,2) are the first and second release, respectively. Only released versions of the taxonomy should be used for annotation jobs, as the draft may still change.
To be more concrete, let us assume that the acoustic analyst was tasked with annotating every occurrence of the (faint and loud) upcalls and gunshots made by the North Atlantic right whale, while only annotating anthropogenic sounds opportunistically. You would then specify
taxonomy_id = 1
target = [("NARW","FU"), ("NARW","LU"), ("NARW","GS")]
is_exhaustive = True
Let us now collect all the metadata in a dictionary and submit the entry to the database as follows,
# collect all the data in a dict
data = {
"taxonomy_id": taxonomy_id,
"annotator": "AB",
"target": target,
"is_exhaustive": is_exhaustive,
"completion_date": datetime(2023,3,1,tzinfo=timezone.utc),
}
# insert into the database
job_id = db.job.add(data)
print(job_id)
0
Before committing the changes and closing the connection to the database, let us add another annotation job. For this second job, we shall assume that the analyst was using version 2 of the taxonomy and only annotated particularly clear or nice examples of the NARW upcall. (Note that we don't need to specify the "target" sounds for non-exhaustive jobs.)
# collect all the data in a dict
data = {
"taxonomy_id": 2,
"annotator": "CD",
"is_exhaustive": False,
"completion_date": datetime(2023,4,7,tzinfo=timezone.utc),
}
# insert into the database
job_id = db.job.add(data)
print(job_id)
1
db.backend.close()
Linking files to jobs
We have now specified who made the annotations, when they were made, and (most importantly) how they were made. However, one important piece of information is still missing, namely, which audio files were annotated.
To retrieve the audio file IDs, we begin by querying the database for the relevant hydrophone deployment. Specifically, we search for deployments starting between 2019 and 2022,
# re-establish the connection to the database
db = SQLiteDatabase(db_path)
# search the database for deployments based on time period
cond = {"start_utc": (datetime(2019,1,1), datetime(2022,1,1))}
deployment_ids = db.deployment.reset_filter().filter(cond).indices
print(deployment_ids)
[1]
Our query returned one deployment matching the search criteria, with deployment ID = 1. Let us now retrieve the IDs of all the audio files associated with this deployment.
cond = {"deployment_id": 1}
file_ids = db.file.reset_filter().filter(cond).indices
print(file_ids)
[1, 2, 3]
Finally, let us link these three audio files to the first annotation job (id=0), i.e., the job performed by annotator AB in the previous section. Note that the file IDs are all we need for this.
# link files to job, one file at the time
for file_id in file_ids:
db.job.add_file(0, file_id)
Before we conclude this section, let us also link an audio file to the second annotation job,
db.job.add_file(1, 0)
Finally, we remember to close the connection to the database.
db.backend.close()
Adding annotations
We are now finally at the point where we can add annotations to the database. Although Korus contains functions for automatically ingesting annotation tables from CSV files or Excel spreadsheets, we shall first try to add annotations manually, as this will give us a better appreciation for the steps involved.
Let's take a look at the fields of the annotation table,
print(db.annotation.info())
Name: annotation

Fields:
Name                Type      Description                                                 Required    Default Value    Allowed Values
------------------  --------  ----------------------------------------------------------  ----------  ---------------  ----------------
deployment_id       int       Deployment index                                            Y           None
job_id              int       Job index                                                   Y           None
file_id             int       File index                                                  N           None
label_id            int       Label index for confident classification                   N           None
tentative_label_id  int       Label index for tentative classification                   N           None
ambiguous_label_id  list      Label indices for ambiguous classification                 N           None
excluded_label_id   list      Label indices for excluded classes                          N           None
multiple_label_id   list      Label indices for multiple (batch) classification          N           None
tag_id              list      Tag indices                                                 N           None
granularity_id      int       Granularity index                                           Y           1
negative            bool      Automatically generated negative                            Y           False
num_files           int       Number of audio files                                       Y           1
file_id_list        list      File indices                                                N           None
start_utc           datetime  UTC start time                                              N           None
duration_ms         int       Duration in milliseconds                                    N           None
start_ms            int       Start time in milliseconds from the beginning of the file  N           None
freq_min_hz         int       Lower frequency bound in Hz                                 N           None
freq_max_hz         int       Upper frequency bound in Hz                                 N           None
channel             int       Hydrophone channel                                          Y           0
valid               bool      Validation status                                           Y           True
comments            str       Additional observations                                     N           None

Aliases:
Field               Alias            Type   Description
------------------  ---------------  -----  ----------------------------------------------------
start_ms            start            float  Start time in seconds from the beginning of the file
duration_ms         duration         float  Duration in seconds
tag_id              tag              list   Tag name
granularity_id      granularity      str    Granularity level
label_id            label            tuple  Specify label tuples in place of label IDs
tentative_label_id  tentative_label  tuple  Specify label tuples in place of label IDs
ambiguous_label_id  ambiguous_label  list   Specify label tuples in place of label IDs
excluded_label_id   excluded_label   list   Specify label tuples in place of label IDs
multiple_label_id   multiple_label   list   Specify label tuples in place of label IDs
Note how the annotation table has several "Aliases" which can be used in place of the standard "Fields" if desired.
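For example, with aliases an annotation can be specified in a more human-readable form. The dictionary below is purely illustrative (all values are made up), showing how a label tuple, a granularity name, and start/duration values in seconds can stand in for the corresponding IDs and millisecond fields,
# an illustrative annotation, specified using aliases (all values are made up)
annot = {
    "job_id": 0,
    "file_id": 1,
    "label": ("NARW", "LU"),  # alias for label_id: a (sound source, sound type) tuple
    "granularity": "window",  # alias for granularity_id: granularity level by name
    "start": 1128.84,         # alias for start_ms: seconds from the start of the file
    "duration": 3.0,          # alias for duration_ms: duration in seconds
}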
In this example, we are given a CSV file with the two columns filename and call_time, specifying the names of the audio files and the within-file offsets (in seconds) of the acoustic signals identified by the acoustic analyst.
We shall further assume that the analyst was using version 1 of the annotation taxonomy, and that all the entries in the CSV file are instances of the 'Loud Upcall' (LU) attributed to North Atlantic right whales (NARW).
df = pd.read_csv("data/annotations.csv") #load data from CSV file into a Pandas DataFrame
print(df.to_string())
                           filename  call_time
0   audio_20130623T080000.116Z.flac   1128.840
1   audio_20130623T080000.116Z.flac   1153.526
2   audio_20130623T080000.116Z.flac   1196.778
3   audio_20130623T080000.116Z.flac   1227.642
4   audio_20130623T080000.116Z.flac   1358.181
5   audio_20130623T080000.116Z.flac   1437.482
6   audio_20130623T080000.116Z.flac   1489.288
7   audio_20130623T080000.116Z.flac   1511.670
8   audio_20130623T080000.116Z.flac   1530.595
9   audio_20130623T080000.116Z.flac   1536.580
10  audio_20130623T080000.116Z.flac   1714.372
11  audio_20130623T080000.116Z.flac   1768.251
12  audio_20130623T080000.116Z.flac   1777.835
13  audio_20130623T083000.117Z.flac     68.149
14  audio_20130623T083000.117Z.flac    688.507
15  audio_20130623T083000.117Z.flac    755.940
16  audio_20130623T083000.117Z.flac    770.440
17  audio_20130624T080000.118Z.flac     68.853
18  audio_20130624T080000.118Z.flac    105.927
19  audio_20130624T080000.118Z.flac   1057.015
20  audio_20130624T080000.118Z.flac   1067.282
21  audio_20130624T080000.118Z.flac   1290.563
22  audio_20130624T080000.118Z.flac   1378.955
23  audio_20130624T080000.118Z.flac   1428.648
24  audio_20130624T080000.118Z.flac   1663.622
25  audio_20130624T080000.118Z.flac   1676.682
In order to submit these annotations to the database, we will need to look up the file identifiers. We will also need to assign each annotation a start time and a duration, in place of the call_time, which corresponds roughly to the midpoint of the call. Since the durations are unknown, we will use a fixed-size window of 3.0 seconds, amply sufficient to capture an entire upcall, which typically lasts only about 1 second.
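For example, the first call time of 1128.840 s will translate into start_ms = (1128.840 - 1.5) x 1000 = 1127340 and duration_ms = 3000.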
Let's begin by looking up the file identifiers and adding them to the data frame.
# list the unique filenames that appear in the annotation table
filenames = df["filename"].unique().tolist()
print(filenames)
['audio_20130623T080000.116Z.flac', 'audio_20130623T083000.117Z.flac', 'audio_20130624T080000.118Z.flac']
# re-establish the connection to the database
db = SQLiteDatabase(db_path)
# get the file IDs
data = db.file.get(fields="filename", return_indices=True)
file_ids = {filename: file_id for file_id,filename in data if filename in filenames}
print(file_ids)
{'audio_20130623T080000.116Z.flac': 1, 'audio_20130623T083000.117Z.flac': 2, 'audio_20130624T080000.118Z.flac': 3}
# finally, let's add a column to our data frame with the file IDs
df["file_id"] = df["filename"].apply(lambda x: file_ids[x])
print(df)
                           filename  call_time  file_id
0   audio_20130623T080000.116Z.flac   1128.840        1
1   audio_20130623T080000.116Z.flac   1153.526        1
2   audio_20130623T080000.116Z.flac   1196.778        1
3   audio_20130623T080000.116Z.flac   1227.642        1
4   audio_20130623T080000.116Z.flac   1358.181        1
5   audio_20130623T080000.116Z.flac   1437.482        1
6   audio_20130623T080000.116Z.flac   1489.288        1
7   audio_20130623T080000.116Z.flac   1511.670        1
8   audio_20130623T080000.116Z.flac   1530.595        1
9   audio_20130623T080000.116Z.flac   1536.580        1
10  audio_20130623T080000.116Z.flac   1714.372        1
11  audio_20130623T080000.116Z.flac   1768.251        1
12  audio_20130623T080000.116Z.flac   1777.835        1
13  audio_20130623T083000.117Z.flac     68.149        2
14  audio_20130623T083000.117Z.flac    688.507        2
15  audio_20130623T083000.117Z.flac    755.940        2
16  audio_20130623T083000.117Z.flac    770.440        2
17  audio_20130624T080000.118Z.flac     68.853        3
18  audio_20130624T080000.118Z.flac    105.927        3
19  audio_20130624T080000.118Z.flac   1057.015        3
20  audio_20130624T080000.118Z.flac   1067.282        3
21  audio_20130624T080000.118Z.flac   1290.563        3
22  audio_20130624T080000.118Z.flac   1378.955        3
23  audio_20130624T080000.118Z.flac   1428.648        3
24  audio_20130624T080000.118Z.flac   1663.622        3
25  audio_20130624T080000.118Z.flac   1676.682        3
Now, let's assign start times and durations for every annotation:
# compute the annotation start time in milliseconds
df["start_ms"] = (df["call_time"] - 1.5) * 1e3
# set the duration to 3.0 seconds
df["duration_ms"] = 3000
# cast both as integers (the type used for storing these quantities in the database)
df["start_ms"] = df["start_ms"].astype("int")
df["duration_ms"] = df["duration_ms"].astype("int")
# drop the - now obsolete - filename and call_time columns
df = df.drop(columns=["filename","call_time"])
# specify the 'granularity' of each annotation
df["granularity"] = "window"
# specify the job id
df["job_id"] = 0
print(df)
    file_id  start_ms  duration_ms granularity  job_id
0         1   1127340         3000      window       0
1         1   1152026         3000      window       0
2         1   1195278         3000      window       0
3         1   1226142         3000      window       0
4         1   1356681         3000      window       0
5         1   1435982         3000      window       0
6         1   1487788         3000      window       0
7         1   1510170         3000      window       0
8         1   1529095         3000      window       0
9         1   1535080         3000      window       0
10        1   1712872         3000      window       0
11        1   1766751         3000      window       0
12        1   1776335         3000      window       0
13        2     66649         3000      window       0
14        2    687007         3000      window       0
15        2    754440         3000      window       0
16        2    768940         3000      window       0
17        3     67353         3000      window       0
18        3    104427         3000      window       0
19        3   1055515         3000      window       0
20        3   1065782         3000      window       0
21        3   1289063         3000      window       0
22        3   1377455         3000      window       0
23        3   1427148         3000      window       0
24        3   1662122         3000      window       0
25        3   1675182         3000      window       0
Finally, we must add a column with the appropriate annotation label. Here we use the label alias, which accepts (sound source, sound type) tuples in place of numeric label IDs,
df["label"] = [("NARW","LU") for _ in range(len(df))]
We are now ready to submit the annotations to the database.
annot_ids = db.annotation.add_batch(df)
print(annot_ids)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Another useful feature of Korus is the generate_negatives function, which automatically generates annotations for the 'quiet' periods, i.e., periods during which no 'primary sounds' were heard.
neg_ids = db.annotation.generate_negatives(job_id=0)
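To inspect the result, we can query the annotation table in the same way as the other tables. The sketch below assumes that the negative field can be included in the query like any other field,
# retrieve the automatically generated negatives (a sketch)
df_neg = db.annotation.get(fields=["file_id", "start_ms", "duration_ms", "negative"], as_pandas=True)
print(df_neg[df_neg["negative"]].to_string())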
Content that the annotations have been inserted into our database, we commit the changes and close the connection,
db.backend.close()
Customization
It is straightforward to add custom fields to the tables in the Korus database. For example,
db = SQLiteDatabase(db_path)
db.deployment.add_field(
name = "hydrophone_color",
type = str,
description = "Hydrophone color",
required = False,
)
print(db.deployment.info())
db.backend.close()
Name: deployment

Fields:
Name              Type      Description                                                     Required    Default Value    Allowed Values
----------------  --------  --------------------------------------------------------------  ----------  ---------------  ----------------
name              str       Deployment name                                                 Y           None
start_utc         datetime  Start time of deployment                                        N           None
end_utc           datetime  End time of deployment                                          N           None
latitude_deg      float     Deployment latitude in degrees                                  N           None
longitude_deg     float     Deployment longitude in degrees                                 N           None
depth_m           float     Deployment depth in meters                                      N           None
trajectory        list      Sequence of timestamped coordinates (timestamp,lat,lon,depth)  N           None
hydrophone_color  str       Hydrophone color                                                N           None
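Once added, the custom field can be populated like any other. For example (a sketch, with a made-up deployment),
# add a deployment that makes use of the new custom field
db = SQLiteDatabase(db_path)
db.deployment.add({"name": "My Colorful Deployment", "hydrophone_color": "yellow"})
db.backend.close()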