Data model and storage
The HMC FAIR Data Dashboard uses a relational database (MariaDB) to persist its data. To improve data integrity and minimize redundancy, the database has been normalized following standard conventions; the resulting structure aims to avoid anomalies and improve overall data quality.
For a visual representation, refer to the Enhanced Entity-Relationship (EER) diagram below:
Enhanced Entity-Relationship (EER) diagram
Database Schema
The database schema is available in the mariadb/schema.sql file.
Simplified Deployment of the Dashboard in Kubernetes
The database connection and configuration variables are passed to the dashboard via environment variables.
Because the dashboard and the database run as separate components, this makes it straightforward to deploy the dashboard in a Kubernetes environment.
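As a minimal sketch of this pattern, the configuration constants used by the connection function below could be populated from the environment as follows; the variable names and default values here are assumptions, not the project's actual configuration:

import os

# Illustrative only: read the database configuration from the environment,
# e.g. as set by a Kubernetes Deployment or Secret. Variable names and
# defaults are assumptions.
DB_CONTAINER_NAME_OR_ADDRESS = os.environ.get("DB_HOST", "localhost")
DB_USER = os.environ.get("DB_USER", "dashboard")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")
DB_NAME = os.environ.get("DB_NAME", "dashboard")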
Preparation of data for presentation in the Dashboard
To access the data, SQL queries are defined in the utility/db.py file, and the pymysql Python driver is used to run them.
The data is accessed in the following steps:
1. Connecting to the database
To establish a connection with the database, the connection function is defined as follows:
import pymysql

def connection():
    # Connection parameters come from the configuration constants, which are
    # populated from environment variables (see above).
    return pymysql.connect(
        host=DB_CONTAINER_NAME_OR_ADDRESS,
        user=DB_USER,
        password=DB_PASSWORD,
        database=DB_NAME,
    )
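As a quick usage sketch (not part of the project code), the function can be used to verify that the configured connection parameters work:

# Illustrative check that the configured connection parameters are valid.
conn = connection()
with conn.cursor() as cursor:
    cursor.execute("SELECT VERSION()")
    print(cursor.fetchone())  # prints the MariaDB server version as a 1-tuple
conn.close()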
2. Defining the SQL queries
The next step is to define the SQL query, for example:
SELECT p.publication_year AS 'Publication Year',
       p.`type` AS 'Publication Type',
       COUNT(DISTINCT p.id) AS 'Total Number of Datasets'
FROM publication AS p
INNER JOIN reference AS r ON p.id = r.reference_to_id
INNER JOIN publication AS pr ON pr.id = r.reference_id
INNER JOIN center_has_publication AS pc ON pr.id = pc.publication_id
INNER JOIN center AS c ON pc.center_id = c.id
WHERE p.`type` = 'Dataset'
  AND r.`sub_type` = 'IsSupplementedBy'
  AND p.publication_year >= 2000
  AND p.publication_year <= %s
GROUP BY p.publication_year
ORDER BY p.publication_year
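Note that the %s at the end of the WHERE clause is a pymysql parameter placeholder, not Python string formatting; the value is supplied separately when the query is executed. A minimal sketch of how such a query might be kept as a reusable constant (the exact layout of utility/db.py is an assumption, only the constant name reappears in step 3.2):

# Sketch only: queries are stored as module-level constants so they can be
# reused across the dashboard; %s is bound by pymysql at execution time.
SELECT_DATASET_PUBLICATION_COUNT_QUERY = """
SELECT p.publication_year AS 'Publication Year',
       p.`type` AS 'Publication Type',
       COUNT(DISTINCT p.id) AS 'Total Number of Datasets'
FROM publication AS p
INNER JOIN reference AS r ON p.id = r.reference_to_id
INNER JOIN publication AS pr ON pr.id = r.reference_id
INNER JOIN center_has_publication AS pc ON pr.id = pc.publication_id
INNER JOIN center AS c ON pc.center_id = c.id
WHERE p.`type` = 'Dataset'
  AND r.`sub_type` = 'IsSupplementedBy'
  AND p.publication_year >= 2000
  AND p.publication_year <= %s
GROUP BY p.publication_year
ORDER BY p.publication_year
"""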
3. Executing the queries
3.1. Function to execute
The following function executes a query and stores the result in a pandas DataFrame, which can then be used wherever it is needed in the dashboard.
import pandas as pd

def get_dataframe(
    query, args=None, success_message="Data Selection Successfully Done!"
):
    # Open a connection and run the (optionally parameterised) query.
    conn = connection()
    cursor = conn.cursor()
    cursor.execute(query, args)
    data = cursor.fetchall()
    # Build a DataFrame, taking the column names from the cursor description.
    dataframe = pd.DataFrame(data, columns=[i[0] for i in cursor.description])
    try:
        conn.commit()
        cursor.close()
        print(success_message)
        return dataframe
    except pymysql.Error as err:
        print(f"could not close connection error pymysql {err.args[0]}: {err.args[1]}")
        conn.rollback()
        cursor.close()
        return None
3.2. Executing the function
The following call executes the query and returns a DataFrame that can be used across the entire dashboard.
data_count_df = get_dataframe(
    SELECT_DATASET_PUBLICATION_COUNT_QUERY,
    args=[MAX_YEAR],
    success_message="DATASET PUBLICATION COUNT queried successfully",
)
The objective has been to reuse the queried data as much as possible throughout the dashboard.
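As a hedged illustration of this reuse, several views can be derived from the single data_count_df returned in step 3.2 with plain pandas operations instead of issuing a separate SQL query per chart; the derived views below are examples, not the dashboard's actual components:

# Illustrative only: derive multiple views from the one queried dataframe.
per_year = data_count_df.set_index("Publication Year")["Total Number of Datasets"]

# A second view, e.g. cumulative dataset growth over time, reuses the same
# dataframe rather than running another query against the database.
cumulative_growth = per_year.cumsum()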