1. Introduction and Goals

Shepard is a multi-database storage system for highly heterogeneous research data. It provides a consistent API for depositing and accessing any type of supported data and serves as a platform for working with experiment data on the way to publication.

With the expansion of Shepard, we are creating a more accessible data management platform for research data in order to conduct excellent, data-based research at DLR and beyond.

With the expansion of Shepard, we are increasing the usability, scalability and customizability of the platform to further expand its use, continue to build the community and promote research according to the FAIR principles.

1.1. Quality Goals

Quality goals are prioritized from top to bottom.

  1. Usability: The interface should be intuitive to use and provide users with the best possible support in their work. Using the interface should be enjoyable. Users must be able to find the data they are looking for easily.

  2. Reliability: Data transferred during an experiment must not be lost. Software updates must not break existing data.

  3. Maintainability: Changes and extensions to the software should be possible efficiently and cost-effectively.

  4. Performance: The system must be able to handle large volumes of data efficiently.

  5. Operability: It should be easy to get the system up and running; the same applies to configuring and updating it.

2. Architecture Constraints

  • On-Premises: The system must be operational without accessing the internet. No cloud services allowed.

  • Respect FAIR principles: Data must meet the following principles: findability, accessibility, interoperability and reusability. The actual implementation in the project still needs to be clarified.

  • Integration ability: While shepard holds the data, analysis is done by other tools. Shepard therefore has to provide a REST interface for accessing this data.

  • Open Source: In general, only software licences are allowed that comply with the Open Source definition - in brief, they allow software to be freely used, modified and shared.

  • Existing data must continue to be usable: Data must not be lost and must remain accessible, especially after software updates. Breaking changes are allowed, but a migration strategy must be provided where necessary.

  • Software must be operational in the DLR environment:

    • no access to the internet

    • DNS is not available everywhere

  • Responsiveness:

    • shepard works well on desktop screens from 14 to 24 inches at 1080p resolution, including at half window size

    • shepard works well on tablets

    • shepard is not optimized for mobile devices

  • Browser support: shepard supports at least Firefox ESR and the latest version of Microsoft Edge

  • Accessibility: Basic features such as high contrast and large font sizes should be implemented. Special features such as screen reader support are not required.

3. System Scope and Context

This chapter is still under construction

3.1. Business Context

[Diagram: business context]

3.2. Technical Context

TBD

3.3. Users and Roles

User/Role Description

Administrator

The administrator sets up and configures a shepard instance.

Researcher

Researchers use the system as a data sink for external data sources. They run experiments and link data that belongs together. They use the data for further analysis.

3.4. Use Cases

[Diagram: use cases]

3.4.1. Create Collection

Collections are meant as the root or container of your entire structure. All information, data, files, permissions, etc. are related to exactly one collection. Usually, a collection is the first thing you create.

3.4.2. Create Project Structure

In order to organize your project, you use Data Objects. They are very generic and can be used to organize your work as you wish. You can use them to create a tree based on experiments, lots, process steps or whatever you want.

3.4.3. Create Container

There are different types of containers available, e.g. for files, structured data, timeseries or semantic repositories. You can use them to group things together. If you want to store some images from a video camera, you can create a file container and upload them. If you need to store the coordinates of a robot movement, you can create a timeseries container and store the data there.

3.4.4. Use Lab Journal

The Lab Journal enables researchers to document their experiments, data and results. Lab Journal entries can be linked to any Data Object. They also support HTML content, so that basic formatting and structuring can be used.

4. Solution Strategy

This chapter is still under construction

Quality Goal Scenario Solution Approach Link to Details

Usability

Reliability

Maintainability

Performance

Operability

4.1. Backend (Quarkus)

4.1.1. Modularization

Previously, the modules were organized by their technical purpose: there was one package for all endpoints, one package for all neo4j entities, one package for all DAOs, etc. As a result, closely coupled logic and domain objects ended up far away from each other, which spread changes across the code base and made the code of a single feature hard to grasp.

To mitigate this, we decided on a new modularization strategy that will be applied with each update or refactoring touching a part of the code. The first example is the new timeseries module under de.dlr.shepard.timeseries. The following is a description of this target modularization.

The backend is split into modules by their functionality. That means that there is e.g. a module for timeseries including managing containers and timeseries data, a module for managing files, for collections & data objects etc.

Each of those modules contains all technical components needed to fulfill its purpose, including endpoints, services, domain model objects, DAOs, entities and repositories.
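For illustration, the target layout of one such module might look like this (an illustrative sketch following the class roles also used by the Lab Journal module described later; the concrete names may differ):

de.dlr.shepard.timeseries/
 | TimeseriesContainerRest      (REST endpoints)
 | TimeseriesContainerIO        (data transfer objects)
 | TimeseriesContainerService   (business logic)
 | TimeseriesContainer          (entity / domain model)
 | TimeseriesContainerDAO       (database access)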

Additionally, there may be modules for special functionalities like authentication.

4.1.2. REST Endpoints

We use quarkus-rest to define REST endpoints.

For authentication and general request validation, we use filters following the Jakarta REST way. The filters can be found at filters/. In general, all requests need to be done by authenticated users. The JWTFilter and the UserFilter take care of validating authentication.

Some endpoints should be public, for example /healthz and /versionz. To make this possible we use the PublicEndpointRegistry class. In this class we register all public endpoints in a static string array. The authentication filters will be bypassed for endpoints in this array. Since the /healthz endpoint is automatically public thanks to the SmallRye Health extension, we don’t need to add it to the PublicEndpointRegistry.
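The interplay between the filters and the registry can be pictured with the following sketch. Only JWTFilter and PublicEndpointRegistry are actual names from the code base; everything else, including the registry contents and method names, is assumed for illustration.

import jakarta.annotation.Priority;
import jakarta.ws.rs.Priorities;
import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.ext.Provider;

// Simplified stand-in for the real JWTFilter
@Provider
@Priority(Priorities.AUTHENTICATION)
public class JwtFilterSketch implements ContainerRequestFilter {

  @Override
  public void filter(ContainerRequestContext requestContext) {
    String path = requestContext.getUriInfo().getPath();

    // Public endpoints (e.g. /versionz) bypass authentication entirely
    if (PublicEndpointRegistrySketch.isPublic(path)) {
      return;
    }

    // All other requests must carry a bearer token that is validated afterwards
    String authHeader = requestContext.getHeaderString("Authorization");
    if (authHeader == null || !authHeader.startsWith("Bearer ")) {
      requestContext.abortWith(Response.status(Response.Status.UNAUTHORIZED).build());
      return;
    }
    // ... JWT signature validation against the configured public key would follow here
  }
}

// Simplified stand-in for the real PublicEndpointRegistry
class PublicEndpointRegistrySketch {

  // Public endpoints are registered in a static string array
  private static final String[] PUBLIC_ENDPOINTS = { "versionz" };

  static boolean isPublic(String path) {
    for (String endpoint : PUBLIC_ENDPOINTS) {
      if (path.endsWith(endpoint)) {
        return true;
      }
    }
    return false;
  }
}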

4.2. Frontend (Nuxt)

4.2.1. Structure

Each route is defined in pages/. The root file of a route itself should not contain a lot of logic; instead, it should invoke one or more components.

In components/ all components are stored. These are grouped in folders by domain, e.g. components/collections.

Stateful logic can be extracted into composables, which are stored in composables/.

Stateless utility functions should be stored under utils/.

Routing

We aim to make routes as understandable as possible and try to make each resource view have a unique URL in order to be able to directly link to resources.

This means that for example the collections list view is at /collections. One collection can be found at /collections/:id. Since data objects belong to a collection and share a common side bar menu with their collection, they can be found at /collections/:id/dataobjects/:id.

In order to navigate users to another page we aim to avoid:

  • JavaScript-only navigations like router.push(…), so that users can see where they will be redirected

  • standard href links, to avoid re-rendering the whole page

Instead, we use NuxtLink as much as possible. This shows the target URL on hover and enables client-side navigation in a hydrated page.

4.2.2. Backend Interaction

For the interaction with the backend we check a generated OpenAPI client into the repository; see here for more information.

In order to directly instantiate the API clients with our default configuration, we use the createApiInstance utility.

4.3. Technology Choices

4.3.1. Using Node LTS

The current LTS version of Node.js should be used throughout the project (GitLab jobs, docker images, local development, etc.) to maintain reliability and performance.

4.3.2. Using Quarkus LTS

The current LTS version of Quarkus should be used in our backend to maintain reliability and performance.

5. Building Block View

This chapter is still under construction

5.1. Whitebox Overall System

[Diagram: whitebox overview]

The backend is designed to be modular and expandable. One of the main tasks of the backend is the management of metadata. The data structure is used to manage the attached data. This structure as well as the corresponding references is managed by the core package and stored in Neo4j.

Contained Blackboxes:

Building Block Responsibility

Backend

Handles incoming requests from actors like the shepard Timeseries Collector, Webfrontend and Users. Allows these actors to upload, search and retrieve different kinds of data.

Webfrontend (UI)

Allows users to interact (upload, search, retrieve) with the stored data via a webapp.

External Services & Scripts

External services that interact with a shepard instance, for example the shepard Timeseries Collector (sTC). The sTC is one of many tools in the shepard ecosystem and aims to collect data from different sources and store it as timeseries data in shepard.

5.2. Level 2

[Diagram: whitebox backend]

Multiple databases are used to enable CRUD operations for many types of data. The user can access these databases via the REST API.

Each database integration has to create its own data structure as needed. The general structure with Entities, References and IOEntities is available to all integrations. In order to create a new database integration, one needs to create a new package. This package has to contain at least one database connector instance, the necessary data objects and a service class. In addition, the corresponding REST endpoints and the respective references must be implemented.

Contained Blackboxes:

Building Block Responsibility

Authorization

This module handles user permissions and roles. Most of the endpoints are protected and can only be used by authenticated users or with a valid API key.

Collections & DataObjects

Manages metadata and references of organizational elements like DataObjects, Collections and Containers. Consists of a connector to establish a connection to the Neo4j database, data access objects (dao) to create an interface between data objects and the Neo4j database, and their associated services to provide a higher level view on the database operations.

Timeseries (InfluxDB)

Manages timeseries data. Consists of a connector to handle the database and a service to provide a higher level view on the database operations.

Timeseries (TimescaleDB)

Manages timeseries data. Uses TimescaleDB instead of InfluxDB. In an experimental state right now. Will replace the old InfluxDB timeseries module.

Structured Data & Files

Manages structured data and file uploads. Consists of a connector to handle the database connection and two services to provide a higher-level view on the database operations for structured data and files.

Status

Contains a health and version endpoint that is accessible via REST. It is easily extensible to provide status information about the backend, such as the current state of database connections.

5.2.1. Level 3

Lab Journal
[Diagram: component lab journal]

The Lab Journal module allows users to create, edit and delete journal entries. Lab Journal entries can be used for documentation purposes. This feature is intended to be used via the frontend but can also be used directly via the REST interface if needed. Lab Journal entries are stored in the neo4j database. This module has a dependency on the Collections & DataObjects module because lab journal entries are always linked to a DataObject. It also has a dependency on the Authorization module because the user needs the correct permissions to see and edit lab journal entries.

[Diagram: classes lab journal]

Following our solution strategy, we have the following classes:

  • LabJournalEntryRest contains the REST endpoints

  • LabJournalEntryIO is the data transfer object used in the REST interface

  • LabJournalEntryService contains the business logic

  • LabJournalEntry is the main business entity containing the content as html string

  • LabJournalEntryDAO is used for communication with the neo4j database

Timeseries (TimescaleDB)

This module will replace the old implementation using InfluxDB. For decisions and reasoning, check out ADR-008 Database Target Architecture, ADR-010 Postgres/Timescaledb Image and ADR-011 Timescale database schema.

The new module handles persisting timeseries data in a TimescaleDB. It includes all relevant endpoints and services. The database schema includes two tables:

  • A timeseries table containing the metadata for each timeseries (measurement, field, etc.) similar to the metadata in an Influx timeseries.

  • A hypertable containing the data points.

The schema for both tables is defined by the database migrations in src/main/resources/db/migration. The timeseries table is managed in the code using Hibernate entities. The data point table is managed directly using custom queries, since we want to make full use of TimescaleDB features and performance.
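To illustrate the split, the Hibernate entity for the timeseries metadata table could look roughly like the following sketch. It is illustrative only; the real entity, table and column names are defined by the module and its migrations and may differ.

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// Metadata of a single timeseries; the actual data points live in a separate
// TimescaleDB hypertable that is accessed via custom queries instead of Hibernate.
@Entity
@Table(name = "timeseries")
public class TimeseriesSketchEntity {

  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  @Column(name = "measurement")
  private String measurement;

  @Column(name = "field")
  private String field;

  // getters and setters omitted
}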

6. Runtime View

7. Deployment View

7.1. Infrastructure Level 1

7.2. Infrastructure Level 2

8. Cross-cutting Concepts

8.1. Documentation Concept

8.1.1. Target Groups & Needs

Needs marked with (!) are not yet fulfilled, but will be taken into account in the future.

Target Group Needs

Researchers

  • Know how to explore the meta data structure and data using the shepard frontend (!)

  • Know how to use the API to retrieve data for analysis including authentication

  • Know how to store data in shepard via API or UI

  • Understand the Meta Data Model created by the Project manager and the data structure of shepard (collections, data objects & containers)

  • Information on new features and breaking changes (!)

Integrators

  • Understand containers to fill data into them

  • Know how to store data in shepard via API or UI

  • Information on new features and breaking changes (!)

Project Managers

  • Know how to use the frontend to create collections and meta data structures (!)

  • Information on the data model and how to interact with it and use it

  • May need information on new features (!)

Administrators

  • Instructions on how to deploy, run and update a shepard instance

  • Information on new features and breaking changes (also to know when to update and who to inform)

Backend/Frontend developers

  • “Getting Started” guide on how to set up a working dev environment

  • May need additional information on development guidelines or best practices

  • Information on Context, Constraints & Requirements

  • Information on Architecture, Deployment Setup, Decisions, Technical Debt

  • Open Issues, Bugs,

  • Roadmap (!)

Maintainers

  • Know how to release shepard

  • Know how to deploy shepard

  • Know previous decisions

  • Open Issues, Bugs,

  • Roadmap (!)

8.1.2. Documentation Artifacts

The following artifacts are provided as documentation of shepard:

Artifact Notes Link

Architectural Documentation

  • Follows the arc42 template.

  • Describes architectural aspects of shepard like constraints, requirements, building blocks, decisions and concepts

  • Includes documentation on the release process and updating dependencies

Wiki (Consumer Documentation)

Explains basic concepts relevant for using shepard. Also includes examples of how to interact with shepard.

Release Notes

Contains information for each release of shepard.

OpenAPI Spec

The OpenAPI spec describes the REST API of shepard.

Administrator Documentation

Contains all relevant information for administrators to successfully operate a shepard instance.

CONTRIBUTING.md

Contains all relevant information on how to contribute to shepard, including:

  • How to setup a dev environment

  • Coding & code review guidelines

  • How code is integrated and reviewed

GitLab Issues

GitLab issues are used to track bugs, feature requests and todos for developers, including relevant discussions.


8.2. Authentication

We decided to rely on external identity providers for shepard. This allows us to use existing user databases such as Active Directories. In addition, we do not have to implement our own user database. Most shepard instances use Keycloak as their identity provider. However, we want to be compatible with the OIDC specification so that other OIDC identity providers can also work with shepard.

The JWTFilter, which filters every incoming request, implements authentication by validating the submitted JWT. For this purpose, the JWT is decoded with a statically configured public key. OIDC allows the key to be obtained dynamically from the identity provider. However, we decided that a static configuration is more secure and has practically no disadvantages. The attack vector we are trying to mitigate here is that an attacker gains access to the infrastructure and somehow injects their own public key, which shepard would accept from that point on.

If configured, the system also checks whether certain roles are present in the JWT's realm_access.roles claim. This can be done by configuring the variable OIDC_ROLES for the backend. The backend then only accepts JWTs that carry the specified roles. This enables the reuse of existing identity providers for different shepard instances, each of which can be accessed by different user groups. For example, if Keycloak fetches its users from an Active Directory, it can add specific roles to users based on the AD groups they belong to.
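Conceptually, the role check compares the configured roles with the roles found in the token, as in the following sketch. This is illustrative only; whether the real check requires any or all of the configured roles is determined by the actual JWTFilter implementation.

import java.util.List;

public class RoleCheckSketch {

  /**
   * @param configuredRoles roles configured via OIDC_ROLES (an empty list means no role restriction)
   * @param tokenRoles      roles taken from the JWT's realm_access.roles claim
   */
  public static boolean hasRequiredRole(List<String> configuredRoles, List<String> tokenRoles) {
    if (configuredRoles.isEmpty()) {
      return true; // no role restriction configured, accept all valid JWTs
    }
    // Accept the token if it carries at least one of the configured roles
    return configuredRoles.stream().anyMatch(tokenRoles::contains);
  }
}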

In addition to OIDC, we also allow authentication via API keys. Shepard generates these keys itself and stores them in our internal database. Although the API keys are also JWTs, we have to check whether the specified key can be found in our database. Otherwise, we would continue to accept keys that have already been deleted, which is not the intended behavior.

8.2.1. Using Nuxt-Auth

To be able to authenticate in the frontend and acquire a JWT token, we use @sidebase/nuxt-auth as our authentication module in the frontend.

Adjust Nuxt config

The file nuxt.config.ts holds configuration for the application.

That is where we need to add the @sidebase/nuxt-auth module to the modules array.

Then, in the same config object, populate the auth configuration.

The auth config holds details about our authentication provider and session refresh management.

Add the @sidebase/nuxt-auth module to the modules array and enable the auth configuration in nuxt.config.ts

export default defineNuxtConfig({
  modules: [
    "@sidebase/nuxt-auth",
    ...,
  ],

  auth: {
    isEnabled: true,
    provider: {
      type: "authjs",
      ...,
    },
    sessionRefresh: {...},
  },
  ...,
})

Details about the auth config attributes can be found in the docs.

Add environment variables

A couple of env variables are needed for this to work. These variables are documented in the setup of the frontend.

To be able to make use of them we should list them in the runtimeConfig.

export default defineNuxtConfig({
  runtimeConfig: {
    authSecret: "",
    oidcClientId: "",
    oidcIssuer: "",
  },
})
Configure the authentication provider

After the configuration adjustments mentioned previously, an auth path is automatically created at /api/auth.

This is where we create our OIDC provider config, under /src/server/api/auth/[…].ts.

export default NuxtAuthHandler({
  secret: runtimeConfig.authSecret,
  providers: [
    {
      id: "oidc",
      name: "OIDC",
      type: "oauth",
      ...,
    },
  ],
})

Details about the provider config can be found in the NextAuth docs

After this setup we should be able to authenticate using the specified OIDC provider.

To handle token and session refresh we can use the jwt() and session() callbacks to control the behavior in the same NuxtAuthHandler.

8.3. Authorization

8.3.1. Requirements

  • The backend can verify the identity of users

  • Users are uniquely identified in the backend by usernames

  • The backend can easily verify whether a user has permissions to a particular object

  • This check is quick and easy to perform, so there is no noticeable delay

  • Existing records can still be used

Owner
  • Objects have a unique owner

  • Objects without owners belong to everyone (backward compatibility)

  • Owners can be changed later

  • Owners automatically have all permissions to the object

  • Owners automatically have all permissions on all subordinate objects (inheritance)

  • Newly created objects belong to the creator unless otherwise specified

Permissions
  • There are different permissions for readability, writability and manageability

  • Permissions can be set only for collections and containers, but apply to all subordinate objects

  • For each object, there is a list of users who are allowed to read/write/manage the object

  • The different permissions build upon each other (read < write < manage)

  • Permissions can be edited by all users with manage permissions

  • Collections and containers can be created by all users with access

  • Newly created objects can be read and written by everyone with access

Long-lived access tokens (Api Keys)
  • Api Keys are used to authenticate and authorize a client for a specific task

  • Api Keys belong to one user

  • Api Keys can only authorize something as long as the user is allowed to do so

  • If a user no longer exists, their Api Keys are automatically invalidated

Payload databases
  • Creation of new data is allowed for any logged-in user

  • Integrated databases contain payload containers represented by a container object in the data model

  • Users can create payload containers via the root endpoints

  • Containers can be populated with data via the type/container_id/ URL (e.g. /files/<id>/, /timeseries/<id>/)

  • Containers can be restricted by the permission system mentioned above

  • A reference contains the specific ID of the uploaded data inside the container

  • Multiple references can point to one and the same data, or narrow it down further

8.3.2. Implementation

Users
  • Users are stored in Neo4j

  • A user also has the following attributes (arrows → indicate relationships)

  • owned_by → List of entities, references and containers (n:1)

  • readable_by → List of entities (n:m)

  • writable_by → List of entities (n:m)

  • managable_by → List of entities (n:m)

Endpoints
  • An endpoint /…/<id>/permissions can be used to manage the permissions of an object

  • Allowed methods are GET and PUT

  • Permissions have the following format:

{
   "readableBy": [
      <usernames>
   ],
   "writableBy": [
      <usernames>
   ],
   "managableBy": [
      <usernames>
   ],
   "ownedBy": <username>
}
Api Keys
  • Api Keys are stored in Neo4j

  • Each time an AccessToken is accessed, it must be checked that the owner of this token also has the corresponding authorization

  • Api Keys have the following attributes

  • uid: UUID

  • name: String

  • created_at: Date

  • jws: Hex String (Will never be delivered after creation)

  • belongs_to: User (n:1)

Access
  • For each access it must be checked whether

  • the user owns the requested object

  • the user is authorized for the requested object

  • the user is authorized to write to the requested payload object (if a payload database is accessed)

[Diagram: authorization flow]

8.3.3. Consequences

  • Subscriptions belong to a user, whose privileges are used to filter callbacks

  • Simple, concrete requests are allowed or forbidden

  • Callbacks must be checked on a case-by-case basis

  • Results of search queries must be filtered

8.3.4. Open Issues

  • Groups/Roles (OpenID Connect, internal, …​)

  • Restrict Api Keys to specific entities

  • Api Keys can expire

8.4. User Information

Shepard needs to know certain information about the current user, such as the first and last name and e-mail address. We can retrieve some information from the given JWT, as Keycloak usually adds some information there. However, most of the fields are not required by the specification, so we have to use other measures to get the required information. OIDC specifies a UserinfoEndpoint which can be used to retrieve some data about the current user. We have implemented a UserinfoService to access this data. Each time a user sends a request, the UserFilter fetches the relevant user information from the identity provider and updates the internally stored data if necessary. To reduce the number of requests, we have implemented a grace period during which no new data is retrieved.
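The grace period logic can be pictured with the following sketch. The concrete duration and the way UserFilter and UserinfoService store the timestamp are not shown here and are assumptions.

import java.time.Duration;
import java.time.Instant;

public class UserInfoRefreshSketch {

  // Illustrative value; the real grace period is defined by the backend
  private static final Duration GRACE_PERIOD = Duration.ofMinutes(5);

  // Returns true if the stored user information is older than the grace period
  // and should therefore be refreshed from the identity provider's userinfo endpoint.
  public static boolean needsRefresh(Instant lastUpdatedAt, Instant now) {
    if (lastUpdatedAt == null) {
      return true; // no user information stored yet
    }
    return Duration.between(lastUpdatedAt, now).compareTo(GRACE_PERIOD) > 0;
  }
}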

8.5. Dependency Updates

Dependencies of shepard are updated regularly. To automate most of this, we use renovate in GitLab. The configuration for the repository is located at renovate.json. In order for the config to be active, it has to be present at the default branch of the repository (main). The renovate runner is located in this (private) repository: https://gitlab.com/dlr-shepard/renovate-runner.

The developer team is responsible for regularly handling the merge requests opened by renovate. This should happen once a month, directly after creating a monthly release. As a reminder, monthly update tickets are part of the sprints.

8.5.1. Performing Updates

We handle the merge requests opened by renovate by performing the following steps for each update:

  1. reading the change logs of the dependency

  2. testing if everything is still working

  3. applying necessary changes if they are not too much effort

  4. merging the branch or suspending the update.

Also, the dependencies in package-lock.json should be updated. This is done by running npm update in the top level directory.

8.5.2. Suspending an Update

In case we could not perform the update, it should be suspended and documented in the list of suspended updates. The reason can either be that the update requires too much effort (in which case we create a new story for it) or that the update is not possible or feasible right now.

This can be done by excluding the library or specific version in the renovate config. Afterwards, the config change needs to be merged to main with the following commands:

git checkout main
git cherry-pick <commit-hash>

Afterwards the merge request can be closed.

8.5.3. Abandoned dependency updates

Sometimes, when the configuration changes or dependency updates were done without renovate, the bot might abandon a merge request. In this case the merge request is not automatically closed and has to be closed manually. The corresponding branch must also be deleted manually to keep things clean.

Use Only LTS updates for Quarkus

In our technology choices we decided to rely only on LTS releases of Quarkus. Currently, there is no way to advise the renovate bot to respect only LTS releases of Quarkus. Therefore, we have to check manually that a Quarkus update always targets the latest LTS release. We do not want to update to non-LTS versions of Quarkus. A list of current Quarkus releases can be found on the Quarkus releases page.

Suspended Updates
Package and version Issue that blocks an update

tomcat<11

v11 is still a pre-release

influxdb<=1.8

V2 introduces major breaking changes. Since we want to move to TimescaleDB anyway, we disregard any new updates that require some kind of migration effort.

chronograf<1.10

The container cannot be started with v1.10. We expect to move away from influxdb in the future, so we will stick with v1.9 for the time being.

neo4j<5

V5 introduces major breaking changes

mongo<5

no real reason, there are some major changes, but nothing serious

vue<3

not compatible with bootstrap v4

vue-router<4

not compatible with vue v2

vuex<4

not compatible with vue v2

bootstrap<5

v5 has no vue integration

portal-vue<3

needed for bootstrap-vue

typescript<5

not compatible with vue v2

@vue/tsconfig<2

not compatible with vue v2

vue-tsc<2.0.24

eslint<9

--ext option is not supported and the current eslintrc file structure is not supported anymore.

@vue/eslint-config-prettier<10

Has peer dependency to current version of eslint

@vue/eslint-config-typescript<14

Has peer dependency to current version of eslint

neo4j-ogm<4

v5 is not compatible with neo4j v4

jjwt<0.12

v0.12.x introduces a series of breaking changes in preparation for v1.0. It is recommended to stay on v0.11 until v1.0 is finished to fix all changes at once.

junit-jupiter<5.11

Not possible atm because parametrized tests in combination with CsvSource do not work any longer. We will wait for the next version.

@vueuse/core<12 (old frontend)

v12 drops support for Vue v2

vite<6 (old frontend)

Peer dependency to @vueuse/core v12 (reason in the line above)

versions-maven-plugin<2.18

Maven report fails to generate in pipeline job

license-maven-plugin<2.5

Maven report fails to generate in pipeline job

8.6. Export Collections

The export feature exports an entire collection including all data objects, references and referenced payloads to a zip file. Metadata is added in the form of a ro-crate-metadata.json file as per the Research Object Crate specification.

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "name": "Research Object Crate",
      "description": "Research Object Crate representing the shepard Collection",
      "@id": "./",
      "@type": "Dataset",
      "hasPart": [
        ...
      ]
    },
    {
      "about": {
        "@id": "./"
      },
      "conformsTo": {
        "@id": "https://w3id.org/ro/crate/1.1"
      },
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork"
    },
    ...
  ]
}

The zip file contains all files at the top level. This conforms to both the RO-Crate specification and our internal structure. Relationships between elements are recorded as metadata.

<RO-Crate>/
 | ro-crate-metadata.json
 | DataObject1
 | Reference2
 | Payload3
 | ...

Organizational elements are added as JSON files, as they would also be returned via the REST API. These files are named according to their shepard ID, which ensures that the file names are unique. Payloads are added as they are; timeseries are exported in the corresponding CSV format. For each exported part there is an object in the file ro-crate-metadata.json with additional metadata. We use the field additionalType to specify the respective data type of an organizational element.

{
  "name": "DataObject 1",
  "encodingFormat": "application/json",
  "dateCreated": "2024-07-02T06:41:19.813",
  "additionalType": "DataObject",
  "@id": "123.json",
  "author": {
    "@id": "haas_tb"
  },
  "@type": "File"
}

Shepard also adds the respective authors to the metadata.

{
  "@id": "haas_tb",
  "email": "tobias.haase@dlr.de",
  "givenName": "Tobias",
  "familyName": "Haase",
  "@type": "Person"
}

8.7. Ontologies

8.7.1. Registries

BARTOC knows about terminology registries, including itself. Registries also provide access to full terminologies either via an API (terminology service) or by other means (terminology repository).

Typical "interfaces":

  • sparql

  • jskos

  • ontoportal

  • webservice

  • ols

  • skosmos

(others could include eclass or IEEE iris)

8.7.2. Semantic Repository

  • GET, POST …​/semanticRepository/

  • GET, PUT, DELETE …​/semanticRepository/{containerId}

{
  "id": 123,
  "name": "Ontobee",
  "sparql-endpoint": "http://www.ontobee.org/sparql"
}

8.7.3. Semantic Annotation

  • GET, POST …​/collections/{collectionId}/annotations/

  • GET, PUT, DELETE …​/collections/{collectionId}/annotations/{annotationId}

  • GET, POST …​/collections/{collectionId}/dataObjects/{dataObjectId}/annotations/

  • GET, PUT, DELETE …​/collections/{collectionId}/dataObjects/{dataObjectId}/annotations/{annotationId}

  • GET, POST …​/collections/{collectionId}/dataObjects/{dataObjectId}/references/{referenceId}/annotations/

  • GET, PUT, DELETE …​/collections/{collectionId}/dataObjects/{dataObjectId}/references/{referenceId}/annotations/{annotationId}

{
  "id": 456,
  "propertyRepositoryId": 123,
  "property": "http://purl.obolibrary.org/obo/UO_0000012",
  "valueRepositoryId": 123,
  "value": "http://purl.obolibrary.org/obo/RO_0002536"
}

8.7.4. Ideas

Ontologies of interest
References / Examples of semantic annotation in other systems
    <annotation>
        <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
        <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>

8.8. Search Concept

8.8.1. Structured Data

Query documents using native mongoDB mechanics

  1. Receiving search query via POST request

    {
      "scopes": [
        {
          "collectionId": 123,
          "dataObjectId": 456,
          "traversalRules": ["children"]
        }
      ],
      "search": {
        "query": {
          "query": "{ status: 'A', qty: { $lt: 30 } }"
        },
        "queryType": "structuredData"
      }
    }
  2. Find all relevant references (children of dataObject with id 456)

  3. Find the containers of these references

  4. Build query

    db.inventory.find({ "_id": { "$in": [ <list of containers from step 3> ] }, <user query> })  (implicit AND)
  5. Query mongoDB (4)

  6. Return results

    {
      "resultSet": [
        {
          "collectionId": 123,
          "dataObjectId": 456,
          "referenceId": 789
        }
      ],
      "search": {
        "query": {
          "query": "{ status: 'A', qty: { $lt: 30 } }"
        },
        "queryType": "structuredData"
      }
    }

8.8.2. Files

tbd

8.8.3. Timeseries

tbd

8.8.4. MetaData

needs MetaData Reference, tbd

8.8.5. Organizational Elements

Query collections, data objects and references

Query objects

The query object consists of logical objects and matching objects. Matching objects can contain the following attributes:

  • name (String)

  • description (String)

  • createdAt (Date)

  • createdBy (String)

  • updatedAt (Date)

  • updatedBy (String)

  • attributes (Map[String, String])

The following logical objects are supported:

  • not (has one clause)

  • and (has a list of clauses)

  • or (has a list of clauses)

  • xor (has a list of clauses)

  • gt (greater than, has value)

  • lt (lower than, has value)

  • ge (greater or equal, has value)

  • le (lower or equal, has value)

  • eq (equals, has value)

  • contains (contains, has value)

  • in (in, has a list of values)

{
  "AND": [
    {
      "property": "name",
      "value": "MyName",
      "operator": "eq"
    },
    {
      "property": "number",
      "value": 123,
      "operator": "le"
    },
    {
      "property": "createdBy",
      "value": "haas_tb",
      "operator": "eq"
    },
    {
      "property": "attributes.a",
      "value": [1, 2, 3],
      "operator": "in"
    },
    {
      "OR": [
        {
          "property": "createdAt",
          "value": "2021-05-12",
          "operator": "gt"
        },
        {
          "property": "attributes.b",
          "value": "abc",
          "operator": "contains"
        }
      ]
    },
    {
      "NOT": {
        "property": "attributes.b",
        "value": "abc",
        "operator": "contains"
      }
    }
  ]
}
Procedure
  1. Receiving search query via POST request

    {
      "scopes": [
        {
          "collectionId": 123,
          "dataObjectId": 456,
          "traversalRules": ["children"]
        }
      ],
      "search": {
        "query": {
          "query": "<json formatted query string (see above)>"
        },
        "queryType": "organizational"
      }
    }
  2. Find all relevant elements (here the nodes with IDs 1, 2 and 3)

  3. Build query

    MATCH (n)-[:createdBy]-(c:User) WHERE ID(n) in [1,2,3]
      AND c.username = "haas_tb"
      AND n.name = "MyName"
      AND n.description CONTAINS "Hallo Welt"
      AND n.`attributes.a` = "b"
      AND (
        n.createdAt > date("2021-05-12") OR n.`attributes.b` CONTAINS "abc"
      )
    RETURN n
  4. Query neo4j (3)

  5. Return results

    {
      "resultSet": [
        {
          "collectionId": 123,
          "dataObjectId": 456,
          "referenceId": null
        }
      ],
      "search": {
        "query": {
          "query": "<>"
        },
        "queryType": "organizational"
      }
    }

8.8.6. User

  1. Receiving search query via GET request /search/users

  2. Possible query parameters are username, firstName, lastName, and email

  3. Build query to enable regular expressions

    MATCH (u:User) WHERE u.firstName =~ "John" AND u.lastName =~ "Doe" RETURN u
  4. Query neo4j (3)

  5. Return results

    [
      {
        "username": "string",
        "firstName": "string",
        "lastName": "string",
        "email": "string",
        "subscriptionIds": [0],
        "apiKeyIds": ["3fa85f64-5717-4562-b3fc-2c963f66afa6"]
      }
    ]

8.8.7. OpenAPI Spec

openapi: 3.0.2
info:
  title: FastAPI
  version: 0.1.0
paths:
  /search/:
    post:
      summary: Search
      operationId: search_search__post
      requestBody:
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/SearchRequest"
        required: true
      responses:
        "200":
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/SearchResult"
        "422":
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/HTTPValidationError"
components:
  schemas:
    HTTPValidationError:
      title: HTTPValidationError
      type: object
      properties:
        detail:
          title: Detail
          type: array
          items:
            $ref: "#/components/schemas/ValidationError"
    Query:
      title: Query
      required:
        - query
      type: object
      properties:
        query:
          title: Query
          type: string
    QueryType:
      title: QueryType
      enum:
        - structuredData
        - timeseries
        - file
      type: string
      description: An enumeration.
    Result:
      title: Result
      required:
        - collectionId
        - dataObjectId
        - referenceId
      type: object
      properties:
        collectionId:
          title: Collectionid
          type: integer
        dataObjectId:
          title: Dataobjectid
          type: integer
        referenceId:
          title: Referenceid
          type: integer
    Scope:
      title: Scope
      required:
        - collectionId
        - traversalRules
      type: object
      properties:
        collectionId:
          title: Collectionid
          type: integer
        dataObjectId:
          title: Dataobjectid
          type: integer
        traversalRules:
          type: array
          items:
            $ref: "#/components/schemas/TraversalRule"
    SearchEntity:
      title: SearchEntity
      required:
        - query
        - queryType
      type: object
      properties:
        query:
          $ref: "#/components/schemas/Query"
        queryType:
          $ref: "#/components/schemas/QueryType"
    SearchRequest:
      title: SearchRequest
      required:
        - scopes
        - search
      type: object
      properties:
        scopes:
          title: Scopes
          type: array
          items:
            $ref: "#/components/schemas/Scope"
        search:
          $ref: "#/components/schemas/SearchEntity"
    SearchResult:
      title: SearchResult
      required:
        - resultSet
        - search
      type: object
      properties:
        resultSet:
          title: Resultset
          type: array
          items:
            $ref: "#/components/schemas/Result"
        search:
          $ref: "#/components/schemas/SearchEntity"
    TraversalRule:
      title: TraversalRule
      enum:
        - children
        - parent
        - predecessors
        - successors
      type: string
      description: An enumeration.
    ValidationError:
      title: ValidationError
      required:
        - loc
        - msg
        - type
      type: object
      properties:
        loc:
          title: Location
          type: array
          items:
            type: string
        msg:
          title: Message
          type: string
        type:
          title: Error Type
          type: string

8.9. Release Process

A shepard release consists of a new version number, build artifacts (container, clients, etc.), a release tag on main, and release notes.

8.9.1. Release frequency

Usually a new shepard version is released on the first Monday of the month. However, this date is not fixed and can be postponed by a few days if necessary. This monthly release increases the release version number.

We use semantic versioning, meaning that the version number consists of a major, minor and patch number in the format MAJOR.MINOR.PATCH. Minor is the default version increase for a release, breaking changes imply a Major release, and hotfixes or patches are published as a Patch release (see the hotfix process below).

Currently, there are two release workflows: a minor/major release and a patch release. Both release types are explained step by step below.

8.9.2. Performing releases

These steps describe a regular (monthly) release for shepard but can also be used to release an unplanned patch release.

Furthermore, there are two ways to create an unplanned patch/hotfix release.

The first option is the more classical hotfix approach, meaning that it only brings the changes from merge requests containing hotfixes from the develop branch to the main branch. The steps needed for this option are explained below in the section: Performing a hotfix release.

The second option is to create an MR containing the needed patch changes on the develop branch, then merge the develop branch into main and create a new minor release. This creates a new out-of-cycle release containing the patch and all changes from develop since the last release. This option follows the same procedure as a regular release, which is described right below.

The following steps are necessary to provide a new minor or major release of shepard:
  1. Finish development and make sure the develop branch is stable, the pipeline is successful and no code reviews are open

  2. Optional: Merge the main branch into develop in order to reapply any cherry-picked commits

  3. Merge the develop branch into the main branch

  4. Prepare an official release by using the shepard release script

  5. To set up the release script, follow the steps listed in the Scripts README.md

  6. Run the following command:

    poetry run cli release ./token.txt
  7. The script will ask if the release is Patch, Minor or Major and calculates the new version accordingly. The script automatically uses Major if the previous changes contain breaking changes.

  8. Verify the listed merged merge requests

  9. Verify the release notes created by the script. (editor opens automatically)

  10. Suggest a release title that will be appended to the version number.

  11. Confirm the generated details.

  12. Verify that everything was successfully created. (GitLab Release, Release Notes, etc.)

8.9.3. Performing a hotfix release

Hotfixes are changes to the main branch outside the regular releases to fix urgent bugs or make small changes that need to be applied to the default branch of the project. The steps below describe how one can release a single hotfix MR without having to merge the develop branch into main. This means that the other changes on the develop branch are only merged when a new regular release is created.

Hotfix process
  1. As usual, a merge request with the hotfix must be created, reviewed, and merged to develop

  2. The resulting MR commit must be cherry-picked from develop to main

    git checkout main
    git cherry-pick <commit-hash>
    git push
  3. The shepard release script needs to be run, in order to create a new hotfix release.

  4. To set up the release script, follow the steps listed in the Scripts README.md

  5. Run the following command:

    poetry run cli release ./token.txt
  6. The script will ask if the release is Patch, Minor or Major and calculates the new version accordingly. Here you should select a Patch version, since you only want to release a hotfix/ patch.

  7. Verify the listed merged merge request

  8. Verify the release notes created by the script. (editor opens automatically)

  9. Suggest a release title that will be appended to the version number.

  10. Confirm the generated details.

  11. Verify that everything was successfully created. (GitLab Release, Release Notes, etc.)

8.9.4. Actions done by the release script in the background

The following steps are carried out by the release script:
  • Collecting all previous merge requests from the last version until now.

    • Analyze if previous changes contain breaking changes.

  • A GitLab Release including release notes directed at administrators and users is created

    • The title is the title given by the user concatenated with the version tag

    • A short paragraph describes the most important changes

    • Breaking changes are listed in a prominent way

    • Other changes besides dependency updates are listed below

  • A release tag <version number> on main is created

    • The script automatically uses a Major version increase if the previous changes contain breaking changes.

  • Ask the user if the script should automatically create an 'Update Dependencies' issue for the current milestone after performing a successful release. This is done since we agreed on updating all dependencies after performing a release.

8.10. Configuration

8.10.1. Backend (Quarkus)

This section is a short summary of this page.

Application Properties
Setting Properties

Quarkus reads configuration properties from several sources. More information on the sources and how they override each other can be found here.

We define a standard value for most properties under src/main/resources/application.properties. For the dev and test environment, we provide properties with a %dev, %test or %integration prefix overriding the default value.

Additionally, they can be overridden locally using a .env file. We use this for configuration differing between developers, e.g. the OIDC config. In a dockerized setup they can be overridden by providing environment variables to the service.

To support administrators, relevant configuration options are documented in infrastructure/.env.example and infrastructure/README.md.

Reading Properties

Properties can either be injected or accessed programmatically.

Feature Toggles

With feature toggles we want to conditionally build shepard images with or without a certain feature. This is especially useful for features under development.

To define a feature toggle, we add the property to configuration.feature.toggles.FeatureToggleHelper and create a class in configuration.feature.toggles that contains the name of the property, an isEnabled method, and the method ID of isEnabled. An example could look like this:

package de.dlr.shepard.configuration.feature.toggles;

public class ExperimentalTimeseriesFeatureToggle {

  public static final String TOGGLE_PROPERTY = "shepard.experimental-timeseries.enabled";

  public static final String IS_ENABLED_METHOD_ID =
    "de.dlr.shepard.configuration.feature.toggles.ExperimentalTimeseriesFeatureToggle#isEnabled";

  public static boolean isEnabled() {
    return FeatureToggleHelper.isToggleEnabled(TOGGLE_PROPERTY);
  }
}
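For context, the FeatureToggleHelper essentially just reads the given property via MicroProfile Config; a minimal sketch (the actual implementation may differ) could look like this:

package de.dlr.shepard.configuration.feature.toggles;

import org.eclipse.microprofile.config.ConfigProvider;

public class FeatureToggleHelper {

  // A toggle counts as enabled only if the property is present and set to true
  public static boolean isToggleEnabled(String toggleProperty) {
    return ConfigProvider.getConfig().getOptionalValue(toggleProperty, Boolean.class).orElse(false);
  }
}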

We can then use this feature toggle in multiple ways:

Conditionally Excluding Beans at Buildtime

Quarkus provides us with a mechanism to conditionally exclude beans at buildtime. For example, the endpoints of an experimental feature can be enabled or disabled at build time to be included in dev builds but excluded in release builds.

For example, the ExperimentalTimeseriesRest can have a @IfBuildProperty annotation like this:

@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
@Path(Constants.EXPERIMENTAL_TIMESERIES_CONTAINERS)
@RequestScoped
@IfBuildProperty(name = ExperimentalTimeseriesFeatureToggle.TOGGLE_PROPERTY, stringValue = "true")
public class ExperimentalTimeseriesRest {
  ...
}

In this example the endpoints are only available when shepard.experimental-timeseries.enabled was true at build time.

The @IfBuildProperty annotations are evaluated at build-time. Make sure to add the property to the application.properties file during build, so that the build artifact has the same value at runtime.

See here for more information.

Connect further Configuration with a Feature Toggle

For a feature toggle, we want a single property to control it.

Control further config options

In case we need to adapt further configuration based on the feature toggle (e.g. disabling hibernate), we can reference the property like this:

shepard.experimental-timeseries.enabled=false

quarkus.hibernate-orm.active=${shepard.experimental-timeseries.enabled}

To re-enable a feature for a dev or test profile, we can then activate the toggle for these profiles.

Conditionally executing tests

In order to execute tests conditionally based on a toggle, we use the isEnabled method and its IS_ENABLED_METHOD_ID from the feature toggle class.

In a test class or method, we can then add an annotation like this:

@QuarkusTest
@EnabledIf(ExperimentalTimeseriesFeatureToggle.IS_ENABLED_METHOD_ID)
public class ExperimentalTimeseriesContainerServiceTest {
  ...
}
Feature toggles in the pipeline

Don’t mistake the build profiles dev and prod for the profiles for dev and prod images.

To have our dev environment as close to the production environment as possible, the dev images are also built using the prod profile. In order to enable the feature for a dev or prod build, we provide the feature toggle in the get-version pipeline job for dev or prod.

Make sure to provide the toggle for all pipelines and adapt it in the application.properties before building. Otherwise the value of the test profile is used, which can lead to errors.

8.10.2. Frontend (Nuxt)

This section is a short summary of this page.

Setting properties

We define environment variables in the Nuxt config like this:

export default defineNuxtConfig({
  runtimeConfig: {
    // A value that should only be available on server-side
    apiSecret: '123',
    // Values that should be available also on client side
    public: {
      apiBase: '/api'
    }
  }
})

These values can be overridden by a .env file like this:

NUXT_API_SECRET=api_secret_token
NUXT_PUBLIC_API_BASE=https://nuxtjs.org

In order to ease the configuration we provide a .env.example file with all relevant variables. That file can be copied to .env and filled with the appropriate values.

Reading Properties

Properties can be accessed using useRuntimeConfig().

8.11. Versioning

8.11.1. Introduction

As a shepard user I want to be able to use different versions of data sets to facilitate collaboration in a research project and lay the groundwork for future features like branching, visualization of differences or restore functionality.

I can define a version of a collection via the API to mark a milestone in the project, in order to freeze the status that the data set has right now. There is always one active version called HEAD, which is the working copy that can be edited on shepard, and there can be n further versions on shepard that are read-only. If I never define a version as a user, nothing changes for me functionally. Versions are identified by a UUID. Versioning applies to organizational elements, not to payload data. A version always covers a whole collection; data objects and references inherit the version from their enclosing collection. Versioning is explicit: users have to create versions actively.

8.11.2. Behavior

The following image displays a collection with references with no manually created versions.

[Diagram: base case]

After the creation of a new version, the data will look like this:

[Diagram: create version]

Semantic Annotations are copied when creating a version, just like the collection, data objects and references.

Permissions are collection-wide and across all versions.

8.11.3. Endpoints

Endpoint Description Request Body Response

POST /collections

create first collection

{
  "name": "collection1",
  "description": "first collection"
}
{
  "id": "cid1",
  "createdAt": "date1",
  "createdBy": "user1",
  "updatedAt": "date1",
  "updatedBy": "user1",
  "name": "collection1",
  "description": "first collection",
  "dataObjectIds": [],
  "incomingIds":[]
}

POST /collections

create second collection

{
  "name": "collection2",
  "description": "second collection"
}
{
  "id": "cid2",
  "createdAt": "date2",
  "createdBy": "user1",
  "updatedAt": "date2",
  "updatedBy": "user1",
  "name": "collection2",
  "description": "second collection",
  "dataObjectIds": [],
  "incomingIds":[]
}

POST /collections/cid1/versions

create first version of first collection

{
  "name": "collection1version1",
  "description": "first version of collection1"
}
{
  "uid": "collection1version1uid",
  "name": "collection1version1",
  "description": "first version of collection1",
  "createdAt": "date3",
  "createdBy": "user1",
  "predecessorUUID": null
}

GET /collections/cid1/versions

get versions of first collection

[
  {
    "uid": "collection1version1uid",
    "name": "collection1version1",
    "description": "first version of collection1",
    "createdAt": "date3",
    "createdBy": "user1",
    "predecessorUUID": null
  },
  {
    "uid": "collection1HEADVersionuid",
    "name": "HEAD",
    "description": "HEAD",
    "createdAt": "date1",
    "createdBy": "user1",
    "predecessorUUID": "collection1version1uid"
  }
]

POST /collections/cid1/dataObjects

create first dataObject in first collection

{
  "name": "collection1DataObject1",
  "description": "first dataObject of collection 1"
}
{
  "id": c1did1,
  "createdAt": "date4",
  "createdBy": "user1",
  "updatedAt": "date4",
  "updatedBy": "user1",
  "name": "collection1DataObject1",
  "description": "first dataObject of collection 1",
  "collectionId": cid1,
  "referenceIds":[],
  "successorIds": [],
  "predecessorIds": [],
  "childrenIds": [],
  "parentId": null,
  "incomingIds": []
}

POST /collections/cid1/dataObjects

create second dataObject in first collection with first dataObject in first collection as parent

{
  "name": "collection1DataObject2",
  "description": "second dataObject of collection 1",
  "parentId": c1did1
}
{
  "id": c1did2,
  "createdAt": "date5",
  "createdBy": "user1",
  "updatedAt": "date4",
  "updatedBy": "user1",
  "name": "collection1DataObject2",
  "description": "second dataObject of collection 1",
  "collectionId": cid1,
  "referenceIds":[],
  "successorIds": [],
  "predecessorIds": [],
  "childrenIds": [],
  "parentId": c1did1,
  "incomingIds": []
}

GET /collections/cid1/dataObjects?versionUID=collection1version1uid

there are no dataobjects in the first version of collection1

[]

8.11.4. Edge Case: CollectionReferences and DataObjectReferences

When we create a new version of a referenced collection, the reference will move with the HEAD and the old collection will not be referenced anymore:

[Diagram: new version of a referenced collection]

When we referenced an old version of a collection and a new version is created, the reference stays unchanged:

[Diagram: old version referenced]

POST /collections/cid2/dataObjects

create first dataObject in collection 2

{
  "name": "collection2DataObject1",
  "description": "first dataObject of collection 2"
}
{
  "id": c2did1,
  "createdAt": "date6",
  "createdBy": "user1",
  "updatedAt": "date6",
  "updatedBy": "user1",
  "name": "collection2DataObject1",
  "description": "first dataObject of collection 2",
  "collectionId": cid2,
  "referenceIds":[],
  "successorIds": [],
  "predecessorIds": [],
  "childrenIds": [],
  "parentId": null,
  "incomingIds": []
}

POST /collections/cid2/versions

create first version of collection 2

{
  "name": "collection2version1",
  "description": "first version of collection2"
}
{
  "uid": "collection2version1uid",
  "name": "collection2version1",
  "description": "first version of collection2",
  "createdAt": "date7",
  "createdBy": "user1",
  "predecessorUUID": null
}

POST /collections/cid1/dataObjects/c1did1/dataObjectReferences

create dataObjectReference from first dataObject in collection1 to first dataObject in collection2 without version

{
  "name": "refToc2do1HEAD",
  "referencedDataObjectId": c2did1,
  "relationship": "divorced"
}
{
  "id": refToc2do1HEADId,
  [...]
  "name": "refToc2do1HEAD",
  "dataObjectId": c1did1,
  "type": "DataObjectReference",
  "referencedDataObjectId": c2did1,
  "referencedVersionUid": null,
  "relationship": "divorced"
}

POST /collections/cid1/dataObjects/c1did1/dataObjectReferences

create dataObjectReference from first dataObject in collection1 to first dataObject in collection2 with version

{
  "name": "refToc2do1Version",
  "referencedDataObjectId": c2did1,
  "referencedVersionUid": "collection2version1",
  "relationship": "married"
}
{
  "id": refToc2do1VersionId,
  [...]
  "name": "refToc2do1Version",
  "dataObjectId": c1did1,
  "type": "DataObjectReference",
  "referencedDataObjectId": c2did1,
  "referencedVersionUid": "collection2version1",
  "relationship": "married"
}

GET /collections/cid2/dataObjects/c2did1

fetch referenced dataObject with incoming counter

{
  "id": c2did1,
  [...]
  "name": "collection2DataObject1",
  "description": "first dataObject of collection 2",
  "collectionId": cid2,
  "incomingIds": [refToc2do1VersionId]
}

POST /collections/cid2/versions

create second version of collection 2

{
  "name": "collection2version2",
  "description": "second version of collection2"
}
{
  "uid": "collection2version2uid",
  "name": "collection2version2",
  "description": "second version of collection2",
  [...]
  "predecessorUUID": "collection2version1uid"
}

GET /collections/cid2/dataObjects/c2did1

fetch referenced dataObject from HEAD, incoming is still the same

{
  "id": c2did1,
  [...]
  "name": "collection2DataObject1",
  "description": "first dataObject of collection 2",
  "collectionId": cid2,
  "incomingIds": [refToc2do1VersionId]
}

GET /collections/cid2/dataObjects/c2did1?versionUID=collection2version2uid

fetch referenced dataObject from version 2, incoming is now empty

{
  "id": c2did1,
  [...]
  "name": "collection2DataObject1",
  "description": "first dataObject of collection 2",
  "collectionId": cid2,
  "incomingIds": []
}

8.12. OpenAPI Specification

Quarkus provides the SmallRye OpenAPI extension. Documentation on OpenAPI and Swagger can be found here. The generated OpenAPI spec is available at /shepard/doc/swagger-ui of a running shepard backend.

8.12.1. Enhancing the Schema with Filters

The generated schemas can be adapted using filters. For example, we use this to adapt the paths to match the root path of our API.

More information on this can be found here.
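
To illustrate, a minimal filter could look like the sketch below. This is not the actual shepard implementation: the class name RootPathFilter and the root path value are assumptions, and only the path-prefixing idea mentioned above is shown. Such a filter is registered via the mp.openapi.filter configuration property.

import org.eclipse.microprofile.openapi.OASFactory;
import org.eclipse.microprofile.openapi.OASFilter;
import org.eclipse.microprofile.openapi.models.OpenAPI;
import org.eclipse.microprofile.openapi.models.Paths;

// Hypothetical filter that prefixes every generated path with the API root path.
public class RootPathFilter implements OASFilter {

  private static final String ROOT_PATH = "/shepard/api"; // assumed value, for illustration only

  @Override
  public void filterOpenAPI(OpenAPI openAPI) {
    Paths original = openAPI.getPaths();
    if (original == null || original.getPathItems() == null) {
      return;
    }
    // Rebuild the paths object with the prefixed keys.
    Paths prefixed = OASFactory.createPaths();
    original.getPathItems().forEach((path, item) -> prefixed.addPathItem(ROOT_PATH + path, item));
    openAPI.setPaths(prefixed);
  }
}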

8.12.2. Path Parameter Order

Quarkus sorts the list of path parameters in the OpenAPI spec alphabetically instead of by occurrence in the path.

For example, the following endpoint:

  @DELETE
  @Path("/{" + Constants.APIKEY_UID + "}")
  @Tag(name = Constants.APIKEY)
  @Operation(description = "Delete api key")
  @APIResponse(description = "deleted", responseCode = "204")
  @APIResponse(description = "not found", responseCode = "404")
  public Response deleteApiKey(
    @PathParam(Constants.USERNAME) String username,
    @PathParam(Constants.APIKEY_UID) String apiKeyUid
  ) {
    // Some code
  }

will lead to the following OpenAPI spec:

    delete:
      tags:
      - apikey
      description: Delete api key
      operationId: deleteApiKey
      parameters:
      - name: apikeyUid
        in: path
        required: true
        schema:
          type: string
      - name: username
        in: path
        required: true
        schema:
          type: string
      responses:
        "204":
          description: deleted
        "404":
          description: not found

Since we want the parameters to be ordered by their occurrence in the path, this behaviour is not intended and can lead to issues in generated clients.

To fix this, we define the order of path and query params in the OpenAPI spec manually using @Parameter annotations. We do this for all path and query parameters.

For the above example, the result would look like this:

  @DELETE
  @Path("/{" + Constants.APIKEY_UID + "}")
  @Tag(name = Constants.APIKEY)
  @Operation(description = "Delete api key")
  @APIResponse(description = "deleted", responseCode = "204")
  @APIResponse(description = "not found", responseCode = "404")
  @Parameter(name = Constants.USERNAME)
  @Parameter(name = Constants.APIKEY_UID)
  public Response deleteApiKey(
    @PathParam(Constants.USERNAME) String username,
    @PathParam(Constants.APIKEY_UID) String apiKeyUid
  ) {
    // Some code
  }

8.12.3. Format specifier for Datetime

When using Java's old date API (java.util.Date), the generated OpenAPI specification interprets the Date object as a date-only field.

The code snippet below:

import java.util.Date;

public class SomeClass {
  @JsonFormat(shape = JsonFormat.Shape.STRING)
  private Date createdAt;
}

generates the following OpenAPI YAML snippet:

SomeClass:
  type: object
  properties:
    createdAt:
      format: date
      type: string
      example: 2024-08-15

However, a Date object stores both date and time, so the OpenAPI specification should describe it as a value that handles both date and time.

To achieve this, the createdAt field in the example needs to explicitly specify that the generated format should be date-time. The code snippet below adds a @Schema annotation, which specifies the format field:

import java.util.Date;

public class SomeClass {
  @JsonFormat(shape = JsonFormat.Shape.STRING)
  @Schema(format = "date-time", example = "2024-08-15T11:18:44.632+00:00")
  private Date createdAt;
}

This annotation results in the following OpenAPI specification containing a date-time format:

SomeClass:
  type: object
  properties:
    createdAt:
      format: date-time
      type: string
      example: 2024-08-15T11:18:44.632+00:00

In summary, when using the old java.util.Date API, the format must be specified explicitly via the @Schema annotation so that the OpenAPI specification declares a date-time field.

8.12.4. Correct multipart file upload for Swagger UI

To enable multipart file uploads in Swagger, the following schema is expected (Source):

requestBody:
  content:
    multipart/form-data:
      schema:
        type: object
        properties:
          filename:
            type: array
            items:
              type: string
              format: binary

So far, we have not been able to reproduce this exact schema together with a working file upload in the Swagger UI, especially since we require the filename property to be non-null and required. With annotations alone, we could not reproduce this schema in Quarkus.

However, with the following construct of interfaces and classes, and by using the implementation field of the @Schema annotation, we were able to achieve both a working file upload in the Swagger UI and a proper OpenAPI schema.

@POST
@Consumes(MediaType.MULTIPART_FORM_DATA)
public Response createFile(
  MultipartBodyFileUpload body
) {
  // ... file handling code ...
}

@Schema(implementation = UploadFormSchema.class)
public static class MultipartBodyFileUpload {
  @RestForm(Constants.FILE)
  public FileUpload fileUpload;
}

public class UploadFormSchema {
  @Schema(required = true)
  public UploadItemSchema file;
}

@Schema(type = SchemaType.STRING, format = "binary")
public interface UploadItemSchema {}

This generates the following openapi specification:

paths:
  /examplepath:
    post:
      requestBody:
        content:
          multipart/form-data:
            schema:
              $ref: "#/components/schemas/MultipartBodyFileUpload"

components:
  schemas:
    MultipartBodyFileUpload:
      $ref: "#/components/schemas/UploadFormSchema"
    UploadFormSchema:
      required:
      - file
      type: object
      properties:
        file:
          $ref: "#/components/schemas/UploadItemSchema"
    UploadItemSchema:
      format: binary
      type: string

This specification is rather complex and nested, but it allows, for example, marking the file property as required, which is then rendered as a required field in the Swagger UI.

One drawback of this approach is that the construct of MultipartBodyFileUpload, UploadFormSchema and UploadItemSchema is needed for every REST endpoint that utilizes a multipart file upload.

The solution is a combination of these two resources:

8.13. Generated Backend Clients

In order to ease the usage of the backend API, we maintain and publish generated backend clients. They are generated using the OpenAPI Generator.

We currently build and publish clients for Java, Python and TypeScript as part of our release process. In addition to the OpenAPI diff job, there are jobs that check whether the generated TypeScript and Python clients have changed.

In the past, a python-legacy client and a Cpp client were published. Both have been discontinued.

8.13.1. Backend Client for shepard Frontend

In order to support concurrent development of frontend and backend we decided to put the generated client for the frontend under version control (ADR-007 Client Generation for Frontend). The client can be found under backend-client. Its exported members can be imported in frontend files like this:

import { SemanticRepositoryApi, type CreateSemanticRepositoryRequest } from "@dlr-shepard/backend-client";
import { getConfiguration } from "./serviceHelper";

export default class SemanticRepositoryService {
  static createSemanticRepository(params: CreateSemanticRepositoryRequest) {
    const api = new SemanticRepositoryApi(getConfiguration());
    return api.createSemanticRepository(params);
  }
}
(Re)generating the Client

In case the API has changed or a new version of the OpenAPI Generator is to be used, the client has to be regenerated. This can be done by running the following command in the top-level directory. Be aware that a local Java installation is required for the command to run successfully.

npm run client-codegen

The script will also persist the OpenAPI specification used for generation. Afterwards, the frontend code may have to be adjusted.

In order to check whether the client is up to date, a pipeline job compares the generator version as well as the current OpenAPI specification with the ones used for generation.

8.14. Testing Strategy

To automatically test shepard, several strategies are used. They are described in this section.

8.14.1. Unit Tests

We use JUnit 5 for unit testing our backend code. We aim to cover everything except the endpoints with unit tests.

@QuarkusTest with Running Databases

For special cases, we use tests with the @QuarkusTest annotation to test beans of a running Quarkus instance with running databases. This is especially used for behaviour that is strongly coupled to the databases, in order to reduce the need for mocking and to get more precise test results. These tests are executed in a separate pipeline job that provides the needed databases.
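
A minimal sketch of such a test is shown below. The bean CollectionDao and its methods are hypothetical stand-ins for a database-coupled component; the real tests differ in names and setup.

import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertNotNull;

// Runs against a started Quarkus instance; the configured databases must be available.
@QuarkusTest
class CollectionDaoTest {

  // Hypothetical bean that is strongly coupled to the database.
  @Inject
  CollectionDao collectionDao;

  @Test
  void createdCollectionCanBeReadBack() {
    var created = collectionDao.createCollection("my collection");
    assertNotNull(collectionDao.findById(created.getId()));
  }
}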

8.14.2. Integration Tests

To test the overall functionality of the backend, we test our HTTP endpoints with integration tests using @QuarkusIntegrationTest. In the pipeline, these tests run against a Quarkus instance based on the build artifact produced earlier in the same pipeline.
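
As an illustration, a minimal endpoint test could look like the sketch below; the path and the expected status code are assumptions, and authentication is omitted.

import io.quarkus.test.junit.QuarkusIntegrationTest;
import io.restassured.RestAssured;
import org.junit.jupiter.api.Test;

// Exercises the HTTP endpoint of the packaged application built in the pipeline.
@QuarkusIntegrationTest
class CollectionsEndpointIT {

  @Test
  void listingCollectionsReturnsOk() {
    RestAssured.given()
      .when().get("/collections")
      .then().statusCode(200);
  }
}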

Integration Tests utilizing External REST APIs

Some of the integration tests in the backend rely on an external REST API. One example is the set of integration tests that utilize semantic annotations, such as the CollectionSearcherIT. This test includes creating a semantic repository with an external ontology service and executing requests against this endpoint.

This introduces an external dependency into our integration tests that we cannot control. If the external service is not available, the HTTP connection runs into a timeout and the whole integration test fails, even though this is not related to our backend code. Since we want to test whether the health check against the external service works, we cannot simply replace the health check function with a mocked version. By introducing WireMock, it is possible to mock the HTTP response itself.

WireMock is a testing framework that maps pre-defined HTTP responses to specific HTTP requests. It acts as a simple HTTP server in the background and allows defining rules that match incoming HTTP requests to those responses. For example, this code snippet in WireMockResource.java mocks the health check against an external ontology service:

wireMockServer.stubFor(
      // stub for health check on: https://dbpedia.org/sparql/
      get(urlPathEqualTo("/sparql"))
        .withQueryParam("query", equalTo("ASK { ?x ?y ?z }"))
        .willReturn(aResponse().withStatus(200).withBody("{ \"head\": {\"link\": []  }, \"boolean\": true }"))
    );

The rule in this snippet reads as follows: for every GET request to localhost:PORT/sparql with the query parameter query=ASK { ?x ?y ?z }, the WireMock HTTP server returns an HTTP response with status code 200 containing the JSON string "{ \"head\": {\"link\": [] }, \"boolean\": true }" in its body.

Since we are using Quarkus as our backend framework, we utilize the Quarkus WireMock extension. This extension allows easier integration into an existing Quarkus application and directly supports injecting a WireMock server into an integration test. Generally, injection is our preferred way to initialize objects.

However, injection is not used in our current WireMock setup due to limitations of our concrete scenario, namely usage in static functions, where injection is not possible. Therefore, we utilize WireMock in a static approach.
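
A minimal sketch of such a static setup is shown below; the class name is hypothetical and the real WireMockResource differs, but the idea of starting the server once and accessing it statically is the same.

import com.github.tomakehurst.wiremock.WireMockServer;
import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.options;

// Sketch of the "static approach": the server is created and started once
// and reached through a static accessor instead of being injected.
public final class StaticWireMock {

  private static WireMockServer wireMockServer;

  private StaticWireMock() {}

  public static synchronized WireMockServer get() {
    if (wireMockServer == null) {
      wireMockServer = new WireMockServer(options().dynamicPort());
      wireMockServer.start();
    }
    return wireMockServer;
  }
}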

A proper way to integrate WireMock into a Quarkus integration test is described in the extension’s introduction page and also by the official Quarkus guide for test resources.

WireMock is a powerful tool and provides many options to mock complex web services. Among other things, it supports response templates, provides proxying to forward specific requests, and supports technologies such as gRPC, GraphQL, HTTPS and JWT.

8.14.3. Load and Performance Tests

We use Grafana k6 for load and performance tests. The tests live in the load-tests directory and are written in TypeScript. They can only be triggered from a local development computer but can be configured to run against the local or the dev environment.

Configuration
  • Create a file under load-tests/mount/settings.json

  • Copy contents from load-tests/mount/settings.example.json

  • Adapt configuration settings as needed

  • Run npm install in load-tests/

Execute tests

There is a shell script run-load-test.sh that can be used to execute load tests. It takes the test file to execute as its first parameter.

./run-load-test.sh src/collections/smoke-test.ts
Good to know
  • Webpack is used for bundling all dependencies into the test file.

  • Webpack uses ts-loader to transpile TypeScript to JavaScript.

  • k6 does not run in a Node.js environment, therefore some functionality is not available.

  • webpack.config.js identifies entry points (tests) dynamically. All *.ts files are added automatically as long as they are not located in the utils folder.

  • k6 pushes some metrics to Prometheus after test execution.

  • To run the tests against a locally running backend on Linux, you need to put the IP address into settings.json.

8.15. Shepard Exceptions

plantUMLexceptions

When an exception is thrown, it should in most cases be of the abstract type ShepardException. The ShepardExceptionMapper handles such exceptions and informs the user about them in a human-readable way.

Currently, there are four different sub-types of the ShepardException:

  • InvalidAuthException: thrown when a resource is accessed without sufficient permissions.

  • InvalidRequestException: used when a request misses required information or is otherwise invalid.

  • ShepardParserException: used by the search package to indicate that a search query could not be parsed.

  • ShepardProcessingException: indicates an arbitrary issue while processing a request.
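
For illustration, a JAX-RS mapper for these exceptions could look like the sketch below. This is not the actual ShepardExceptionMapper; the status code mapping is an assumption and only serves to show the idea of translating a ShepardException into a human-readable response.

import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.ext.ExceptionMapper;
import jakarta.ws.rs.ext.Provider;

// Illustrative sketch only; the real mapper may choose different status codes and formats.
@Provider
public class ShepardExceptionMapperSketch implements ExceptionMapper<ShepardException> {

  @Override
  public Response toResponse(ShepardException exception) {
    // Assumed mapping: invalid requests become 400, everything else 500.
    int status = exception instanceof InvalidRequestException ? 400 : 500;
    return Response.status(status)
      .type(MediaType.TEXT_PLAIN)
      .entity(exception.getMessage())
      .build();
  }
}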

8.16. Subscription Feature

The subscription feature allows users to react to certain events. Shepard defines some REST endpoints as subscribable. Users can then subscribe to requests handled by these endpoints. In addition to a specific endpoint, users have to specify a regular expression that is matched against the respective URL, as well as a callback endpoint. The callback endpoint is called by shepard when a subscribable endpoint is triggered and the regular expression matches. The callback contains the respective subscription, the actually called URL and the ID of the affected object. The callback itself is executed asynchronously to avoid slowing down the response to the request in question.

A common use case for this feature is the automatic conversion of certain data types. For example, if a user wants to know about every file uploaded to a specific container, they would create a subscription in the following form:

{
  "name": "My Subscription",
  "callbackURL": "https://my.callback.com",
  "subscribedURL": ".*/files/123/payload",
  "requestMethod": "POST"
}

Once shepard has received a matching request, it sends the following POST request to the specified callback URL https://my.callback.com:

{
  "subscription": {
    "name": "My Subscription",
    "callbackURL": "https://my.callback.com",
    "subscribedURL": ".*/files/123/payload",
    "requestMethod": "POST"
  },
  "subscribedObject": {
    "uniqueId": "123abc"
  },
  "url": "https://my.shepard.com/shepard/api/files/123/payload",
  "requestMethod": "POST"
}
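
To illustrate the consumer side, a callback receiver could look like the sketch below. It is a hypothetical JAX-RS endpoint; the payload classes only mirror the JSON structure shown above, and the path is an assumption.

import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

// Hypothetical service that receives shepard subscription callbacks.
@Path("/shepard-callback")
public class SubscriptionCallbackResource {

  // Simple data classes mirroring the callback payload shown above.
  public static class Subscription {
    public String name;
    public String callbackURL;
    public String subscribedURL;
    public String requestMethod;
  }

  public static class SubscribedObject {
    public String uniqueId;
  }

  public static class CallbackPayload {
    public Subscription subscription;
    public SubscribedObject subscribedObject;
    public String url;
    public String requestMethod;
  }

  @POST
  @Consumes(MediaType.APPLICATION_JSON)
  public Response onCallback(CallbackPayload payload) {
    // React to the event, e.g. trigger a conversion job for the uploaded file.
    System.out.println("Received callback for object " + payload.subscribedObject.uniqueId + " via " + payload.url);
    return Response.noContent().build();
  }
}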

8.17. Theming

We use Vuetify as our component library, so we follow its theme configuration guide.

8.17.1. Global definitions and overrides

Global definitions such as the font we use and the typography are defined as Sass variables in 'nuxtend/styles/settings.scss'. Global overrides of component-specific properties are also defined there.

8.17.2. Theme colors

The theme itself, which mainly contains colors, is defined in 'nuxtend/plugins/vuetify.ts'. The colors are taken from the style guide that resides in Figma.

8.17.3. Styling individual components

There are multiple ways to style Vue components. We agreed on the following order of preference when styling components.

  1. Use global overrides with Sass variables if all components of the same type are affected.

  2. Use properties of the components if they exist, e.g. VButton has a 'color' property.

  3. Use the class property of components to apply predefined CSS helper classes. The Vuetify documentation lists the available utility classes under 'Styles and animations'.

  4. Use the <style> tag to override CSS classes directly.

8.18. Session Management

As soon as a user authenticates, a session is created. We use the session mainly to store the tokens and some user information. We DO NOT persist the session anywhere on the server. As soon as the server restarts or the session ends, the information is lost.

In order to store user-specific data such as favorites or user selections, we make use of the browser's storage.

8.18.1. Local storage

To access the browser's local storage we make use of VueUse. Its useStorage function gives us access to the local storage. With a key we can access the storage and fetch already stored data. If no data is found, it falls back to the default value, which can be provided in the function parameters.

const state = useStorage('my-store', {hello: 'hi', greeting: 'Hello' })

8.18.2. Session storage

To use session storage instead, simply pass sessionStorage as the third parameter.

const state = useStorage('my-store', {hello: 'hi', greeting: 'Hello' }, sessionStorage)

9. Architecture Decisions

9.1. ADR-000 API V2

26.08.2024: With the changes implemented in this MR, some of the endpoint definitions went through minor changes.

Date

2021

Status

Done

Context

1. Ideas

  • Major(!) rework of data structure

  • following Fair Digital Object (FDO)

  • https://www.nist.gov/programs-projects/facilitating-adoption-fair-digital-object-framework-material-science

  • in this context → PID (Persistent Identifier) and DTR (Data Type Registry)

  • Collections

  • Collections contain data-objects

  • enables main data (Stammdaten), personal data collections

  • Basic Permissions can be managed via collections

  • References

  • https://www.rd-alliance.org/groups/research-data-collections-wg.html

  • http://rdacollectionswg.github.io/apidocs/#/

  • https://github.com/RDACollectionsWG/specification

  • DataObjects

  • While collections are high-level objects to manage things, DataObjects are there to aggregate related data references

  • DataObjects can be used to model a measurement, a situation, a component, etc.

  • Has relationships with other DataObjects (hierarchical and chronological)

  • References

  • Points to one or more datasets within one container

  • Expresses a part-of relationship between the dataset and the parent data object, as opposed to EntityReference which only references another entity

  • Container

  • A container is a separate area within an internal database (e.g. a database in InfluxDB or a collection in MongoDB)

  • Detailed permissions for these internal databases are managed via containers

  • To store data, a container must be created beforehand

  • Data can only be stored within an existing container

Solution

1. Endpoints

Organisational Entities
  • /collections - get all collections, create a collection

  • /collections/<id> - get/update/delete a specific collection

  • /collections/<id>/dataObjects/<id> - get/update/delete a specific DataObject

  • /collections/<id>/dataObjects/<id>/references - get all references of the given DataObject

  • /collections/<id>/dataObjects/<id>/references/<id> - get/update/delete a specific reference

User
  • /user - get the current User

  • /user/<username> - get a specific user

  • /user/<username>/apikeys - get all API keys, create an API key

  • /user/<username>/apikeys/<id> - get/update/delete a specific API key

  • /user/<username>/subscriptions - get all subscriptions

  • /user/<username>/subscriptions/<id> - get/update/delete a specific subscription

Database Integrations

The following endpoints exist optionally for each kind of data:

Structured Data Container
  • /structureddata - create structured data container

  • /structureddata/<id> - get/update/delete a specific structured data container

  • /structureddata/<id>/search - backend search service for structured data container

  • /structureddata/<id>/payload - upload a new structured data object

  • /collections/<id>/dataObjects/<id>/structureddataReferences - get all references of the given DataObject, create a new structured data reference

  • /collections/<id>/dataObjects/<id>/structureddataReferences/<id> - get/update/delete a specific structured data reference

  • /collections/<id>/dataObjects/<id>/structureddataReferences/<id>/payload - get the payload of a specific structured data reference

File Container
  • /file - create file container

  • /file/<id> - get/update/delete a specific file container

  • /file/<id>/payload - upload a new file

  • /collections/<id>/dataObjects/<id>/fileReferences - get all references of the given DataObject, create a new file reference

  • /collections/<id>/dataObjects/<id>/fileReferences/<id> - get/update/delete a specific file reference

  • /collections/<id>/dataObjects/<id>/fileReferences/<id>/payload - get the payload of a specific file reference

Timeseries Container
  • /timeseries - create timeseries databases

  • /timeseries/<id> - get/update/delete a specific timeseries

  • /timeseries/<id>/payload - upload new timeseries

  • /collections/<id>/dataObjects/<id>/timeseriesReferences - get all references of the given DataObject, create a new timeseries reference

  • /collections/<id>/dataObjects/<id>/timeseriesReferences/<id> - get/update/delete a specific timeseries reference

  • /collections/<id>/dataObjects/<id>/timeseriesReferences/<id>/payload - get the payload of a specific timeseries reference

2. Filtering

Some filter options can be implemented:

  • /collections/<id>/dataObjects/<id>/structureddataReferences?fileName=MyFile

  • /collections/<id>/dataObjects/<id>/fileReferences?fileName=MyFile&recursive=true - also searches for references of its sub-entities

  • /collections/<id>/dataObjects/<id>/timeseriesReferences/<id>/attachment?field=value&symbolicName=temperature_A1 - filter timeseries by attributes

  • …​

3. Behaviour

When a generated API client is used and existing objects are modified, only explicitly modified properties should be changed.

Example:

TypeA objA = api.getTypeA(...);
objA.setParameterX(...);
api.updateTypeA(objA);

In this example, only ParameterX should be modified, all other fields, relations, etc. should remain untouched.

4. Entities

This is an internal class diagram. Some attributes are hidden or changed for the user.

class diagram

5. Example Structures

The following structures are examples that demonstrate the user’s view of entities.

Collection
{
  "id": 0,
  "createdAt": "2021-05-21T11:30:53.411Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:30:53.411Z",
  "updatedBy": "string",
  "name": "string",
  "description": "string",
  "attributes": {
    "additionalProp1": "string",
    "additionalProp2": "string",
    "additionalProp3": "string"
  },
  "incomingIds": [0],
  "dataObjectIds": [0]
}
DataObject
{
  "id": 0,
  "createdAt": "2021-05-21T11:31:14.846Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:31:14.846Z",
  "updatedBy": "string",
  "name": "string",
  "description": "string",
  "attributes": {
    "additionalProp1": "string",
    "additionalProp2": "string",
    "additionalProp3": "string"
  },
  "incomingIds": [0],
  "collectionId": 0,
  "referenceIds": [0],
  "successorIds": [0],
  "predecessorIds": [0],
  "childrenIds": [0],
  "parentId": 0
}
BasicReference
{
  "id": 0,
  "createdAt": "2021-05-21T11:31:42.658Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:31:42.658Z",
  "updatedBy": "string",
  "name": "string",
  "dataObjectId": 0,
  "type": "string"
}
CollectionReference(BasicReference)
{
  "id": 0,
  "createdAt": "2021-05-21T11:32:00.172Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:32:00.172Z",
  "updatedBy": "string",
  "name": "string",
  "collectionId": 0,
  "type": "DataObjectReference",
  "referencedDataObjectId": 0,
  "relationship": "string"
}
DataObjectReference(BasicReference)
{
  "id": 0,
  "createdAt": "2021-05-21T11:32:00.172Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:32:00.172Z",
  "updatedBy": "string",
  "name": "string",
  "dataObjectId": 0,
  "type": "DataObjectReference",
  "referencedDataObjectId": 0,
  "relationship": "string"
}
URIReference(BasicReference)
{
  "id": 0,
  "createdAt": "2021-05-21T11:32:28.143Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:32:28.143Z",
  "updatedBy": "string",
  "name": "string",
  "dataObjectId": 0,
  "type": "URIReference",
  "uri": "https://my-website.de/my_data"
}
TimeseriesReference(BasicReference)
{
  "id": 0,
  "createdAt": "2021-05-21T11:32:54.209Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:32:54.209Z",
  "updatedBy": "string",
  "name": "string",
  "dataObjectId": 0,
  "type": "TimeseriesReference",
  "start": 0,
  "end": 0,
  "timeseries": [
    {
      "measurement": "string",
      "device": "string",
      "location": "string",
      "symbolicName": "string",
      "field": "string"
    }
  ],
  "timeseriesContainerId": 0
}
TimeseriesContainer
{
  "id": 0,
  "createdAt": "2021-05-21T11:33:41.642Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:33:41.642Z",
  "updatedBy": "string",
  "name": "string",
  "database": "string"
}
TimeseriesPayload
{
  "timeseries": {
    "measurement": "string",
    "device": "string",
    "location": "string",
    "symbolicName": "string",
    "field": "string"
  },
  "points": [
    {
      "value": {},
      "timestamp": 0
    }
  ]
}
FileReference(BasicReference)
{
  "id": 0,
  "createdAt": "2021-05-21T11:50:40.071Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:50:40.071Z",
  "updatedBy": "string",
  "name": "string",
  "dataObjectId": 0,
  "type": "FileReference",
  "files": [
    {
      "oid": "string"
    }
  ],
  "fileContainerId": 0
}
FileContainer
{
  "id": 0,
  "createdAt": "2021-05-21T11:52:49.642Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:52:49.642Z",
  "updatedBy": "string",
  "name": "string",
  "oid": "string"
}
FilePayload

There is no such thing as a file payload, since a file is always treated as a binary stream

StructuredDataReference(BasicReference)
{
  "id": 0,
  "createdAt": "2021-05-21T11:50:40.071Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:50:40.071Z",
  "updatedBy": "string",
  "name": "string",
  "dataObjectId": 0,
  "type": "StructuredDataReference",
  "structuredDatas": [
    {
      "oid": "string"
    }
  ],
  "structuredDataContainerId": 0
}
StructuredDataContainer
{
  "id": 0,
  "createdAt": "2021-05-21T11:52:49.642Z",
  "createdBy": "string",
  "updatedAt": "2021-05-21T11:52:49.642Z",
  "updatedBy": "string",
  "name": "string",
  "mongoid": "string"
}
StructuredDataPayload
{
  "structuredData": {
    "oid": "string"
  },
  "json": "string"
}

9.2. ADR-001 Monorepository

Date

12.06.2024

Status

Done

Context

Currently the project is spread across multiple repositories for architecture work, backend, deployment, documentation, frontend, publication, releases and further tools of the ecosystem (shepard Timeseries Collector). This means increased effort when working with the repositories, especially for feature development that concerns both the backend and the frontend. Also, the documentation is not as close to the code as it could (reasonably) be.

Possible Alternatives

  1. Leave repositories as they are.

    • This means the current downsides persist.

    • No effort to implement

  2. Migrate the repositories except the shepard Timeseries Collector to a monorepo named shepard.

    • A dev setup for working on both frontend and backend could be set up more easily. This can enhance development speed

    • Documentation is closer to the code. The probability of outdated, duplicated or even contradicting documentation(s) is decreased.

    • Fewer repositories have to be handled when working on or with shepard

Decision

We decided to migrate all repositories except the shepard Timeseries Collector. The commit history, open issues, wikis and pipelines should be migrated to the monorepo.

Consequences

The monorepo has to be set up and the previous projects have to be migrated.

9.3. ADR-002 Backend Technology

Date

02.07.2024

Status

Done

Context

The purpose of shepard's backend is to provide a generic REST interface for its frontend or external communication partners to retrieve and store data in different formats. This data includes various research data (e.g. timeseries data, structured documents, files) and a connecting metadata structure (graph structure). To persist the data it uses Neo4j, MongoDB and InfluxDB databases.

The backend of shepard is implemented as a basic Jakarta EE application using Jakarta Servlet and Tomcat. Additional libraries are selected individually, added on top, and their interoperability is checked manually. There is no dependency injection. Due to its purpose, the backend does not contain a lot of business logic; it rather functions as an adapter for the data types.

To replace the current approach, a framework should be chosen that provides more structure and a robust, future-proof architecture.

The tender for the extension of shepard listed the following requirements for a new framework:

  • be in broad use

  • have an active open source community

  • have detailed and up-to-date documentation

  • offer good integration for the databases and tools currently in use

  • provide a good developer experience

  • offer an easy test and development setup

  • integrate newer Java or DB versions quickly

  • avoid vendor lock-in

  • integrate easily with the frontend

As the databases in use might change in the near future, not too much time should be spent on migrating the code concerned.

Possible Alternatives

The comparison of alternatives can be found in the appendix.

Decision

We decided to go with Quarkus because it uses established standards, as opposed to Spring Boot, which defines its own. It also feels more modern while still being in broad use.

Consequences

  • The application has to be migrated to Quarkus

  • Although we migrate, the REST interface has to remain stable

  • There will be breaking changes for administrators because configuration options may differ

  • Knowledge on Quarkus has to be shared in the xiDLRaso DEV team

  • We need to define a migration path that includes development still in progress on the current backend and avoids duplicate work

9.3.1. Appendix

Table 1. Comparison of possible technologies
Keep current setup Spring Boot Quarkus Javalin Micronaut Non-Java Backend

Migration effort

No effort

(+) Medium Effort

(+) Medium Effort

(-) Large Effort (Most things have to be added manually)

Medium Effort

(-) Huge effort (everything has to be rewritten)

Migration benefit

(-) No benefit

(+) Big Benefit, Batteries included (Hibernate integration, Security tools, Dependency Injection out of the box)

(+) Big Benefit, Batteries included (Hibernate integration, Security tools, Dependency Injection out of the box)

(-) Low benefit, most things still have to be manually integrated (e.g. database clients & hibernate connection)

-

-

Rest API migration effort

-

(+) Medium Effort

(+) Medium Effort

Medium Effort

-

-

Broad use and active community

-

(++)Widely used, Huge Community

(+) In productive use, e.g. Keycloak. Medium but growing community

Medium but growing community

(-) Small, growing community, small Project

-

Detailed and Up-To-Date Documentation

-

(++) Detailed docs, lots of questions on stackoverflow (some of them may be outdated)

(+) Tutorials & Guides provided by Quarkus, some resources on stackoverflow

(+)

-

-

Good Integration for REST Interface, Neo4j, MongoDB, InfluxDB, potentially PostgreSQL

-

(+)

  • Plugin Available:

    • Neo4j

    • MongoDB

  • Via Hibernate:

    • PostgreSQL

  • Client Available:

    • InfluxDB

(+)

  • Plugin Available:

    • Neo4j

    • MongoDB

  • Via Hibernate:

    • PostgreSQL

  • Client Available:

    • InfluxDB

  • neo4j: (!) Only kotlin interface

  • mongoDB: (+)mongoDB client

  • Influx: (+) InfluxDB client

  • PostgreSQL via Hibernate (+)

  • Manual integration of hibernate needed (!)

(+)

  • Plugin Available:

    • Neo4j

    • MongoDB

  • Via Hibernate:

    • PostgreSQL

  • Client Available:

    • InfluxDB

-

Developer Experience

-

(+) Great

(++) Best

Lots of boilerplate code, lots of integrations you have to write yourself

-

-

Easy dev tooling

-

(+) Fully integrated with IntelliJ Ultimate Support for Eclipse, VSCode, etc.

(+) Fully integrated with IntelliJ Ultimate Support for Eclipse, VSCode, etc.

(+) Standard Support for IntelliJ and Eclipse, no extra functionality e.g. Testing, Modelling, …

-

-

Testability

-

(+) Very flexible out of the box tools for Unit and Integration tests. However, “real” e2e tests need a framework like Spock, Cucumber or Cypress

(+) Extensive support for different testing mechanisms. Also expandable with other testing tools/frameworks. Real e2e tests may need a separate framework

  • Unit Tests via mockito

  • Functional/integration tests via javalin.testtools

  • e2e & UI tests via selenium

-

-

Scalability

-

(+) Established support for kubernetes

(++) Great support for containerization, kubernetes and microservices. Startup time (e.g. for autoscaling is very fast)

-

-

Performance

-

(+) Similar to Quarkus

(+) Similar to Spring Boot. Extremely fast because of GraalVM, native image support and a small footprint

Small codebase and completely customizable

-

-

Ease of Updates

-

(+) Provides ways to analyze potentially breaking changes & diagnoses the Project

(+) Provides ways to analyze potentially breaking changes & diagnoses the Project

-

-

HTTP Endpoint Standard

-

(+) Spring

(++) JAX-RS (same standard as currently in use)

-

-

No Vendor Lock

-

(!) The "beans architecture" is used in other software too. Using Spring Boot includes using its features, so some "lock-in" will be there. But it is an open source framework.

(!) Using Quarkus includes using its features, so some "lock-in" will be there. But it is an open source framework.

-

-

Frontend easily integratable

-

(+) A REST or GraphQL API can be provided e.g. for a VueJS App.

(+) A REST or GraphQL API can be provided e.g. for a VueJS App.

-

-

Dependency Injection Pattern

-

(+) included

(+) included

(-) Only if self implemented

-

-

Singleton Pattern

-

(+) Beans

(+)

-

-

OIDC support for external ID provider

-

(+) There is an OAuth2 client for spring boot

(+) There is an oidc plugin

-

-

OIDC support with integrated ID provider

-

(+) Yes, with spring authorization server

(!) All resources on using OIDC with Quarkus expect a separate OIDC provider.

Experience in the DEV Team

(!) Limited experience in small projects

(!) Limited experience in small projects

Gut Feeling

More modern, maybe less technical debt inside

9.4. ADR-003 All-in-One Image

Date

06.08.2024

Status

Done

Context

Currently, shepard's front- and backend are built, published and run as two separate containers. This leads to effort for administrators because they have to maintain two services in their docker compose file. Even with an integrated image, administrators would still need to maintain a docker-compose file for the databases and reverse proxy.

Exposing two images of basically the same implementation exposes an implementation detail of shepard to users. Backend and frontend always have the same version number as they share a release process. This could be mitigated by adding a variable to the docker-compose file.

Both services have similarities in their configuration, e.g. they both need the OIDC authority. The frontend receives the backend URL (which the backend could also use, e.g. for generating an OpenAPI spec with the base URL).

Usually Docker containers should follow the single responsibility principle and have one process per container. From https://docs.docker.com/config/containers/multi-service_container/:

It’s ok to have multiple processes, but to get the most benefit out of Docker, avoid one container being responsible for multiple aspects of your overall application. You can connect multiple containers using user-defined networks and shared volumes.

The frontend does not have its own process apart from nginx, since it consists only of static HTML, CSS and JavaScript files.

Scaling is easier with separate images. Since there is not a lot of server-side load in the current frontend, individual scaling is not important.

Building an integrated image involves more effort than publishing two separate images following the best practices of their frameworks.

If future frontend developments add separate UIs, additional effort for administrators or additional integration effort will be necessary.

As a full stack developer, I want the current version of the frontend in order to develop vertical features.

Possible Alternatives

  1. Keep separate images for frontend and backend

    • No change necessary

    • Easier to maintain from a dev perspective

    • Admins still have to maintain two images

  2. Merge front- and backend in one image

    1. Adding nginx and the static frontend files to the backend image

      • Violates the one process per container principle

    2. Putting the frontend into the Backend

    3. Keeping the frontend separate and adding the compiled files to Quarkus before bundling

      • Frontend-Development stays as simple as currently

      • There may be differences between dev and deployed frontend build-wise

    4. Publish a frontend package and include it in Backend

      • Similar to home-assistant

Decision

We keep the separate images for now and will revisit the topic when we work on facilitating deployment. By then, we expect to have a new frontend setup, so we also save duplicate effort by postponing the topic for now.

Consequences

  • We will switch the backend and frontend images to the monorepo immediately.

  • We will add a version variable in the infrastructure repo to ease switching the version for administrators

9.5. ADR-004 Prefer Constructor Injection

Date

19.08.2024

Status

Done

Context

Quarkus supports dependency injection, which we want to use. It enables loosely coupled components, providing better flexibility, modifiability and testability. In general, there are two ways to use DI: constructor injection and member injection.

Decision

We decided to use constructor injection.

  • When creating an instance manually we directly see the dependencies.

  • There are existing components that make use of injected configuration values and services in the constructor. That is not possible with member injection because the constructor is executed first.

  • During the migration process we can directly see if someone is still using the default constructor, which should not be the case.

Possible Alternatives

Using the @Inject annotation on non-private members.

  • Quarkus does not recommend injection on private members. In that case Quarkus has to use reflection, which conflicts with the optimizations for native images.

  • When developers create an instance manually using the default constructor, the dependent beans are not set, which leads to confusing NullPointerExceptions at runtime.

Consequences

  • An additional parameterless constructor must be defined.

  • Adding a new dependent service means to modify the existing constructor.
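
A minimal sketch of the resulting pattern is shown below; the class names are hypothetical. It shows constructor injection together with the additional parameterless constructor mentioned in the consequences.

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

// Hypothetical bean using constructor injection.
@ApplicationScoped
public class CollectionService {

  private final CollectionDao collectionDao;

  // Additional parameterless constructor required for CDI proxying; not meant for manual use.
  protected CollectionService() {
    this(null);
  }

  @Inject
  public CollectionService(CollectionDao collectionDao) {
    this.collectionDao = collectionDao;
  }
}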

9.6. ADR-005 Frontend Tech Stack

Date

30.08.2024

Status

Done

Context

shepard provides a frontend. Until now, the frontend basically provides a UI for the backend API. In the future, the frontend will provide useful features easing the interaction with shepard, especially for non-tech-savvy users.

The application may also plot timeseries data. The available data may contain many more data points than are required for a rendered graphic.

We don’t necessarily need SEO or server-side rendering.

We also want to achieve benefits for API users when developing server-side code for the frontend.

Since there are not many developers working on the project, maintainability is very important.

For OIDC, authentication with client secrets may be needed to operate with more OIDC providers.

We want an easy to understand and maintainable structure for the frontend.

The current frontend is written in Vue.js 2, which reached end of life at the end of 2023. It already uses the Composition API to ease migration to Vue.js 3. It is not possible to update to TypeScript 5 due to incompatibilities. When updating to Vue.js 3, Vue Router and Vuex have to be updated or replaced.

Because of the already existing frontend and the experience in the dev team, we want to stay in the Vue ecosystem.

As a UI library, BootstrapVue (based on Bootstrap 4) is used. Bootstrap 4 reached end of life at the beginning of 2023. BootstrapVue is incompatible with Bootstrap 5 and cannot be updated.

Possible Alternatives

In this ADR, two decisions are made: one for the JavaScript framework of the frontend and another for the UI library.

See here for the comparison of frameworks.

See here for the comparison of UI libraries.

Decision

We decided to use Nuxt as the JavaScript framework because of its broad use and its opinionated defaults and structure, while still being open for extension, e.g. to choose the best UI library available.

As the UI library we choose Vuetify based on its versatility and broad use.

Consequences

  • Because of the number of updated or replaced dependencies, we will set up a fresh frontend with the desired tech stack and migrate the components step by step

  • The migration of the current frontend functionality has to be planned

  • Current and new frontends have different Vue versions, but npm workspaces are not able to properly handle different versions of the same dependency in two workspaces. To overcome this issue, we exclude the frontend from the workspaces and adjust the pipelines. This causes the least effort and does not hinder active development.

  • We need to make sure to distribute Nuxt knowledge in the team

9.6.1. Appendix

Table 2. Comparison of possible frameworks
Vue.js 3 Vue.js 3 + Nuxt.js Vue.js 3 + Quasar

Short description

Vue.js is an open-source front end JavaScript framework for building user interfaces and single-page applications.

Nuxt is a free and open source JavaScript framework based on Vue.js, Nitro and Vite. The framework is advertised as a "Meta-framework for universal applications". It allows server-side rendering and server-side code in API routes.

Quasar is an open-source Vue.js based framework for building apps with a single codebase. It can be deployed on the web as a SPA, PWA or SSR app, as a mobile app using Cordova for iOS and Android, and as a desktop app using Electron for Mac, Windows and Linux.

Setup & Migration effort

Folder structure from Vue2 can probably be reused. Setting up a new project with Vue3 recommended defaults is possible with official Vue tools. Well documented migration path for switching from Vue2 to Vue3.

Setup probably easier than Vue alone because Nuxt comes with a lot of defaults/recommendations for folder structure, routing, state etc. The migration effort may be a little bit higher because the defaults & recommendations may differ from the current application.

Quasar brings its own CLI tooling. Therefore, the initial setup is easily done. Migration is probably harder, since Quasar uses its own UI framework and we might have to use that.

Dev Complexity

Freedom of choice for many project decisions. Allows flexibility when creating applications, but comes with the risk of making the wrong decisions or implementing features in a non-optimal way (e.g., project structure). If you are already familiar with Vue, there is no need to learn a new framework.

Added complexity because it’s not just JavaScript on the browser anymore, we have to think about code running on the server and on the client. API routes & middleware may be handy, but provide a second place to implement server-side functionality.

Quasar offers some functionality over plain Vuejs. Therefore, the complexity might be a little higher. On the other hand, everything comes out of one box, so there is less confusion to find answers to potential questions.

Dev Joy (awesome tooling)

New projects should use Vite, which integrates well with Vue (same author). Vue provides its own Browser DevTools and IDE support. With vue-tsc TypeScript typechecking is possible in SFCs. Vue is a well documented framework with a large community and many community tools.

Integrated tooling and clear structure do spark joy.

There is only one documentation to be familiar with. However, a potential vendor lock-in might reduce the dev experience.

Application Structure provided by the framework (Opinionated Architecture)

Vue does not restrict on how to structure your code, but enforces high-level principles and general recommendations.

Nuxt comes with a lot of defaults/recommendations for folder structure, routing, state etc. It is also easier to keep the app consistent with this structure in mind. We have to document fewer things ourselves when we follow the recommended structure.

Quasar offers a default structure and recommendations, but without implications for routing.

OIDC with client secret

Vue itself does not provide any authentication or OIDC mechanisms. You’d have to rely on external libraries and tools. Those tools probably cannot use a client secret as all code is delivered to the client.

Can work, probably with nuxt-oidc-auth or authjs.

Quasar offers no special functionality for authentication.

Stable, backed by big Community

According to the StackOverflow survey from 2024, Vue is currently the 8th most popular frontend framework (source). It has a large community and many sponsors. Since Vue3.js is the third iteration of Vue.js, it improved a lot over the years and has solved many previous problems.

According to stateofjs Nuxt is among the most used meta frameworks. They seem to have learned to provide good major update experience and try to make the next major update to version 4 as pleasant as possible.

Quasar is well-known and has some well-known sponsors.

License / free to use

MIT License

MIT License

MIT License

Server Resource Need

Even though Vue has support for SSR, its main focus is often on SPA. Therefore, depending on how exactly the frontend is implemented, the server resources may be lower than the resource need of Nuxt or Quasar.

More resources needed than for hosting an SPA. May need to be scaled individually in bigger setups.

Probably same as nuxt. Quasar is designed with performance in mind.

Administration Complexity

Nothing special

Can probably run just as well as the frontend in a docker compose setup, as long as it doesn't need to be scaled.

Nothing special

Experience in the DEV Team

Already developed the old frontend in Vue.js 2, Vite, composition API and script setup. Experience with some Vue.js 3 component development.

Played around with Nuxt a little bit. Previous experience with Next.js and modern JavaScript meta framework approaches.

Only known from documentation.

Gut Feeling

Nuxt integrates on a rather low level, gives us a structure we can follow and integrate into, is in broad use.

Further Notes

  • Vue also provides features of Nuxt and Quasar like: SSG pre-rendering, full page hydration and SPA client-side navigation (source). However, the documentation itself mentions that many of these features are "intentionally low-level", and that Nuxt and Quasar provide a "streamlined development experience".

  • The latest version of Vue (3.x) only supports browsers with native ES2015 support. This excludes IE11. Vue 3.x uses ES2015 features that cannot be polyfilled in legacy browsers, so if you need to support legacy browsers, you will need to use Vue 2.x instead.

  • Nuxt 4 is to be released soon and we could already opt into its new features, saving us a major release migration

  • Nuxt recently added server-side components that do not execute JS on the client side called NuxtIsland.

Quasar was born because I felt that a full featured framework to build responsive websites, PWAs (Progressive Web Apps), Mobile Apps (Android, iOS) and Electron apps simultaneously (using same code-base) was missing. So I quit my job to make this idea a reality. – Razvan Stoenescu, Mon 25th Oct 2015

Further resources:

Table 3. Comparison of possible UI libraries
Bootstrap 5 primevue vuetify Nuxt UI tailwind

Links:

Migration effort

high

high

high

high

high

Easy to use / Versatility

No wrapper library for Vue. StackOverflow suggests using Bootstrap components directly in Vue templates without a wrapper library. This is not the Vue way to get things working.

There are many components available and PrimeVue has a direct Vite and Nuxt integration. The tutorials imply that it is extremely easy to create a beautiful webpage. However, it is not so clear how far one can get without paying for pre-defined layouts and UI building blocks.

Vuetify seems to be extremely versatile and provides a lot of options and a comprehensive documentation.

There are quite some components available, theming and customization seems reasonable.

We would need to define our own UI library, so it’s probably too much effort.

Theming (setting colors, spacing & global style overrides)

Bootstrap has predefined themes that can be bought.

Has a styled and an unstyled mode for components. Styled mode utilizes pre-skinned components and a default primary color. Unstyled mode allows complete control over CSS properties for components or the integration of e.g. Tailwind CSS.

A custom theme including colors can be defined as described here. Additionally, global or component-specific overrides can be defined as described here.

NuxtUI enables setting colors and style overrides in the Nuxt config (see here).

Custom CSS Styling for Components

A lot of things can be customized via Sass and CSS.

See unstyled mode above.

A class attribute can be set on components. Vuetify uses Sass and has a lot of utility classes.

NuxtUI allows setting a class attribute on components to add classes, as well as setting a ui prop to define custom styles.

Effort to adapt to potential style guide (consult with UE)

Can opt-in for styled mode, meaning components come pre-skinned after the Aura style guide (can be changed).

Backed by large community / future proof

Bootstrap itself is still popular. However, the BootstrapVue library still struggles with vue3 and bootstrap 4. There are plans to support bootstrap 5, but they are delayed as of now: Roadmap

License

MIT License

MIT License for OpenSource parts (PrimeVue, core, icons, themes, nuxt-module, auto-import-resolver, metadata)

MIT License

MIT License (for Nuxt UI without Pro)

Free to use

Bootstrap is free to use. Predefined Themes can be bought.

Not all components are free to use. A single PrimeBlocks (UI building blocks) license costs $99 per developer; for small teams it is $349. It allows access to the Figma UI toolkit and Vue UI Blocks (UI building blocks). Single layout templates can be purchased on their own.

Yes

Not all components. There is a set of components only available in Nuxt UI Pro, especially for dashboards, Layouts etc. Nuxt UI Pro also contains templates.

figma or sketch UI kit available

There is a Bootstrap 5 UI kit including Bootstrap Icons

Not for free

There is a figma ui kit available for free here. There are additional UI kits available to buy here

There is a figma ui kit available here.

Gut Feeling

There are better alternatives available.

Vuetify is very popular and seems to support a lot of stuff and has extensive documentation.

Not known yet, may not be mature enough.

Further Notes

Vuetify also has a plugin for a quick integration into Nuxt, see here.

9.7. ADR-006 Removing Cpp client from repository

Date

26.08.2024

Status

Done

Context

The shepard backend code generates an OpenAPI specification document that represents the REST API with all possible requests, responses and parameters. Using this OpenAPI specification, we are able to generate clients that follow this definition. These clients are generated automatically and are able to communicate with the shepard REST API.

The clients are generated by an external tool called OpenAPI Generator. This tool allows generating clients in multiple programming languages.

Until now, we have supported, maintained and provided clients for Java, Python, TypeScript and Cpp. The reasoning for the choice of these clients is provided in Appendix A.

Decision

We decided to remove the Cpp client from the shepard repository. This takes effect immediately. The last valid Cpp client package is the one from the 2024.08.05 release, meaning that future releases no longer provide a working Cpp client.

This decision was made for two major reasons.

First, the general usage of the Cpp client is low, since it was introduced for a few specific use cases. This means the client is rarely used and has less importance than the other clients.

Second, the amount of work required to maintain the Cpp client has become too large. It is hard to maintain and easy to break.

We encountered problems with the client generation due to changes in the OpenAPI specification. For all clients, these specification changes resulted in breaking changes. For the other clients (Java, Python, TypeScript), these breaking changes can be documented and fixed by end users to keep a working version of the client. Even more importantly, building and compiling these clients is not affected by the OpenAPI changes: their behavior changed, but the clients themselves still work and can be built. The OpenAPI changes, however, have a different impact on the Cpp client: its compilation fails, which renders it useless for now.

The rest of this section provides a technical overview of the specific problems that occur when compiling the Cpp client. The main problem here is the implementation of enum types in the OpenAPI generator. The following snippet shows how older versions of the backend (pre-quarkus) generated enum types like this orderBy query parameter:

Previous OpenAPI Enum Declaration
paths:
  /examplepath:
    get:
      parameters:
        - name: orderBy
          in: query
          schema:
            enum:
              - createdAt
              - updatedAt
              - name
            type: string

In the OpenAPI specification that is generated by the new quarkus backend, most enum types have their own type and are defined like this:

Quarkus OpenAPI Enum Declaration
paths:
  /examplepath:
    get:
      parameters:
        - name: orderBy
          in: query
          schema:
            $ref: "#/components/schemas/DataObjectAttributes"
components:
  schemas:
    DataObjectAttributes:
      enum:
        - createdAt
        - updatedAt
        - name
      type: string

Even though these two OpenAPI specifications are semantically the same, building the Cpp client fails, because the OpenAPI generator does not implement certain methods, such as a parameterToString method, for this custom object. Previously this did not fail, since orderBy was declared directly as an enum type and did not use a proxy object, so the generator knew how to create the parameterToString method.

Fixing the client compilation by manually patching the client is possible. However, maintaining such a patch for every shepard release requires a considerable amount of work.

Possible Alternatives

End users who still want to use the Cpp client can come up with their own patches. There are basically three approaches:

  • Implement the missing C++ methods (e.g. parameterToString) for all custom enum objects. This requires a lot of effort and knowledge of C++. It also does not scale well with future changes of the OpenAPI specification.

  • Modify the client generation through custom template files. The OpenAPI generator allows modifying the generated C++ code so that the missing methods related to the enum types are generated; please refer to the official OpenAPI generator documentation. This scales better with future changes of the OpenAPI specification, but requires knowledge of the OpenAPI generator templating engine.

  • Modify the OpenAPI specification to declare enum types directly, without relying on proxy objects. For this you have to identify the affected enum types and replace their definitions, i.e. convert them from the Quarkus OpenAPI Enum Declaration back to the Previous OpenAPI Enum Declaration. After modifying the OpenAPI specification, the client needs to be regenerated. This requires medium effort and knowledge of the OpenAPI specification, and it could be automated with scripts (see the sketch after this list).
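
As a rough illustration of the third approach, the following TypeScript sketch inlines enum schemas that are referenced from operation parameters. It is a minimal sketch only: the file names (openapi.yaml, openapi-inlined.yaml) are assumptions, it relies on the js-yaml package, and it only handles parameter schemas, not enums referenced from request or response bodies.

Enum Inlining Script (Sketch)
// inline-enums.ts: minimal sketch of the third approach (file names are assumptions)
import * as fs from "node:fs";
import * as yaml from "js-yaml";

const spec = yaml.load(fs.readFileSync("openapi.yaml", "utf8")) as any;
const schemas = spec.components?.schemas ?? {};

// Replace a "#/components/schemas/X" reference with the schema itself, but only if X is a plain enum
function inlineIfEnum(schema: any): any {
  const ref: string | undefined = schema?.$ref;
  if (!ref || !ref.startsWith("#/components/schemas/")) return schema;
  const target = schemas[ref.split("/").pop() as string];
  return target?.enum ? { type: target.type, enum: target.enum } : schema;
}

// Walk all operations and rewrite their parameter schemas
for (const pathItem of Object.values<any>(spec.paths ?? {})) {
  for (const operation of Object.values<any>(pathItem ?? {})) {
    for (const parameter of operation?.parameters ?? []) {
      if (parameter?.schema) parameter.schema = inlineIfEnum(parameter.schema);
    }
  }
}

fs.writeFileSync("openapi-inlined.yaml", yaml.dump(spec));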

Consequences

  • The Cpp client is no longer supported by this repository.

  • The Cpp client no longer gets published in the package registry.

  • The latest Cpp client in the package registry is the one from the 2024.08.05 release. Even if it is successfully patched so that compilation works again, it could stop working with future releases if the API changes.

Appendix A: Generated Clients

  • Python (OpenAPI Generator: Python): Python is a well-known and accepted programming language in the scientific community, and many researchers have experience with it. Furthermore, the Python ecosystem and its community are well established and provide many resources for learning Python. Additionally, Python allows fast prototyping, since it is an interpreted language and typing is optional.

  • TypeScript (OpenAPI Generator: typescript-fetch): TypeScript is a typed superset of the JavaScript programming language whose main purpose is building web-oriented applications. In the shepard project, the TypeScript client is used as a library in the shepard frontend, implementing the functions and types needed to communicate with the backend REST API. This saves time and effort when developing the shepard frontend, since every change of the REST API is automatically reflected in this client through the OpenAPI specification.

  • Java (OpenAPI Generator: Java): Can be seen as an alternative to Python. It is widely adopted and used, many people have experience with Java, it has a large ecosystem, and it is a good fit for standalone applications.

  • Cpp/C++ (OpenAPI Generator: cpp-restsdk): Generally, C++ enables creating performant standalone applications. In this project, the Cpp client was used for some specific use cases and was not selected based on other general factors.

9.8. ADR-007 Client Generation for Frontend

Date

30.08.2024

Status

Done

Context

Until now, the frontend used the published TypeScript client to interact with the backend. With the change to the monorepo, we could in principle make atomic changes to the backend and frontend in a single commit. Currently this is not possible, however, because the frontend depends on the updated client package being published first.

In order to mitigate this, we want to make the typescript client used by the frontend more dynamic.

For the generation of the client either Docker, Podman or Java is required.

Possible Alternatives

Provide a script to (re)generate the client based on a local or remote backend instance and

  1. do not put it under version control.

    • everybody would need to be able to generate the clients (that means either docker, podman or Java needs to be installed)

    • we have no way to make sure everybody is synchronized with the backend

    • the frontend built in the pipeline will always use the correct client but may throw errors if it was not adapted properly

  2. put it under version control.

    • the client would need to be regenerated in case the backend changes

    • pure frontend developers do not need to have Java or Docker installed

    • we can add a pipeline job checking if the generated client and the backend are in sync

Decision

We decide to go with option 2 and put the generated client under version control.

Consequences

  • The client needs to be added to the repo and the imports in the frontend need to be adapted (see the sketch below)

  • A pipeline job checking that the generated client and the backend are in sync needs to be added
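
For illustration, the following sketch shows how the frontend might consume the version-controlled, generated typescript-fetch client. The import path and the CollectionApi class name are assumptions and not the actual shepard package layout.

Generated Client Usage (Sketch)
// Sketch only: "./generated/shepard-client" and "CollectionApi" are hypothetical names
import { Configuration, CollectionApi } from "./generated/shepard-client";

// typescript-fetch clients are configured with a base path and, if needed, an access token
const configuration = new Configuration({ basePath: "https://shepard.example.org/api" });
const collectionApi = new CollectionApi(configuration);

// Because the client lives in the repository, backend changes regenerate these types
// and the TypeScript compiler flags frontend code that was not adapted accordingly.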

9.9. ADR-008 Database Target Architecture

Date

17.09.2024

Status

Done

Context

Current state

At the moment shepard uses three different databases:

  • Neo4j (graph db)

  • MongoDB (document db)

  • InfluxDB (timeseries db)

What was the reason for choosing different databases?

  • In the very beginning, data was stored directly in the databases (InfluxDB, Neo4j and MongoDB); there was no domain model, just the data

  • In a second step, the backend was created, along with the REST API and the domain model

  • Special features of the timeseries database are already in use (min, max, sum, etc.)

  • From a user perspective it feels easier to navigate through a graph database than through a relational database

Known issues

  • We have to use three different database query languages

  • Maintenance of three different databases and their libraries

  • For backup you have to consider all three databases

  • Issues with Neo4j

    • It is unclear when to load relationships together with data objects, how many to load, and how this influences performance. You have to know how the OGM works.

    • We had some issues with caching that we do not fully understand.

    • Lack of a large ecosystem (e.g. only one migration library is available, and it is a private one)

  • Issues with InfluxDB

    • We are currently using InfluxDB v1.8.

    • Newer versions of InfluxDB are completely different (a completely new query language, etc.)

    • There is a bad feeling about a shift towards paid services.

    • The library that we use to communicate with InfluxDB lacks some important features, such as query injection prevention.

  • Issues with MongoDB

    • The update process needs manual steps

Possible Solutions

  1. We leave it as it is

  2. Neo4j + MongoDB

  3. Postgres only (replace all database technologies with Postgres)

  4. Neo4j + MongoDB + Postgres (replace InfluxDB with Postgres/TimescaleDB)

  5. Postgres + MongoDB (replace InfluxDB and Neo4j)

  6. Neo4j + Postgres (replace InfluxDB and MongoDB with Postgres)

possible database architectures

Decisions

Decision 1: Leave it as it is

This is not an option because of known issues with InfluxDB. We have to find a solution at least for that database.

Decision 2: Meta Data in Neo4j or Postgres

  • Migration effort: none for Neo4j; big for Postgres

  • Onboarding of new developers: rather big for Neo4j; rather small for Postgres

  • Familiarity in the team: the team is familiar with Neo4j, but not with Postgres

  • Ecosystem: Neo4j's ecosystem is not big; Postgres has a huge ecosystem and has been in frequent use for a long time

  • Maintenance effort: big for Neo4j, as we will have additional databases for data storage; small to medium for Postgres, if we use Postgres for all data persistence

  • Performance: comparable for both if properly used

On a green field, Postgres might be the better option, with less maintenance effort and its big ecosystem. In the context of shepard, however, we already have Neo4j, we would need to migrate data, and the team has more experience with Neo4j. All in all, we decide to continue with Neo4j.

Decision 3: Database for Timeseries & Spatial Data

  • Migration complexity: rather easy for MongoDB, as MongoDB is already there and we only have to migrate timeseries data; medium migration effort for Postgres

  • Performance: MongoDB is probably worse than Postgres for timeseries and spatial data

  • Support for spatial data: MongoDB only supports 2D spatial data, with no trivial and performant way to support 3D

  • Summary: MongoDB is not an option due to performance and spatial data

As MongoDB does not seem to perform well for timeseries and spatial data, we decide to store timeseries (and, in the future, spatial data) in Postgres with TimescaleDB and PostGIS.

Decision 4: Database for Files & Structured Data

The compared options are Postgres for structured data plus blob storage with MinIO (option 5), keeping MongoDB (option 4), and Postgres only (option 6).

  • Migration complexity: with options 5 and 6, all structured data and files have to be migrated; option 4 requires no migration effort

  • Onboarding of new developers: with option 6, new developers only have to know how to interact with two databases

  • Maintenance effort (updates of databases & clients): with options 4 and 5 we still have three databases to maintain; with option 6 we only have to maintain one database in addition to Neo4j

  • Reliability: MongoDB has not been touched in months, so it reliably does its job (option 4); options 5 and 6 are probably also very stable

  • Performance (file sizes around 5 to 8 GB): good for options 4 and 5; unknown for option 6

  • Summary: option 5 is not better than MongoDB, so it is not feasible given the migration effort

Postgres supports two ways of storing binary data (link): a bytea column and the LargeObject API.

For large files we would have to use the LargeObject API, but in both cases the data is stored in a single table, and Postgres has a per-table size limit of 32 TB. If we want to store multiple projects in one shepard instance, we might exceed this limit; with files of 5 to 8 GB, roughly 4,000 to 6,500 files would already reach 32 TB. So we are not able to store large objects in Postgres. The decision is to stay with MongoDB for files and structured data.

Consequences

  • We still have to support three different databases.

  • Complexity and maintenance costs are higher than with a single database, but no higher than they are now.

  • The same applies to backing up three databases.

9.10. ADR-009 Nuxt OIDC Library

Date

17.09.2024

Status

Done

Context

We want to implement authentication using OIDC in the new Nuxt-based frontend. We expect to authenticate with an existing Keycloak instance, similar to the old frontend.

In the future authentication with client secrets may be needed to operate with more OIDC providers.

For Nuxt 2 there was an auth module that is not yet available for Nuxt 3.

Possible Alternatives

  • authjs-nuxt

    • Provides integration with Auth.js for handling OAuth, OAuth2, OIDC, and custom authentication providers

    • ca. 3000 weekly downloads on npm, 250 GitHub stars

    • Last update a year ago

    • Not yet released with a 1.x.x version

  • nuxt-oidc-auth

    • Specifically for OpenID Connect (OIDC) authentication, focusing on OIDC-compliant providers like Keycloak, Auth0, etc.

    • ca. 600 weekly downloads on npm, 68 GitHub stars

    • Proper documentation as nuxt module here: https://nuxt.com/modules/nuxt-oidc-auth

    • Not yet released with a 1.x.x version

  • nuxt-auth-utils

    • Utility module for handling various auth strategies including JWT, session-based, OAuth/OIDC

    • ca. 5600 weekly downloads on npm, 805 GitHub stars

    • Last update very recent

    • Not yet released with a 1.x.x version

  • @sidebase/nuxt-auth

    • Full-featured authentication solution based on Auth.js, providing many built-in providers (Google, GitHub, Keycloak, etc.)

    • ca. 22000 weekly downloads on npm, 1200 GitHub stars

    • Uses authjs under the hood

    • Last update very recent

    • Not yet released with a 1.x.x version

Decision

We decide to go with @sidebase/nuxt-auth because of its support for multiple built-in providers, including Keycloak, and its superior documentation and community support.
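
A minimal sketch of how the module could be enabled, assuming the authjs provider mode; exact option names may differ between module versions, and the Keycloak provider itself would be configured in a server-side auth handler. AUTH_ORIGIN and NUXT_AUTH_SECRET are supplied via environment variables (see the consequences below).

nuxt.config.ts (Sketch)
// Sketch only, assuming the authjs provider mode of @sidebase/nuxt-auth
export default defineNuxtConfig({
  modules: ["@sidebase/nuxt-auth"],
  auth: {
    // the Keycloak provider (client id, issuer, ...) is registered in a server-side auth handler
    provider: { type: "authjs" },
  },
});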

Consequences

  • We have to gain knowledge about this module and how to efficiently use it.

  • We have to add two additional environment variables (AUTH_ORIGIN and NUXT_AUTH_SECRET), introducing a breaking change.

9.11. ADR-010 Postgres/Timescaledb Image

Date

07.10.2024

Status

Done

Context

We need to deploy a postgres image with the timescaledb plugin.

Possible Alternatives

  • Deploy postgres and install timescaledb manually following this guide

    • the manual effort is not feasible for administrators

  • Deploy postgres and add a script installing timescaledb automatically following the steps from the guide above

    • feels like a rather hacky and experimental solution

    • we are not sure if we ever need to adapt the script

  • Deploy the image provided by timescale (see here)

    • the image is bound to timescale

    • we need to think about migration in case we need to add additional plugins to postgres

Decision

We decide to use the Timescale Docker image for simplicity.

Consequences

We may need to adapt the setup in the future in case we need additional plugins.

9.12. ADR-011 Timescale database schema

Date

30.10.2024

Status

Done

Context

1. API Design

The current API was designed with InfluxDB in mind and some technical aspects like InfluxPoint and SingleValuedUnaryFunction made it into data types.

Choices

We have to decide if we want to create a stable replacement for the existing API or if we create a second one that might differ.

2. How to persist metadata (location, device, symbolic name, etc.)

InfluxDB stores metadata as tags with every timeseries in one column. We have to store this metadata in TimescaleDB as well.

Choices

  • Store metadata in a separate table along with the containerId and field

    • When adding data, it could get complex to identify whether the desired timeseries already exists, as the API does not contain an ID for timeseries. We would have to look through the timeseries table and check whether this combination of measurement, device, location, symbolicName and field is already present.

  • Store metadata in one table with the timeseries payload entries.

    • This leads to redundant data as every measuring point value in a timeseries shares the same metadata.

  • Use one column and store metadata in a JSONB format.

3. How to store the measuring point value (data type)

InfluxDB could store an Object as a field value, which could contain multiple types (Boolean, String, Integer, Double). TimescaleDB only supports types known to Postgres, so there is no Object type support. In InfluxDB, the field type per field and measurement is fixed and stored internally.

Choices

  • Have a dedicated column for the value and a separate column holding the type of the data point.

  • Use JSONB for the value, which is type-aware.

  • Use different columns for different types (one column per type).

  • Use different tables for each type and link it to the value table (one table per type).

  • Use one column for the value without persisting information about the type

4. OR Mapper or Handwritten Queries

We need to decide if we want to use an OR mapper or if we have to handwrite our queries.

Choices

  • OR Mapper (Panache, Hibernate)

    • How does it work with timescaledb?

    • Entities need a primary key column, which is not necessary for data points because they are not entities.

  • Handwritten Queries

    • Since we need to reproduce the old API and the functions previously offered by InfluxDB, we may need to manually write some of the queries anyway.

Decisions

  1. We try to keep the API as stable as possible to minimize the adaptation effort for end users. Nevertheless, some changes are expected, because we agreed on removing InfluxDB-specific things like data types. We have to communicate those changes to the community.

  2. We use two tables to persist the data: timeseries and timeseries_data_points (see the sketch after this list).

    • timeseries contains the Tags metadata, containerId and measurement.

    • timeseries_data_points contains the timeseries data with their timestamps and the value itself.

    • This approach improves write performance, since there is no metadata duplication.

    • When fetching the data, we gain performance by using only one index in the timeseries_data_points table

  3. We use a separate column for each type of data in the timeseries_data_points table

    • The possible types are String, Integer, Double and Boolean.

    • This was decided so that we can use the aggregate functions (MAX, MIN, COUNT, …) on Integer and Double values.

  4. We use the ORM when possible and handwritten queries if not

    • When fetching the payload, we need to handwrite the query to be able to easily execute the aggregate functions.
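
For illustration only, the resulting two-table layout could look roughly like the following TypeScript types; the column names are assumptions and not the actual shepard schema.

Timeseries Schema (Illustrative Sketch)
// Illustrative sketch only: table and column names are assumptions, not the actual schema.
// One row per timeseries (metadata), many rows per timeseries in the data point table.
interface TimeseriesRow {
  id: number;           // primary key referenced by the data points
  containerId: number;  // the timeseries container this series belongs to
  measurement: string;
  device: string;
  location: string;
  symbolicName: string;
  field: string;
}

interface TimeseriesDataPointRow {
  timeseriesId: number; // foreign key to TimeseriesRow.id
  timestamp: Date;
  // exactly one of the following columns is set, depending on the type of the value
  stringValue?: string;
  integerValue?: number;
  doubleValue?: number;
  booleanValue?: boolean;
}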

9.13. ADR-012 Lab Journal feature

Date

21.11.2024

Status

Done

Context

1. How to persist the Lab Journal content

We need to implement a solution for storing content of the Lab Journal feature in shepard. This feature will allow users to create, store, and manage lab journal entries that will be related to DataObjects. There are several options for how to persist the content of the lab journal and integrate it with the current shepard state. The main considerations include ease of implementation, maintainability, performance, and how well the solution integrates with our existing domain model and infrastructure.

1.1. Choices

1.2. 1. Use Existing REST API and Implement Logic in Frontend

Advantages:

No changes in the backend necessary.

Disadvantages:

  • Business logic is implemented in the frontend.

  • Mixing DataObjects for experimental data with DataObjects for lab journal entries.

  • Lab Journal entries would appear in the Treeview, causing clutter.

  • Filtering out lab journal entries from DataObjects must be done by users of the REST API, which destroys pagination.

  • Using StructuredData containers would result in one container for all lab journal entries, not related to individual collections.

1.3. 2. Create New REST APIs for Lab Journal Using Existing Domain Model

Advantages:

  • No changes in the existing domain model.

  • Business logic is stored in the backend.

  • Simplifies use cases like filtering, searching, and pagination with new REST API.

  • Can utilize pagination and filtering of the existing REST API for containers.

Disadvantages:

  • Pollutes the data objects tree view if implemented with DataObjects.

  • Only one global container for all lab journal entries if implemented with Structured Data containers.

  • Permissions are made for the global container, not for each collection separately.

  • Filtering requires accessing properties within the JSON document.

1.4. 3. Store content in Neo4j using an independent service with new REST API

Advantages:

  • Relationship between DataObject and Lab Journal entry can use database references.

  • Core domain model is located in the neo4j database.

  • More control and freedom in designing data model classes.

  • Clear and clean separate endpoint for the special feature.

  • Possibility to store images from the description in a file container or elsewhere.

Disadvantages:

More implementation effort for filtering, pagination, new model, permissions, etc.

1.5. 4. Create Independent Service with New REST API Stored in Postgres

Advantages:

  • Completely isolated from the rest of the application (microservice approach).

  • Easily testable and maintainable.

  • More control and freedom in designing data model classes.

  • Clear and clean separate endpoint for the special feature.

  • Possibility to store images from the description in a file container or elsewhere.

Disadvantages:

  • More implementation effort for filtering, pagination, new model, permissions, etc.

  • Lose the direct database reference for connecting the DataObjects and lab journals.

  • Separate storage from the related business domain objects.

1.6. 5. Extend Existing Data Object Model

Advantages:

  • No changes to the existing endpoints.

  • Lab journals are included automatically in the data object.

  • Easy to implement.

Disadvantages:

  • Hard to maintain field meta info like created-by, created-at, etc.

  • Loading the needed lab journals could be complex and resource-intensive at the collection level.

  • Could be overcome by creating an endpoint to get all lab journals and apply the needed filter.

Decision

We have decided to implement Option 3: store the content in Neo4j using an independent service with a new REST API (a possible entry shape is sketched after the following list).

  • Neo4j allows us to leverage database references to establish relationships between Collections, DataObjects and Lab Journal entries.

  • It provides a clear and clean separate endpoint for the Lab Journal feature, making it easier to manage and maintain.

  • This decision aligns with our goal of maintaining a clean and organized domain model while providing the necessary functionality for the Lab Journal feature.
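
For illustration only, a lab journal entry exposed by the new REST API might have a shape roughly like the following; all field names are assumptions and not the implemented contract.

Lab Journal Entry (Hypothetical Sketch)
// Hypothetical response type: field names are assumptions, not the actual shepard API
interface LabJournalEntry {
  id: number;
  collectionId: number;   // the collection whose lab journal this entry belongs to
  dataObjectId?: number;  // optional reference to a related DataObject
  content: string;        // rich text content produced by the editor (see ADR-013)
  createdBy: string;
  createdAt: string;      // ISO 8601 timestamp
}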

Consequences

  • We will need to invest more effort in implementing filtering, pagination, new model, permissions, etc.

9.14. ADR-013 Editor Library

Date

26.11.2024

Status

Done

Context

For the lab journal feature, we need an editor in the new frontend. In the old frontend, tiptap was used to edit & render descriptions of collections and data objects. Editing lab journals needs more features than the descriptions in the old frontend, e.g. tables & images. In the new frontend, we want one editor for both descriptions and lab journal entries for consistency.

Possible Alternatives

The following editors were compared: editor.js, tiptap, lexical, CKEditor, TinyMCE and quill.

General Information

  • GitHub stars: editor.js 29k, tiptap 27k, lexical 20k, CKEditor 9.6k, TinyMCE 15k, quill 43k

  • npm downloads: editor.js 110k, tiptap 1200k, lexical 667k, CKEditor 800k, TinyMCE 656k, quill 1600k

  • Good look & feel: yes for editor.js, tiptap, CKEditor and TinyMCE; the lexical demo looks great, but is React-based; quill is very basic

  • Standard text editor: editor.js and lexical are block based; tiptap, CKEditor, TinyMCE and quill are standard WYSIWYG editors

  • Documentation quality: editor.js has great docs, tiptap has good docs; lexical's docs are not great, with only few examples

  • Formatting bar on top possible: editor.js: not easily, even with manual effort (see here); tiptap does not provide a toolbar on its own, it has to be implemented and designed by ourselves (see the custom-menus example or this one); lexical: not easily; CKEditor and TinyMCE: yes

  • Lists & tables: yes for editor.js, tiptap, lexical, CKEditor and TinyMCE

  • Images in text: yes for editor.js, tiptap, CKEditor and TinyMCE; for lexical, the playground supports it (with a custom plugin), but the documentation does not make clear how image inclusion and file upload work, the playground is written in React, and the Vue plugin does not have the image features

  • Image upload per drag & drop: editor.js: yes; tiptap: manually possible following this; lexical: the playground supports it (with a custom plugin), but the documentation does not make clear how image inclusion and file upload work

  • Data validation possible (data can be validated at the API to make sure the frontend can render it): editor.js: yes, the block format allows for easy data validation and a clear API documentable in the OpenAPI spec; tiptap, lexical and CKEditor: kind of, they emit HTML; TinyMCE: no, since TinyMCE is very cloud focused and expects to talk to its own backend

  • Vue compatible: yes for editor.js, tiptap, lexical, CKEditor and TinyMCE

  • Active development: editor.js: yes and no (no response in the toolbar feature request for more than a year); yes for tiptap, lexical, CKEditor, TinyMCE and quill

  • Open source, ideally without a freemium model: yes for editor.js, lexical and quill; tiptap: yes, but with freemium parts (that we do not need); CKEditor: yes, but with freemium parts; TinyMCE: yes, but with freemium and cloud parts

  • Not blocking a potential export: yes for editor.js, tiptap, lexical and CKEditor; for TinyMCE it is not clear how to export the data so we can save it in the database ourselves

  • Not blocking a potential full text search: yes for all candidates

Decision

We decide to go with tiptap (see the sketch after this list) because

  • we save migration effort for old descriptions,

  • it is possible with reasonable effort to add a toolbar to our liking,

  • it is possible to implement an image upload functionality tailored to shepard,

  • CKEditor & TinyMCE seemed quite commercial,

  • lexical did not have great documentation & Vue support,

  • quill does not have production-ready Vue support and

  • editor.js is missing a way to have the toolbar we need without a hacky solution.
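
A minimal sketch of how tiptap could be embedded in a Vue 3 component; the content and component name are placeholders, and tables, images and the custom toolbar would be added as further extensions and components.

Tiptap Integration (Sketch)
<!-- MinimalEditor.vue: a minimal sketch, not the actual shepard component -->
<script setup lang="ts">
import { useEditor, EditorContent } from "@tiptap/vue-3";
import StarterKit from "@tiptap/starter-kit";

// StarterKit bundles common extensions (paragraphs, headings, lists, bold, italic, ...)
const editor = useEditor({
  content: "<p>First lab journal entry</p>",
  extensions: [StarterKit],
});
</script>

<template>
  <EditorContent :editor="editor" />
</template>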

Consequences

  • We can migrate the old description mechanism to the new frontend almost as is

  • We need to think about data validation for lab journal entries

10. Quality Requirements

10.1. Quality Tree

10.2. Quality Scenarios

11. Risks and Technical Debts

11.1. Risks

Probability: very improbable, improbable, unsure, probable, very probable

Costs: critical, expensive, unsure, cheap, negligible

ID Name Description Possible Actions Probability Costs Priority

11.2. Technical Debt

Software systems are prone to the build up of cruft - deficiencies in internal quality that make it harder than it would ideally be to modify and extend the system further. Technical Debt is a metaphor, coined by Ward Cunningham, that frames how to think about dealing with this cruft […​] - 21 May 2019, Martin Fowler (https://martinfowler.com/bliki/TechnicalDebt.html)

11.2.1. Technical Debt Process

The following process handles all technical debt except for dependency updates, which have their own process. Besides dependency updates, every other technical debt is to be documented in the form of an issue in the backlog.

If a technical debt has been identified and a corresponding issue has been created in the backlog, the usual planning and prioritization process takes care of this debt. This makes the backlog the single source of truth for all known and unresolved technical debt.

Usually, the technical debt can be resolved in this way. In rare cases, it can happen that we want to keep this debt or decide that the debt is not really a problem for us. In these cases, the situation needs to be described in the table below and the corresponding issue can then be closed.

ID 1: Missing permissions (Priority: Low)

Description: Back when we started developing shepard, no authorization was implemented. Therefore, not all Collections or Containers are guaranteed to have permissions attached. There is a fallback solution implemented in shepard to take care of such situations: we decided that no permissions means everyone has access.

Solution strategy: A database migration could be implemented to add empty permissions to all entities. However, we should not make any assumptions about the actual permissions to avoid breaking existing datasets.

ID 2: Cryptic usernames (Priority: High)

Description: Depending on how the OIDC identity provider is configured, shepard sometimes generates very cryptic user names. The username is retrieved from the subject field of the JWT. While the subject is guaranteed to be unique, some identity providers generate a UUID, which makes the resulting username not very user-friendly. Keycloak uses subjects in the form f:173df088-e611-4535-827a-feb57457a5a6:haas_tb, where the last part is the actual username as it appears in the active directory. Therefore it seemed to be a good idea to use only the last part as the username in shepard (illustrated in the snippet below). However, this logic breaks as soon as another identity provider is used.

Solution strategy: Since the username is used to identify users in shepard, it is not easy to change it. A migration would be possible if shepard could fetch all available users from the identity provider and then migrate all users at once. However, this is not possible with the current configuration. Keycloak adds a preferred_username field to the JWTs, but this is an addition that comes from Keycloak and is not specified by OIDC.
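
Illustrative only: the current derivation effectively keeps the last colon-separated part of the subject, which works for Keycloak but degrades to the full UUID for providers that use plain UUID subjects. The snippet below is a sketch of that behaviour, not the actual backend code (which is Java), and the second subject value is a made-up example.

Username Derivation (Illustrative Sketch)
// Illustrative sketch of the behaviour described above, not the actual backend code
const keycloakSubject = "f:173df088-e611-4535-827a-feb57457a5a6:haas_tb";
const plainUuidSubject = "9c5f8e7a-0b1d-4c2e-8f3a-123456789abc"; // hypothetical subject from another provider

const toUsername = (subject: string): string => subject.split(":").pop() ?? subject;

toUsername(keycloakSubject);  // "haas_tb"
toUsername(plainUuidSubject); // the whole UUID, i.e. a cryptic username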

12. Glossary

Each entry lists the English term, the German term (where given) and a definition.

  • AFP / AFP: Automated Fiber Placement

  • Collection / Collection: A collection consists of multiple Data Objects.

  • Container: Containers allow users to store data. There are different types of containers, e.g. TimeseriesContainer, StructuredDataContainer and FileContainer.

  • Context / Context: The context defines which Data Objects belong together and are related to an experiment.

  • Data Management Plan / Datenmanagementplan

  • Data Object / Data Object: Represents one piece of information. A DataObject belongs to exactly one Collection. DataObjects can have multiple attributes describing the information. A DataObject may have predecessors, successors or children; only one parent is allowed.

  • End effector / Endeffektor: The device at the end of a robotic arm.

  • Entities: Entities are used to manage connections between payloads. It is an abstract term; the concrete instances are Collections and DataObjects.

  • Experiment / Experiment: An Experiment is a period in time during which data is collected and stored for further investigation.

  • FAIR data principles / FAIR Prinzipien: FAIR data are data which meet the principles of findability, accessibility, interoperability and reusability. The FAIR principles emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate with, and reuse data with no or minimal human intervention), because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity and creation speed of data.

  • NDT (non-destructive testing) / NDT: Tests, with ultrasound for example, that do not destroy the component.

  • Ply / Schicht: One layer of tapes that lie next to each other.

  • Reference / Reference: A reference connects a Data Object to a concrete value type like documents, URLs, timeseries, etc. A Data Object can have multiple References.

  • Shepard / Shepard: Acronym for "Storage for HEterogeneous Product And Research Data".

  • Organizational Element / Organisationselement: Describes a group of elements that help organizing and structuring the uploaded data. These elements are Collections, Data Objects and References.