NCBI Biosample DuckDB Files on NERSC Portal

Portal URL: https://portal.nersc.gov/project/m3408/biosamples_duckdb/

NERSC Filesystem Path: /global/cfs/cdirs/m3408/www/biosamples_duckdb

Current Schema and 2025-01-08 Version

The latest DuckDB file (2025-01-08) was built using the `make duck-all` target in the following repository:

NCBI Metadata Makefile

The schema is subject to revision as improvements and refinements are made based on NMDC project needs.

Pipeline Overview

The process of converting NCBI Biosample XML data into DuckDB format consists of several key steps:

1. Downloading the XML Data
   - The latest biosample_set.xml.gz is retrieved from NCBI's FTP server.
2. Loading into MongoDB
   - The XML data is parsed and stored in a MongoDB collection (biosamples).
3. Extracting Relational Data
   - A Python script extracts specific paths from MongoDB, focusing on key entities like attributes, curation details, descriptions, IDs, and links.
   - The script converts nested JSON-like structures into structured tables.
4. Inserting into DuckDB
   - The extracted data is written into DuckDB tables.
   - Column types are inferred dynamically.
   - New paths and fields are added as needed.

This pipeline is orchestrated using Makefiles and Python scripts, with batch processing optimizations to handle large datasets efficiently.
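
As a rough illustration of the batching idea, the sketch below streams `<BioSample>` elements out of a gzip-compressed XML stream and yields them in fixed-size batches, so that the whole file never has to sit in memory. It is not the pipeline's actual code: the function names, the batch size, and the three-record inline sample are assumptions made for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from itertools import islice


def iter_biosamples(stream):
    """Yield one dict of XML attributes per <BioSample> element, streaming."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "BioSample":
            yield dict(elem.attrib)
            elem.clear()  # release the element's children and text


def iter_biosample_batches(stream, batch_size=1000):
    """Group the streamed records into lists of at most batch_size items."""
    records = iter_biosamples(stream)
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical three-record sample standing in for biosample_set.xml.gz.
SAMPLE_XML = (
    b"<BioSampleSet>"
    b'<BioSample id="1" accession="SAMN00000001"/>'
    b'<BioSample id="2" accession="SAMN00000002"/>'
    b'<BioSample id="3" accession="SAMN00000003"/>'
    b"</BioSampleSet>"
)
compressed = io.BytesIO(gzip.compress(SAMPLE_XML))

with gzip.open(compressed, "rb") as stream:
    batches = list(iter_biosample_batches(stream, batch_size=2))

print([len(b) for b in batches])  # [2, 1]
```

In a real run, each batch would be handed to a bulk insert rather than collected into a list.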

Additional post-processing steps may include annotation enrichment, schema adjustments, and metadata documentation.

Extracted Paths

The following paths are extracted from the NCBI BioSamples XML data and stored as tables in DuckDB:

BioSample
BioSample.Attributes.Attribute
BioSample.Curation
BioSample.Description.Comment.Paragraph
BioSample.Description.Organism
BioSample.Description.Organism.OrganismName
BioSample.Description.Synonym
BioSample.Description.Title
BioSample.Ids.Id
BioSample.Links.Link
BioSample.Models.Model
BioSample.Owner.Name
BioSample.Package
BioSample.Status
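
To make the dotted-path notation concrete, here is a minimal sketch of how a path such as `BioSample.Attributes.Attribute` could be resolved against one nested, JSON-like document and turned into rows. The helper `extract_path`, the sample document, and its field layout are hypothetical; the actual extraction script may work differently.

```python
def extract_path(doc, path):
    """Return every value found at a dotted path, descending through lists."""
    current = [doc]
    for key in path.split("."):
        next_level = []
        for node in current:
            candidates = node if isinstance(node, list) else [node]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    next_level.append(item[key])
        current = next_level
    # Flatten a final list level so that each repeated element becomes a row.
    rows = []
    for value in current:
        rows.extend(value if isinstance(value, list) else [value])
    return rows


# Hypothetical document shaped like a parsed BioSample record.
doc = {
    "BioSample": {
        "id": "4585963",
        "Attributes": {
            "Attribute": [
                {"attribute_name": "env_material", "content": "dust"},
                {"attribute_name": "material", "content": "drywall"},
            ]
        },
    }
}

sample_id = doc["BioSample"]["id"]
# Carry the sample id onto each row so rows can be linked back to the sample.
rows = [
    {"id": sample_id, **row}
    for row in extract_path(doc, "BioSample.Attributes.Attribute")
]
for row in rows:
    print(row)
```

A path that is absent from a document, such as `BioSample.Status` in this sample, simply yields no rows.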

Omitted Paths

Some paths present in the XML data are not included in the DuckDB database, such as:

/BioSampleSet/BioSample/Owner/Contacts/Contact
/BioSampleSet/BioSample/Description/Comment/Table
/BioSampleSet/BioSample/Curation

These omissions might be due to complexity, relevance, or data volume considerations.

Entity Relationships for Extracted CURIEs

The process of extracting CURIEs from NCBI Biosamples involves multiple relational tables. The following diagram illustrates how asserted CURIEs (regex-based extraction) and NER-extracted CURIEs (Named Entity Recognition) are structured within the database.

Entity Relationship Diagram:

Extracted CURIEs Schema Overview

This structure enables querying for both asserted and NER-extracted CURIEs while maintaining links to the original attribute data.

- ATTRIBUTE contains the original metadata from NCBI Biosamples, including content, attribute_name, harmonized_name, display_name, and unit.
- CONTEXTS_TO_NORMALIZED_STRINGS maps attributes to their normalized representations.
- NORMALIZED_CONTEXT_STRINGS ensures that CURIEs and extracted terms are consistently linked.
- CURIES_ASSERTED stores CURIEs extracted via regular expressions from text.
- CURIES_NER stores CURIEs extracted via Named Entity Recognition (NER), with additional filters applied (is_longest_match = TRUE and subsumed = FALSE).
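
The regex-based ("asserted") extraction can be pictured with a small sketch. The pattern below is an assumption, not the pipeline's actual expression: it treats a CURIE as a prefix that starts with a letter, a colon, and a numeric local identifier, and it upper-cases the prefix as one possible normalization.

```python
import re

# Assumed CURIE shape: letter-initial prefix, colon, digits (e.g. envo:123456).
CURIE_PATTERN = re.compile(r"\b([A-Za-z][A-Za-z0-9_]*):(\d+)\b")


def extract_asserted_curies(text):
    """Return the CURIE-like tokens found in text, with upper-cased prefixes."""
    return [
        f"{prefix.upper()}:{local_id}"
        for prefix, local_id in CURIE_PATTERN.findall(text)
    ]


print(extract_asserted_curies("soil [envo:123456]"))  # ['ENVO:123456']
print(extract_asserted_curies("urban biome"))         # []
```

Text without any CURIE-like token, such as "urban biome", would have to go through the NER route instead.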

The coverage sum calculation is currently used to rank NER-extracted CURIEs, but the methodology has limitations:

Assumption of One Row Per (`attribute.id`, `harmonized_name`)

The current ranking logic assumes a single row per attribute.id + harmonized_name pair, but real data can contain multiple rows for the same pair.
For example, sample 4585963 has two env_medium values:

| content | attribute_name | id | harmonized_name | display_name |
|---|---|---|---|---|
| urban biome | env biome | 4585963 | env_broad_scale | broad-scale environmental context |
| university campus | env feature | 4585963 | env_local_scale | local-scale environmental context |
| dust | env_material | 4585963 | env_medium | environmental medium |
| drywall | material | 4585963 | env_medium | environmental medium |
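
A quick way to find such cases is to group rows by the (id, harmonized_name) pair and keep the pairs that occur more than once. The sketch below does this over the four example rows above; the function name and the in-memory row format are illustrative assumptions, not part of the pipeline.

```python
from collections import defaultdict

# The four example rows for sample 4585963, reduced to the relevant columns.
rows = [
    {"id": 4585963, "harmonized_name": "env_broad_scale", "content": "urban biome"},
    {"id": 4585963, "harmonized_name": "env_local_scale", "content": "university campus"},
    {"id": 4585963, "harmonized_name": "env_medium", "content": "dust"},
    {"id": 4585963, "harmonized_name": "env_medium", "content": "drywall"},
]


def find_repeated_keys(rows):
    """Map each (id, harmonized_name) pair with more than one row to its contents."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["id"], row["harmonized_name"])].append(row["content"])
    return {key: values for key, values in grouped.items() if len(values) > 1}


repeated = find_repeated_keys(rows)
print(repeated)  # {(4585963, 'env_medium'): ['dust', 'drywall']}
```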

Challenges with Multi-Part Annotations

Some annotations include both a label and a CURIE, such as:

"soil [envo:123456]"

Since the coverage sum compares extracted labels against raw text length, cases like this may never reach 100% coverage.
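
The arithmetic can be sketched as follows. The exact coverage formula is not given here, so this example assumes the simplest reading: the total length of the matched labels divided by the length of the raw text. The function name and that formula are assumptions for illustration only.

```python
def coverage(raw_text, matched_labels):
    """Fraction of raw_text characters accounted for by the matched labels.

    Assumes the labels do not overlap and counts each label once if it
    occurs in the raw text.
    """
    if not raw_text:
        return 0.0
    covered = sum(len(label) for label in matched_labels if label in raw_text)
    return covered / len(raw_text)


raw = "soil [envo:123456]"
print(len(raw))                           # 18
print(round(coverage(raw, ["soil"]), 3))  # 0.222
print(coverage("soil", ["soil"]))         # 1.0
```

Under this reading, the bracketed CURIE accounts for 14 of the 18 characters, so a perfect label match on "soil" still scores only about 22%.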

To address these issues:

- Redesign ranking logic to account for multiple rows per attribute.
- Improve coverage calculations to recognize multi-part annotations.
- Consider alternative ranking methods, potentially using ontology confidence scores.

This remains an active area of development, and refinements will be incorporated into future versions.

Previous Versions

Over the years, different methods and database formats have been used to structure NCBI Biosample XML data (biosample_set.xml.gz) into queryable relational forms. These methods have included:

- Direct transformation into MongoDB, followed by structured table extraction
- XQuery & BaseX XML database approaches
- Custom Python processing pipelines

These historical approaches have resulted in different database schemas and formats, including SQLite, PostgreSQL dumps, and now DuckDB. While the repository contains various relational representations of Biosample records, the focus going forward is on DuckDB as the primary format.

Users should be aware that past versions may differ in structure, and schema consistency across versions is not guaranteed. Users are encouraged to verify the schema details for their specific use case.

Intended Audience and Prioritization

This work is primarily supported by NMDC funding, with a focus on making the results useful for NMDC workflows. While the files are publicly available, the documentation and tools are not necessarily polished or bug-free for general end users.

Known Issues and Considerations

- DuckDB files can be opened in DBeaver, but double-clicking some tables to preview them may cause crashes on Ubuntu 22.
- The DuckDB files do not include indices, as they are typically unnecessary.
- Additional XML paths may be incorporated in future iterations based on demand and complexity.

The conversion process remains an active area of development, and user feedback on schema utility and performance is welcome.