NCBI Biosample DuckDB Files on NERSC Portal

Portal URL: https://portal.nersc.gov/project/m3408/biosamples_duckdb/

NERSC Filesystem Path: /global/cfs/cdirs/m3408/www/biosamples_duckdb


Current Schema and 2025-01-08 Version

The latest 2025-01-08 DuckDB file was built using the make duck-all target in the following repository:

The schema is subject to revision as improvements and refinements are made based on NMDC project needs.

Pipeline Overview

The process of converting NCBI Biosample XML data into DuckDB format consists of several key steps:

  1. Downloading the XML Data
    The latest biosample_set.xml.gz is retrieved from NCBI's FTP server.

  2. Loading into MongoDB
    The XML data is parsed and stored in a MongoDB collection (biosamples).

  3. Extracting Relational Data

  4. Inserting into DuckDB

This pipeline is orchestrated using Makefiles and Python scripts, with batch processing optimizations to handle large datasets efficiently.

Additional post-processing steps may include annotation enrichment, schema adjustments, and metadata documentation.
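As a rough illustration of step 3, the sketch below flattens Attribute elements from a BioSample record into rows shaped like the ATTRIBUTE table. It is not the project's code: it reads XML directly instead of going through MongoDB, the sample record is hand-written rather than real data, and the element and attribute names (BioSample, Attribute, attribute_name, harmonized_name, display_name, unit) are assumed from the column names used in this document. The function name attribute_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one record of biosample_set.xml (illustrative, not real data).
SAMPLE_XML = """
<BioSampleSet>
  <BioSample id="1" accession="SAMN_EXAMPLE">
    <Attributes>
      <Attribute attribute_name="env_material" harmonized_name="env_medium" display_name="environmental medium">dust</Attribute>
      <Attribute attribute_name="depth" harmonized_name="depth" display_name="depth" unit="m">0.1</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""


def attribute_rows(xml_text):
    """Yield one dict per Attribute element, keyed like the ATTRIBUTE table columns."""
    root = ET.fromstring(xml_text.strip())
    for sample in root.iter("BioSample"):
        for attr in sample.iter("Attribute"):
            yield {
                "id": int(sample.get("id")),
                "content": (attr.text or "").strip(),
                "attribute_name": attr.get("attribute_name"),
                "harmonized_name": attr.get("harmonized_name"),
                "display_name": attr.get("display_name"),
                "unit": attr.get("unit"),  # None when the XML attribute is absent
            }


rows = list(attribute_rows(SAMPLE_XML))
```

For the full biosample_set.xml.gz, a streaming parser such as xml.etree.ElementTree.iterparse would likely be needed, since loading the whole file into memory at once is impractical.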


Extracted Paths

The following paths are extracted from the NCBI BioSamples XML data and stored as tables in DuckDB:

Omitted Paths

Some paths present in the XML data are not included in the DuckDB database, such as:

These omissions might be due to complexity, relevance, or data volume considerations.


Entity Relationships for Extracted CURIes

The process of extracting CURIes from NCBI Biosamples involves multiple relational tables. The following diagram illustrates how asserted CURIes (regex-based extraction) and NER-extracted CURIes (Named Entity Recognition) are structured within the database.

[Figure: Entity Relationship Diagram]

Extracted CURIes Schema Overview

This structure enables querying for both asserted and NER-extracted CURIes while maintaining links to the original attribute data.

  1. ATTRIBUTE contains the original metadata from NCBI Biosamples, including content, attribute_name, harmonized_name, display_name, and unit.
  2. CONTEXTS_TO_NORMALIZED_STRINGS maps attributes to their normalized representations.
  3. NORMALIZED_CONTEXT_STRINGS ensures that CURIes and extracted terms are consistently linked.
  4. CURIES_ASSERTED stores CURIes extracted via regular expressions from text.
  5. CURIES_NER stores CURIes extracted via Named Entity Recognition (NER), with additional filters applied (is_longest_match = TRUE and subsumed = FALSE).
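To make the idea of asserted CURIes concrete, here is a minimal sketch of regex-based extraction. The pattern, the function name find_asserted_curies, and the choice to upper-case prefixes are all assumptions for illustration; the regular expression actually used to populate CURIES_ASSERTED is not given in this document and may differ.

```python
import re

# Hypothetical pattern: a prefix starting with a letter, a colon, then a numeric local ID.
CURIE_PATTERN = re.compile(r"\b([A-Za-z][A-Za-z0-9_]*):(\d+)\b")


def find_asserted_curies(text):
    """Return (prefix, local_id, start, end) for each CURIe-like token in text."""
    return [
        (m.group(1).upper(), m.group(2), m.start(), m.end())
        for m in CURIE_PATTERN.finditer(text)
    ]


hits = find_asserted_curies("soil [envo:123456]")
```

Returning character offsets alongside each match keeps the link back to the original attribute text, which is the role the normalized-string tables play in the schema described above.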

Coverage Sum Calculation: Issues and Refinements

The coverage sum calculation is currently used to rank NER-extracted CURIes, but the methodology has limitations:

Assumption of One Row Per (attribute.id, harmonized_name)

The current ranking logic assumes a single row per (attribute.id, harmonized_name) pair, but real data contains multiple rows for the same pair.
Example (sample 4585963 has two env_medium values):

content            attribute_name  id       harmonized_name  display_name                       unit
-----------------  --------------  -------  ---------------  ---------------------------------  ----
urban biome        env biome       4585963  env_broad_scale  broad-scale environmental context
university campus  env feature     4585963  env_local_scale  local-scale environmental context
dust               env_material    4585963  env_medium       environmental medium
drywall            material        4585963  env_medium       environmental medium
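One way to check how often the one-row assumption fails is to count rows per (id, harmonized_name) pair. The sketch below does this for the four example rows, as read from the table above; against the real database the same idea would be a GROUP BY with a HAVING COUNT(*) > 1 filter. The variable names are illustrative only.

```python
from collections import Counter

# Example rows as read from the table: (content, attribute_name, id, harmonized_name).
rows = [
    ("urban biome", "env biome", 4585963, "env_broad_scale"),
    ("university campus", "env feature", 4585963, "env_local_scale"),
    ("dust", "env_material", 4585963, "env_medium"),
    ("drywall", "material", 4585963, "env_medium"),
]

# Count how many rows share each (id, harmonized_name) pair.
pair_counts = Counter((row[2], row[3]) for row in rows)

# Keep only the pairs that break the one-row-per-pair assumption.
repeated_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}
```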

Challenges with Multi-Part Annotations

Some annotations include both a label and a CURIe, such as:

"soil [envo:123456]"

Because the coverage sum compares the length of the extracted labels with the length of the raw text, the bracketed CURIe adds characters that no label match covers, so cases like this may never reach 100% coverage.
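The effect can be seen with a small worked example. The exact coverage formula is not stated in this document, so the sketch assumes coverage is the fraction of raw-text characters that fall inside at least one matched span; the function name coverage and the span offsets are illustrative.

```python
def coverage(raw_text, matched_spans):
    """Fraction of characters in raw_text covered by at least one (start, end) span."""
    if not raw_text:
        return 0.0
    covered = set()
    for start, end in matched_spans:
        # Clamp each span to the bounds of the text before counting positions.
        covered.update(range(max(start, 0), min(end, len(raw_text))))
    return len(covered) / len(raw_text)


raw = "soil [envo:123456]"  # 18 characters in total

# Only the label "soil" (characters 0-3) is matched.
label_only = coverage(raw, [(0, 4)])

# The label plus the CURIe text "envo:123456" (characters 6-16) are matched.
label_and_curie = coverage(raw, [(0, 4), (6, 17)])
```

Under this assumption, matching the label alone covers 4 of 18 characters, and even counting the CURIe text as matched covers only 15 of 18, because the space and the two brackets are never matched.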

Next Steps for Refinement

To address these issues:

This remains an active area of development, and refinements will be incorporated into future versions.


Previous Versions

Over the years, different methods and database formats have been used to structure NCBI Biosample XML data (biosample_set.xml.gz) into queryable relational forms. These methods have included:

These historical approaches have resulted in different database schemas and formats, including SQLite, PostgreSQL dumps, and now DuckDB. While the repository contains various relational representations of Biosample records, the focus going forward is on DuckDB as the primary format.

Past versions may differ in structure, and schema consistency across versions is not guaranteed, so users should verify the schema details for their specific use case.


Intended Audience and Prioritization

This work is primarily supported by NMDC funding, with a focus on making the results useful for NMDC workflows. While the files are publicly available, the documentation and tools are not guaranteed to be bug-free or polished for general end-user consumption.


Known Issues and Considerations


The conversion process remains an active area of development, and user feedback on schema utility and performance is welcome.