Portal URL: https://portal.nersc.gov/project/m3408/biosamples_duckdb/
NERSC Filesystem Path: /global/cfs/cdirs/m3408/www/biosamples_duckdb
The latest 2025-01-08 DuckDB file was built using the make duck-all target in the following repository:
The schema is subject to revision as improvements and refinements are made based on NMDC project needs.
The process of converting NCBI Biosample XML data into DuckDB format consists of several key steps:
1. Downloading the XML Data: The latest biosample_set.xml.gz is retrieved from NCBI's FTP server.
2. Loading into MongoDB: The XML data is parsed and stored in a MongoDB collection (biosamples).
3. Extracting Relational Data
4. Inserting into DuckDB
This pipeline is orchestrated using Makefiles and Python scripts, with batch processing optimizations to handle large datasets efficiently.
Additional post-processing steps may include annotation enrichment, schema adjustments, and metadata documentation.
The following paths are extracted from the NCBI BioSamples XML data and stored as tables in DuckDB:
- BioSample
- BioSample.Attributes.Attribute
- BioSample.Curation
- BioSample.Description.Comment.Paragraph
- BioSample.Description.Organism
- BioSample.Description.Organism.OrganismName
- BioSample.Description.Synonym
- BioSample.Description.Title
- BioSample.Ids.Id
- BioSample.Links.Link
- BioSample.Models.Model
- BioSample.Owner.Name
- BioSample.Package
- BioSample.Status

Some paths present in the XML data are not included in the DuckDB database, such as:

- /BioSampleSet/BioSample/Owner/Contacts/Contact
- /BioSampleSet/BioSample/Description/Comment/Table
- /BioSampleSet/BioSample/Curation

These omissions might be due to complexity, relevance, or data volume considerations.
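As a rough illustration of how a path such as BioSample.Attributes.Attribute can become table rows, the sketch below streams a small XML snippet and emits one dictionary per attribute. It is not the project's actual code: it skips the MongoDB step entirely, the function name `iter_attribute_rows` is hypothetical, and the XML snippet is hand-written for this example using values from sample 4585963.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written stand-in for biosample_set.xml (illustrative, not real NCBI output).
SAMPLE_XML = b"""<BioSampleSet>
  <BioSample id="4585963">
    <Attributes>
      <Attribute attribute_name="env_material" harmonized_name="env_medium"
                 display_name="environmental medium">dust</Attribute>
      <Attribute attribute_name="material" harmonized_name="env_medium"
                 display_name="environmental medium">drywall</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""


def iter_attribute_rows(xml_stream):
    """Yield one dict per BioSample/Attributes/Attribute element.

    Hypothetical helper: column names follow the ATTRIBUTE table described
    in this document, but the real pipeline may flatten the data differently.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "BioSample":
            continue
        for attr in elem.iterfind("./Attributes/Attribute"):
            yield {
                "id": int(elem.get("id")),
                "content": (attr.text or "").strip(),
                "attribute_name": attr.get("attribute_name"),
                "harmonized_name": attr.get("harmonized_name"),
                "display_name": attr.get("display_name"),
                "unit": attr.get("unit"),  # None when the XML attribute is absent
            }
        elem.clear()  # drop the finished BioSample subtree to limit memory use


rows = list(iter_attribute_rows(io.BytesIO(SAMPLE_XML)))
for row in rows:
    print(row)
```

Streaming with `iterparse` and clearing each finished element keeps memory use bounded, which matters for a file the size of the full biosample_set.xml.gz. Rows shaped like these could then be bulk-inserted into a database table.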
The process of extracting CURIes from NCBI Biosamples involves multiple relational tables. The following diagram illustrates how asserted CURIes (regex-based extraction) and NER-extracted CURIes (Named Entity Recognition) are structured within the database.
Entity Relationship Diagram:

This structure enables querying for both asserted and NER-extracted CURIes while maintaining links to the original attribute data.
- ATTRIBUTE contains the original metadata from NCBI Biosamples, including content, attribute_name, harmonized_name, display_name, and unit.
- CONTEXTS_TO_NORMALIZED_STRINGS maps attributes to their normalized representations.
- NORMALIZED_CONTEXT_STRINGS ensures that CURIes and extracted terms are consistently linked.
- CURIES_ASSERTED stores CURIes extracted via regular expressions from text.
- CURIES_NER stores CURIes extracted via Named Entity Recognition (NER), with additional filters applied (is_longest_match = TRUE and subsumed = FALSE).

The coverage sum calculation is currently used to rank NER-extracted CURIes, but the methodology has limitations:
The current ranking logic assumes a single row per (attribute.id, harmonized_name) pair, but real-world cases show multiple rows for the same pair.
Example (Sample 4585963 has multiple env_medium values):
| content | attribute_name | id | harmonized_name | display_name | unit |
|---|---|---|---|---|---|
| urban biome | env biome | 4585963 | env_broad_scale | broad-scale environmental context | |
| university campus | env feature | 4585963 | env_local_scale | local-scale environmental context | |
| dust | env_material | 4585963 | env_medium | environmental medium | |
| drywall | material | 4585963 | env_medium | environmental medium | |
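A minimal sketch of how such repeated pairs could be detected, using the four example rows for sample 4585963. The function name `find_repeated_keys` is hypothetical, and the real check would presumably run as a query against the ATTRIBUTE table rather than over in-memory tuples.

```python
from collections import defaultdict

# Example rows for sample 4585963:
# (content, attribute_name, id, harmonized_name)
rows = [
    ("urban biome", "env biome", 4585963, "env_broad_scale"),
    ("university campus", "env feature", 4585963, "env_local_scale"),
    ("dust", "env_material", 4585963, "env_medium"),
    ("drywall", "material", 4585963, "env_medium"),
]


def find_repeated_keys(rows):
    """Return {(id, harmonized_name): [content, ...]} for keys seen more than once."""
    grouped = defaultdict(list)
    for content, _attribute_name, sample_id, harmonized_name in rows:
        grouped[(sample_id, harmonized_name)].append(content)
    return {key: contents for key, contents in grouped.items() if len(contents) > 1}


repeated = find_repeated_keys(rows)
print(repeated)  # {(4585963, 'env_medium'): ['dust', 'drywall']}
```

Any ranking step that keys on the (id, harmonized_name) pair alone would need to decide how to treat the two env_medium values reported here.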
Some annotations include both a label and a CURIe, such as:
"soil [envo:123456]"
Since the coverage sum compares extracted labels against raw text length, cases like this may never reach 100% coverage.
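The effect can be made concrete with a small sketch. It assumes that coverage is the length of the matched label divided by the length of the text it was matched in, which is a simplification of the coverage sum described here. The regular expression for bracketed CURIes and both function names are likewise assumptions for illustration. Stripping the bracketed CURIe before scoring is shown as one possible mitigation, not necessarily the approach the project uses.

```python
import re

# Assumed pattern for a bracketed CURIe such as "[envo:123456]";
# the pipeline's actual regular expression may differ.
BRACKETED_CURIE = re.compile(r"\s*\[([A-Za-z][A-Za-z0-9_.]*:[A-Za-z0-9_]+)\]")


def coverage(matched_label, text):
    """Fraction of the characters in text accounted for by matched_label."""
    return len(matched_label) / len(text) if text else 0.0


def strip_bracketed_curies(text):
    """Return (text without bracketed CURIes, list of the CURIes removed)."""
    curies = BRACKETED_CURIE.findall(text)
    return BRACKETED_CURIE.sub("", text).strip(), curies


raw = "soil [envo:123456]"
label_only, asserted = strip_bracketed_curies(raw)

raw_coverage = coverage("soil", raw)             # 4 of 18 characters
cleaned_coverage = coverage("soil", label_only)  # 4 of 4 characters

print(asserted, round(raw_coverage, 3), cleaned_coverage)
```

Under these assumptions the label "soil" covers only 4 of the 18 raw characters, but all 4 characters once the bracketed CURIe is set aside as an asserted identifier.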
To address these issues:
This remains an active area of development, and refinements will be incorporated into future versions.
Over the years, different methods and database formats have been used to structure NCBI Biosample XML data (biosample_set.xml.gz) into queryable relational forms. These methods have included:
These historical approaches have resulted in different database schemas and formats, including SQLite, PostgreSQL dumps, and now DuckDB. While the repository contains various relational representations of Biosample records, the focus going forward is on DuckDB as the primary format.
Users should be aware that past versions may differ in structure, and schema consistency across versions is not guaranteed. Users are encouraged to verify the schema details for their specific use case.
This work is primarily supported by NMDC funding, with a focus on making the results useful for NMDC workflows. Although the results are publicly available, the documentation and tools are not guaranteed to be bug-free or optimized for general end-user consumption.
The conversion process remains an active area of development, and user feedback on schema utility and performance is welcome.