This project flattens and enriches biosample metadata from NCBI, GOLD, and NMDC, producing portable DuckDB databases.
⚠️ Documentation freshness: Some of the inputs into this auto-generated documentation may be outdated or contradictory. Always defer to the most recent docs and confirm counts with the current DuckDB exports. Older portals are kept for provenance only.
✅ Active, authoritative portal:
https://portal.nersc.gov/cfs/m3408/biosamples\_duckdb/
⚠️ Historical portal (for provenance
only):
https://portal.nersc.gov/project/m3513/biosample
NCBI BioSamples total: ~48 million as of 2025-09-19
DuckDB subset:
biosamples_flattened (current
authoritative count)Filtered because samples must:
collection_date ≥ 2017-01-01env_broad_scaleenv_local_scaleenv_mediumRationale: Matches the 2017–2024 window of the Google Earth Satellite Embedding V1 Annual.
biosamples_attributes)Two complementary DuckDBs are offered:
| Build | Entity Coverage | Field Richness | Processing Lineage | Typical Use |
|---|---|---|---|---|
| API-based | Studies and Sequencing Projects related to Biosamples (no Organism records or Organism-related Studies, etc.) | ~50+ biosample fields, full triads (e.g. latitude/longitude, env triads, sequencing depth) | GOLD API → MongoDB → DuckDB | Rich biosample analyses |
| Excel-based | All entity types (incl. isolates) | ~15 fields per entity (e.g. study_id, project_id, organism_name, basic metadata) | Bulk Excel export → flatten → DuckDB | Broad coverage, lightweight schema |
💡 When to choose: API build for depth; Excel build for breadth. Use API when you need triads and rich biosample fields; Excel when you need isolates or all entities but can tolerate fewer columns.
Note: SQL queries are intentionally omitted for now. This section focuses on table shape, usage, and caveats.
biosamples_flattened
— 3,037,277 rowsGrain: 1 row per BioSample accession
Canonicality: Canonical (authoritative for sample counts)
Parent source: Derived from
biosamples collection which is loaded from the NCBI
XML
Safe joins: join on
accession
Anti-patterns: don’t count samples using
env_triads_flattened or
biosamples_attributes
includes the following XML paths
BioSample.Attributes.Attribute
harmonized_name
BioSample.Curation
BioSample.Description.Comment.Paragraph
BioSample.Description.Organism
BioSample.Description.Organism.OrganismName
BioSample.Description.Synonym
BioSample.Description.Title
BioSample.Models.Model
BioSample.Owner.Name
BioSample.Package
BioSample.Status
Excludes
package_contentThe following are captured in separate tables and described below
harmonized_namebiosamples_attributes
— 52,518,729 rowsbiosamples
collection (raw long form)accessionattribute_name="isolation source",
content="soil"attribute_name="altitude",
content="300 m"env_triads_flattened
— 9,262,719 rowsbiosamples_flattened
(triad fields exploded and normalized)accessionenv_broad_scale="marine biome"env_medium="seawater"measurement_results_skip_filtered
— 87,466 rowsbiosamples_attributes(accession, harmonized_name)sra_biosamples_bioprojects
— 31,809,491 rowsbiosample_accessionbiosamples_ids —
7,871,449 rowsbiosamples
collectiondb="SRA", is_primary=truelabel="BioSample: SAMN12345678"biosamples_links —
2,335,376 rowsbiosamples
collectioncontent_pairs_aggregated
— 2,331,732 rowsattribute_harmonized_pairings
— 20,937 rowsncbi_attributes_flattened
— 960 rowsncbi_packages_flattened
— 229 rowsharmonized_name_usage_stats
— 695 rowsmeasurement_evidence_percentages
— 695 rowsharmonized_names are associated with measurement like
content valuesmixed_content_counts and unit_assertion_counts
collectionsmixed_content_counts —
440 rowscontent
values with numbers + lettersharmonized_name_dimensional_stats
— 432 rowsharmonized_name areunit_assertion_counts —
13 rowsunits
assertions included in submitted attributes (rare)biosamples_flattened ──→ env_triads_flattened
│
└─→ biosamples_attributes ──→ measurement_results_skip_filtered
│
└─→ content_pairs_aggregated
Summaries (e.g., harmonized_name_*,
mixed_content_counts) derive from these.
This design is intentionally denormalized for analytics. Think of it as a star-schema/lakehouse: fact tables at different grains plus conformed dimensions. It is not 3NF OLTP.
Implication: duplication across tables is deliberate. Correct use requires respecting row grain and join rules.
.duckdb filescontent_pairs_aggregated at 45M scale) may be
impracticalbiosample-xmldb-sqldb, biosample-basex) —
overlapping efforts, different XML DB enginessample-annotator, biosample-analysis)Generated: 2025-09-30
| Owner/Repo Notation | DB/Tech | Role | First Commit | Last Commit | Top Contributors | Notes |
|---|---|---|---|---|---|---|
turbomam/biosample-xmldb-sqldb |
XML→SQL | NCBI XML relational prototype | 2024-08-16 | 2024-08-16 | turbomam (76) | BaseX → SQL transformation |
turbomam/biosample-basex |
BaseX (XML DB) | XML-native store | 2024-08-16 | 2024-08-16 | turbomam (222) | Using BaseX XML database for structure discovery |
INCATools/biosample-analysis |
Python/OAK | Analysis notebooks, cross-repo | 2022-01-05 | 2022-01-05 | wdduncan (187), hrshdhgd (176), turbomam (34) | Analysis of biosamples in INSDC |
microbiomedata/sample-annotator |
MongoDB | Early flattening pipelines | 2025-04-21 | 2025-04-21 | turbomam (93), sujaypatil96 (83), cmungall (27) | NMDC Sample Annotator - predates xmldb/basex work |
microbiomedata/external-metadata-awareness |
MongoDB + SQLite + DuckDB | Large-scale flattening, 40M+ corpus | 2025-09-25 | 2025-09-25 | turbomam (491), dependabot[bot] (3) | Tools for fetching and processing external schemas, ontologies and metadata |
contextualizer-ai/to-duckdb |
DuckDB | Universal migration tooling | 2025-09-20 | 2025-09-20 | turbomam (15) | Tools for converting content in other databases to DuckDB |
contextualizer-ai/gold-tools |
MongoDB + DuckDB | GOLD API ingestion | 2025-09-16 | 2025-09-16 | turbomam (2) | GOLD metadata ingestion into MongoDB and flattening |
Export: YYYY-MM-DD • Build: vX.Y • Code commit: <hash>duckdb -c "SELECT table_name, row_count FROM duckdb_tables()" file.duckdb
git log --reverse --format="%ad %h %s" | head -1
git log -1 --format="%ad %h %s"
git shortlog -sne --all | head -20