Biosample Flattening Project – DuckDB Outputs

Biosample Flattening Project – DuckDB Outputs

Table of Contents


Overview

This project flattens and enriches biosample metadata from NCBI, GOLD, and NMDC, producing portable DuckDB databases.

⚠️ Documentation freshness: Some of the inputs into this auto-generated documentation may be outdated or contradictory. Always defer to the most recent docs and confirm counts with the current DuckDB exports. Older portals are kept for provenance only.


Hosting and Access


Scope and Filtering


Current DuckDB Offerings

NCBI Outputs

GOLD Outputs

Two complementary DuckDBs are offered:

Build Entity Coverage Field Richness Processing Lineage Typical Use
API-based Studies and Sequencing Projects related to Biosamples (no Organism records or Organism-related Studies, etc.) ~50+ biosample fields, full triads (e.g. latitude/longitude, env triads, sequencing depth) GOLD API → MongoDB → DuckDB Rich biosample analyses
Excel-based All entity types (incl. isolates) ~15 fields per entity (e.g. study_id, project_id, organism_name, basic metadata) Bulk Excel export → flatten → DuckDB Broad coverage, lightweight schema

💡 When to choose: API build for depth; Excel build for breadth. Use API when you need triads and rich biosample fields; Excel when you need isolates or all entities but can tolerate fewer columns.

NMDC Outputs


Key Tables (NCBI Subset)

Note: SQL queries are intentionally omitted for now. This section focuses on table shape, usage, and caveats.

biosamples_flattened — 3,037,277 rows

biosamples_attributes — 52,518,729 rows

env_triads_flattened — 9,262,719 rows

measurement_results_skip_filtered — 87,466 rows

sra_biosamples_bioprojects — 31,809,491 rows

biosamples_ids — 7,871,449 rows

content_pairs_aggregated — 2,331,732 rows

attribute_harmonized_pairings — 20,937 rows

ncbi_attributes_flattened — 960 rows

ncbi_packages_flattened — 229 rows

harmonized_name_usage_stats — 695 rows

measurement_evidence_percentages — 695 rows

mixed_content_counts — 440 rows

harmonized_name_dimensional_stats — 432 rows

unit_assertion_counts — 13 rows


Lineage and Grain

Data Flow

biosamples_flattened ──→ env_triads_flattened
       │
       └─→ biosamples_attributes ──→ measurement_results_skip_filtered
                         │
                         └─→ content_pairs_aggregated

Summaries (e.g., harmonized_name_*, mixed_content_counts) derive from these.

Analytics vs Normalization

This design is intentionally denormalized for analytics. Think of it as a star-schema/lakehouse: fact tables at different grains plus conformed dimensions. It is not 3NF OLTP.

Implication: duplication across tables is deliberate. Correct use requires respecting row grain and join rules.


Why DuckDB Now


Strengths


Limitations


Comparison to Past Outputs


Historical Systems (Repos × Engines × Timeline)

Generated: 2025-09-30

Owner/Repo Notation DB/Tech Role First Commit Last Commit Top Contributors Notes
turbomam/biosample-xmldb-sqldb XML→SQL NCBI XML relational prototype 2024-08-16 2024-08-16 turbomam (76) BaseX → SQL transformation
turbomam/biosample-basex BaseX (XML DB) XML-native store 2024-08-16 2024-08-16 turbomam (222) Using BaseX XML database for structure discovery
INCATools/biosample-analysis Python/OAK Analysis notebooks, cross-repo 2022-01-05 2022-01-05 wdduncan (187), hrshdhgd (176), turbomam (34) Analysis of biosamples in INSDC
microbiomedata/sample-annotator MongoDB Early flattening pipelines 2025-04-21 2025-04-21 turbomam (93), sujaypatil96 (83), cmungall (27) NMDC Sample Annotator - predates xmldb/basex work
microbiomedata/external-metadata-awareness MongoDB + SQLite + DuckDB Large-scale flattening, 40M+ corpus 2025-09-25 2025-09-25 turbomam (491), dependabot[bot] (3) Tools for fetching and processing external schemas, ontologies and metadata
contextualizer-ai/to-duckdb DuckDB Universal migration tooling 2025-09-20 2025-09-20 turbomam (15) Tools for converting content in other databases to DuckDB
contextualizer-ai/gold-tools MongoDB + DuckDB GOLD API ingestion 2025-09-16 2025-09-16 turbomam (2) GOLD metadata ingestion into MongoDB and flattening

Validation Checklist

  1. Version banner (add to README header):
    Export: YYYY-MM-DD • Build: vX.Y • Code commit: <hash>
  2. DuckDB table counts
duckdb -c "SELECT table_name, row_count FROM duckdb_tables()" file.duckdb
  1. GitHub repo timelines & contributors
git log --reverse --format="%ad %h %s" | head -1
git log -1 --format="%ad %h %s"
git shortlog -sne --all | head -20

Known Discrepancies


Future Directions