Standardize and append a batch of data

Here, we’ll learn how to standardize a less well-curated collection and how to append it to the growing versioned collection.

import lamindb as ln
import bionty as bt

ln.settings.transform.stem_uid = "ManDYgmftZ8C"
ln.settings.transform.version = "1"
ln.track()

💡 connected lamindb: testuser1/test-scrna

💡 notebook imports: bionty==0.47.1 lamindb==0.75.0

💡 saved: Transform(uid='ManDYgmftZ8C5zKv', version='1', name='Standardize and append a batch of data', key='scrna2', type='notebook', created_by_id=1, updated_at='2024-08-05 13:24:10 UTC')

💡 saved: Run(uid='fztUH5IWMI1hC1YcaDXC', transform_id=2, created_by_id=1)

Run(uid='fztUH5IWMI1hC1YcaDXC', started_at='2024-08-05 13:24:10 UTC', is_consecutive=True, transform_id=2, created_by_id=1)

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata

We are still working with human data, and can globally set an organism:

bt.settings.organism = "human"

curate = ln.Curate.from_anndata(adata, var_index=bt.Gene.symbol, categoricals={adata.obs.cell_type.name: bt.CellType.name})

❗ 3 non-validated categories are not saved in Feature.name: ['n_genes', 'louvain', 'percent_mito']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

✅ added 5 records from public with Gene.symbol for var_index: 'GPX1', 'SOD2', 'RN7SL1', 'SNORD3B-2', 'IGLL5'

❗ 11 non-validated categories are not saved in Gene.symbol: ['RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index

Standardize & validate genes

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is non-unique: a symbol can match more than one Ensembl ID, and only the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()

💡 standardized 754/765 terms
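To make the "first match is kept" behavior concrete, here is a minimal sketch in plain Python. It is not bionty's implementation: the lookup table, the gene symbols, the IDs, and the helper name standardize_keep_first are all made up for illustration. In this sketch, symbols without a match pass through unchanged, which is why a separate validation step is still useful afterwards.

```python
# Toy symbol -> Ensembl ID table; symbols and IDs are invented for illustration.
SYMBOL_TO_IDS = [
    ("GENE_A", "ENSG_TOY_0001"),
    ("GENE_B", "ENSG_TOY_0002"),
    ("GENE_B", "ENSG_TOY_0003"),  # second match for GENE_B, ignored below
    ("GENE_C", "ENSG_TOY_0004"),
]


def standardize_keep_first(symbols, table):
    """Map each symbol to its first matching ID; leave unmatched symbols as-is."""
    first_match = {}
    for symbol, ensembl_id in table:
        # setdefault only stores the first ID seen for a symbol
        first_match.setdefault(symbol, ensembl_id)
    return [first_match.get(symbol, symbol) for symbol in symbols]


queries = ["GENE_B", "GENE_A", "NOT_A_GENE"]
mapped = standardize_keep_first(queries, SYMBOL_TO_IDS)
# count how many queries were actually replaced by an ID
n_standardized = sum(new != old for new, old in zip(mapped, queries))
print(mapped, f"standardized {n_standardized}/{len(queries)} terms")
```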

Here, we’ll also subset .raw to the validated genes and give it the same index:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

curate = ln.Curate.from_anndata(adata_validated, var_index=bt.Gene.ensembl_gene_id, categoricals={"cell_type": bt.CellType.name})

❗ 3 non-validated categories are not saved in Feature.name: ['n_genes', 'louvain', 'percent_mito']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

curate.validate()

✅ var_index is validated against Gene.ensembl_gene_id

💡 mapping cell_type on CellType.name

❗    9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
      → save terms via .add_new_from('cell_type')

False
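Conceptually, the check above compares the values of a column against the names stored in a registry and reports those that are missing. The following is a minimal sketch of that idea only, not lamindb's code; the function validate_terms and the toy registry content are hypothetical.

```python
def validate_terms(values, registry_names):
    """Return (all_valid, non_validated) for values checked against known names."""
    known = set(registry_names)
    # dict.fromkeys drops duplicates while preserving first-seen order
    non_validated = [v for v in dict.fromkeys(values) if v not in known]
    return len(non_validated) == 0, non_validated


# Toy registry content and toy observations, invented for illustration.
registry = {"B cell", "dendritic cell", "monocyte"}
obs_cell_types = ["CD19+ B", "Dendritic cells", "CD19+ B", "monocyte"]

ok, missing = validate_terms(obs_cell_types, registry)
print(ok, missing)
```

The check is exact and case-sensitive in this sketch, so 'Dendritic cells' does not match 'dendritic cell'.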

Standardize & validate cell types

Since none of the cell types are validated, let us search for the cell type names in the public ontology and add each name found in the AnnData object as a synonym of its top match in the public ontology.

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
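The search-then-map pattern used above can be sketched without a database. In this sketch, difflib from the Python standard library stands in for the ontology search; it is a different matching technique from bionty's search, and the ontology names, labels, and the helper top_match are assumptions made for illustration.

```python
import difflib

# Toy list of ontology names, invented for illustration.
ONTOLOGY_NAMES = ["B cell", "dendritic cell", "natural killer cell", "monocyte"]


def top_match(query, candidates):
    """Return the candidate most similar to query (case-insensitive)."""
    lowered = {c.lower(): c for c in candidates}
    # cutoff=0.0 guarantees that the single best-scoring candidate is returned
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[hits[0]]


labels = ["Dendritic cells", "Monocytes", "B cells", "Monocytes"]

# build the mapper once per unique label, then apply it to every observation
name_mapper = {name: top_match(name, ONTOLOGY_NAMES) for name in dict.fromkeys(labels)}
standardized = [name_mapper[name] for name in labels]
print(name_mapper)
print(standardized)
```

A top match is not necessarily the correct term, so it is worth inspecting name_mapper before overwriting the original labels.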

Now, all cell types are validated:

curate.validate()

✅ var_index is validated against Gene.ensembl_gene_id

✅ cell_type is validated against CellType.name

True

Register

artifact = curate.save_artifact(description="10x reference adata")

💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/nwfXKFQAHARNDPqv35hZ.h5ad')

✅ storing artifact 'nwfXKFQAHARNDPqv35hZ' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/nwfXKFQAHARNDPqv35hZ.h5ad'

💡 parsing feature names of X stored in slot 'var'

✅    754 terms (100.00%) are validated for ensembl_gene_id

✅    linked: FeatureSet(uid='vRpEKk3L0cjBZsCAb4MP', n=754, dtype='float', registry='bionty.Gene', hash='j8QkIeLBgJwsscY4vVPx1A', created_by_id=1, run_id=2)

💡 parsing feature names of slot 'obs'

✅    1 term (25.00%) is validated for name

❗    3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain

✅    linked: FeatureSet(uid='6zUJu7hGWjD0q45fuUFD', n=1, registry='Feature', hash='eM0F81LwSjjzeSzsIRWoFg', created_by_id=1, run_id=2)

✅ saved 2 feature sets for slots: 'var','obs'

artifact.view_lineage()

[Figure: data lineage graph of the artifact]

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.ordered_artifacts.first()],
    is_new_version_of=collection_v1,
)
collection_v2.save()
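The versioning step can be pictured with a small, purely illustrative data model: a new version keeps the name, increments the version, lists its own artifacts, and leaves the previous version untouched. ToyCollection and new_version are hypothetical names and do not reflect how lamindb stores collections.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a versioned collection; not the lamindb class."""

    name: str
    version: str
    artifacts: Tuple[str, ...]
    previous: Optional["ToyCollection"] = None


def new_version(previous, artifacts):
    """Create the next version: same name, incremented version, new artifact list."""
    return ToyCollection(
        name=previous.name,
        version=str(int(previous.version) + 1),
        artifacts=tuple(artifacts),
        previous=previous,
    )


v1 = ToyCollection("My versioned scRNA-seq collection", "1", ("artifact_v1",))
# the new version spans the new artifact and the artifact behind version 1
v2 = new_version(v1, ["artifact_new", *v1.artifacts])
print(v2.version, v2.artifacts)
```

Because the toy class is frozen, version 1 remains queryable in its original state after version 2 is created.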

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()

Collection(uid='A22kL5r80OubMlqzD8fj', version='2', name='My versioned scRNA-seq collection', hash='Umjxg4HR1wkZqKROsyz1sw', visibility=1, updated_at='2024-08-05 13:24:42 UTC')
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a batch of data'
    .run = '2024-08-05 13:24:10 UTC'
  Feature sets
    'obs' = 'donor', 'tissue', 'cell_type', 'assay'
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'

View data lineage:

collection_v2.view_lineage()

[Figure: data lineage graph of version 2 of the collection]