Datasets

Molecular Property Prediction

Tox21

class dgllife.data.Tox21(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./tox21_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

Tox21 dataset.

Quoting [1], “The ‘Toxicology in the 21st Century’ (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways.” Each target result is a binary label.

A common issue for multi-task prediction is that some datapoints are not labeled for all tasks. This is also the case for Tox21. In data pre-processing, we set non-existing labels to be 0 so that they can be placed in tensors and used for masking in loss computation.
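As a rough illustration of how the masks enter the loss, the sketch below computes a masked binary cross-entropy for one datapoint in plain Python. The helper name masked_bce and the predicted probabilities are hypothetical and not part of dgllife. The point is that a placeholder label of 0 with mask 0 contributes nothing to the loss.

```python
import math

def masked_bce(probs, labels, masks):
    """Mean binary cross-entropy over the entries whose mask is 1."""
    total, count = 0.0, 0
    for p, y, m in zip(probs, labels, masks):
        if m == 0:
            continue  # missing label: the placeholder 0 must not contribute
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0

# Hypothetical predicted probabilities for four tasks of one molecule.
probs = [0.1, 0.2, 0.7, 0.9]
labels = [0.0, 0.0, 1.0, 0.0]  # the last 0 is a placeholder, not a measurement
masks = [1.0, 1.0, 1.0, 0.0]

loss = masked_bce(probs, labels, masks)
# Changing the prediction for the masked task leaves the loss unchanged.
same = masked_bce([0.1, 0.2, 0.7, 0.01], labels, masks)
```

With tensors, the same effect is usually obtained by multiplying an unreduced per-element loss by the mask before averaging.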

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs are saved so that they can be reloaded instead of being reconstructed every time.

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './tox21_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import Tox21
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = Tox21(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
7831
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('CCOc1ccc2nc(S(N)(=O)=O)sc2c1',
 DGLGraph(num_nodes=16, num_edges=34,
          ndata_schemes={}
          edata_schemes={}),
 tensor([0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.]),
 tensor([1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 1.]))

The dataset instance also contains information about molecule ids.

>>> dataset.id[i]

We can also get the id along with SMILES, DGLGraph, labels, and masks at once.

>>> dataset.load_full = True
>>> dataset[0]
('CCOc1ccc2nc(S(N)(=O)=O)sc2c1',
 DGLGraph(num_nodes=16, num_edges=34,
          ndata_schemes={}
          edata_schemes={}),
 tensor([0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.]),
 tensor([1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 1.]),
 'TOX3021')

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(1000)
>>> dataset.task_pos_weights(train_ids)
tensor([26.9706, 35.3750,  5.9756, 21.6364,  6.4404, 21.4500, 26.0000,  5.0826,
        21.4390, 14.7692,  6.1442, 12.4308])

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

  • str, optional – Id for the ith datapoint, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
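The weighting rule above can be sketched in plain Python as follows. The function name pos_weights_sketch and the toy data are hypothetical. The sketch assumes that entries with mask 0 are unlabeled and not counted, and that the weight stays at 1 for a task without positive samples; neither detail is confirmed by the text above.

```python
def pos_weights_sketch(labels, masks):
    """Per-task weight for positive samples: #negatives / #positives.

    labels and masks are lists of rows, one row per datapoint and one
    column per task. Entries with mask 0 are skipped (assumption).
    """
    num_tasks = len(labels[0])
    weights = []
    for t in range(num_tasks):
        num_pos = sum(row[t] for row, m in zip(labels, masks) if m[t] == 1)
        num_labeled = sum(m[t] for m in masks)
        num_neg = num_labeled - num_pos
        # Keep the weight at 1 when there are no positives (assumption),
        # which also avoids a division by zero.
        weights.append(num_neg / num_pos if num_pos > 0 else 1.0)
    return weights

# Toy data: 4 datapoints, 2 tasks; the last datapoint is unlabeled on task 1.
labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
masks = [[1, 1], [1, 1], [1, 1], [1, 0]]
weights = pos_weights_sketch(labels, masks)
```

Task 0 has one positive and three negatives, and task 1 has one positive and two labeled negatives, so the positive weights are 3 and 2 respectively.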

ESOL

class dgllife.data.ESOL(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./esol_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

ESOL from MoleculeNet for the prediction of water solubility

Quoting [1], “ESOL is a small dataset consisting of water solubility data for 1128 compounds. The dataset has been used to train models that estimate solubility directly from chemical structures (as encoded in SMILES strings). Note that these structures don’t include 3D coordinates, since solubility is a property of a molecule and not of its particular conformers.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] ESOL: estimating aqueous solubility directly from molecular structure.

  • [3] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './esol_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> from dgllife.data import ESOL
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = ESOL(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
1128
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph and solubility
>>> dataset[0]
('OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O ',
 DGLGraph(num_nodes=32, num_edges=68,
          ndata_schemes={}
          edata_schemes={}),
 tensor([-0.7700]))

We also provide the name, estimated solubility, minimum atom degree, molecular weight, number of hydrogen bond donors, number of rings, number of rotatable bonds, and polar surface area of each compound.

>>> # Access the information mentioned above for the ith datapoint
>>> dataset.compound_names[i]
>>> dataset.estimated_solubility[i]
>>> dataset.min_degree[i]
>>> dataset.mol_weight[i]
>>> dataset.num_h_bond_donors[i]
>>> dataset.num_rings[i]
>>> dataset.num_rotatable_bonds[i]
>>> dataset.polar_surface_area[i]

We can also get all of this information along with SMILES, DGLGraph and solubility at once.

>>> dataset.load_full = True
>>> dataset[0]
('OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O ',
 DGLGraph(num_nodes=32, num_edges=68,
          ndata_schemes={}
          edata_schemes={}),
 tensor([-0.7700]),
 'Amigdalin',
 -0.974,
 1,
 457.43200000000013,
 7,
 3,
 7,
 202.32)

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (1) – Labels of the ith datapoint

  • str, optional – Name for the ith compound, returned only when self.load_full is True.

  • float, optional – Estimated solubility for the ith compound, returned only when self.load_full is True.

  • int, optional – Minimum atom degree for the ith datapoint, returned only when self.load_full is True.

  • float, optional – Molecular weight for the ith datapoint, returned only when self.load_full is True.

  • int, optional – Number of hydrogen bond donors for the ith datapoint, returned only when self.load_full is True.

  • int, optional – Number of rings in the ith datapoint, returned only when self.load_full is True.

  • int, optional – Number of rotatable bonds in the ith datapoint, returned only when self.load_full is True.

  • float, optional – Polar surface area for the ith datapoint, returned only when self.load_full is True.
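Because the full item is a flat tuple of eleven fields, it can help to give the positions names. The sketch below uses a hypothetical ESOLRecord named tuple (not part of dgllife) whose field order follows the Returns list above. The graph and label are replaced by placeholders so that the sketch runs without dgllife.

```python
from typing import Any, NamedTuple

class ESOLRecord(NamedTuple):
    """Hypothetical container mirroring the order of the returned fields."""
    smiles: str
    graph: Any   # a DGLGraph in practice
    label: Any   # a float32 tensor of shape (1) in practice
    compound_name: str
    estimated_solubility: float
    min_degree: int
    mol_weight: float
    num_h_bond_donors: int
    num_rings: int
    num_rotatable_bonds: int
    polar_surface_area: float

# Stand-in for dataset[0] with load_full = True, using the values shown in
# the example above; None and a plain list replace the graph and the tensor.
item = ('OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O ',
        None, [-0.77], 'Amigdalin', -0.974, 1, 457.43200000000013,
        7, 3, 7, 202.32)
record = ESOLRecord(*item)
```

Fields can then be read by name, for example record.mol_weight, instead of by position.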

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

FreeSolv

class dgllife.data.FreeSolv(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./freesolv_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

FreeSolv from MoleculeNet for the prediction of hydration free energy of small molecules in water

Quoting [1], “The Free Solvation Database (FreeSolv) provides experimental and calculated hydration free energy of small molecules in water. A subset of the compounds in the dataset are also used in the SAMPL blind prediction challenge. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. We include the experimental values in the benchmark collection, and use calculated values for comparison.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] FreeSolv: a database of experimental and calculated hydration free energies, with input files.

  • [3] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './freesolv_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> from dgllife.data import FreeSolv
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = FreeSolv(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
642
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph and hydration free energy
>>> dataset[0]
('CN(C)C(=O)c1ccc(cc1)OC',
 DGLGraph(num_nodes=13, num_edges=26,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([-11.0100]))

We also provide the IUPAC name and calculated hydration free energy of each compound.

>>> # Access the information mentioned above for the ith datapoint
>>> dataset.iupac_names[i]
>>> dataset.calc_energy[i]

We can also get all of this information along with SMILES, DGLGraph and hydration free energy at once.

>>> dataset.load_full = True
>>> dataset[0]
('CN(C)C(=O)c1ccc(cc1)OC',
 DGLGraph(num_nodes=13, num_edges=26,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([-11.0100]),
 '4-methoxy-N,N-dimethyl-benzamide',
 -9.625)

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (1) – Labels of the ith datapoint

  • str, optional – IUPAC nomenclature for the ith datapoint, returned only when self.load_full is True.

  • float, optional – Calculated hydration free energy for the ith datapoint, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

Lipophilicity

class dgllife.data.Lipophilicity(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./lipophilicity_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

Lipophilicity from MoleculeNet for the prediction of octanol/water distribution coefficient (logD at pH 7.4)

Quoting [1], “Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. This dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] ChEMBL Deposited Data Set - AZ dataset; 2015.

  • [3] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './lipophilicity_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> from dgllife.data import Lipophilicity
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = Lipophilicity(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
4200
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph and logD
>>> dataset[0]
('Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14',
 DGLGraph(num_nodes=24, num_edges=54,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([3.5400]))

We also provide information for the ChEMBL id of the compound.

>>> dataset.chembl_ids[i]

We can also get the ChEMBL id along with SMILES, DGLGraph and logD at once.

>>> dataset.load_full = True
>>> dataset[0]
('Cn1c(CN2CCN(CC2)c3ccc(Cl)cc3)nc4ccccc14',
 DGLGraph(num_nodes=24, num_edges=54,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([3.5400]),
 'CHEMBL596271')

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (1) – Labels of the ith datapoint

  • str, optional – ChEMBL id of the ith datapoint, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

PCBA

class dgllife.data.PCBA(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./pcba_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

PCBA from MoleculeNet for the prediction of biological activities

Quoting [1], “PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] Massively Multitask Networks for Drug Discovery.

  • [3] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './pcba_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import PCBA
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = PCBA(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
437929
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('CC(=O)N1CCC2(CC1)NC(=O)N(c1ccccc1)N2',
 DGLGraph(num_nodes=20, num_edges=44,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([0., ..., 0.]),
 tensor([1., ..., 0.]))

The dataset instance also contains information about molecule ids.

>>> dataset.ids[i]

We can also get the id along with SMILES, DGLGraph, labels, and masks at once.

>>> dataset.load_full = True
>>> dataset[0]
('CC(=O)N1CCC2(CC1)NC(=O)N(c1ccccc1)N2',
 DGLGraph(num_nodes=20, num_edges=44,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([0., ..., 0.]),
 tensor([1., ..., 0.]),
 'CID1511280')

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(1000)
>>> dataset.task_pos_weights(train_ids)
tensor([7.3400, 489.0000, ..., 1.0000])

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

  • str, optional – Id for the ith datapoint, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
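To show where such weights would be used, here is a minimal plain-Python sketch of a binary cross-entropy in which the positive term of each task is scaled by its weight and unlabeled entries are masked out. The function name, the probabilities and the weights are hypothetical values chosen for illustration. In PyTorch, the pos_weight argument of torch.nn.BCEWithLogitsLoss plays a comparable role.

```python
import math

def weighted_masked_bce(probs, labels, masks, pos_weights):
    """Masked binary cross-entropy with a per-task weight on positives."""
    total, count = 0.0, 0.0
    for p, y, m, w in zip(probs, labels, masks, pos_weights):
        # Positive term scaled by the task weight, negative term by 1.
        term = -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
        total += m * term
        count += m
    return total / count if count else 0.0

# One datapoint with three tasks; the third task is unlabeled.
probs = [0.5, 0.5, 0.5]
labels = [1.0, 0.0, 1.0]
masks = [1.0, 1.0, 0.0]
pos_weights = [3.0, 2.0, 1.0]  # hypothetical weights

loss = weighted_masked_bce(probs, labels, masks, pos_weights)
```

Here the positive on task 0 counts three times as much as the negative on task 1, giving a loss of (3 log 2 + log 2) / 2 = 2 log 2.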

MUV

class dgllife.data.MUV(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./muv_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

MUV from MoleculeNet for the prediction of biological activities

Quoting [1], “The Maximum Unbiased Validation (MUV) group is another benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './muv_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import MUV
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = MUV(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
93087
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('Cc1cccc(N2CCN(C(=O)C34CC5CC(CC(C5)C3)C4)CC2)c1C',
 Graph(num_nodes=26, num_edges=60,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 tensor([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0.]))

The dataset instance also contains information about molecule ids.

>>> dataset.ids[i]

We can also get the id along with SMILES, DGLGraph, labels, and masks at once.

>>> dataset.load_full = True
>>> dataset[0]
('Cc1cccc(N2CCN(C(=O)C34CC5CC(CC(C5)C3)C4)CC2)c1C',
 Graph(num_nodes=26, num_edges=60,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 tensor([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0.]),
 'CID2999678')

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(40000)
>>> dataset.task_pos_weights(train_ids)
tensor([ 537.5833,  458.0000,  424.8000,  413.6667,  463.8571, 1060.5000,
         568.3636,  386.7500,  921.1429,  523.9167,  487.0769,  480.8462,
        1262.8000,  702.1111,  571.3636,  528.0000,  485.2308])

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

  • str, optional – Id for the ith datapoint, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
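The indices would typically come from a train/held-out split. The sketch below builds such a split in plain Python. The helper split_indices, the 80% fraction and the seed are hypothetical choices, and the dataset size is the one reported in the example above. The resulting list would still need to be converted to a 1D LongTensor, for instance with torch.tensor(train_ids), before being passed to task_pos_weights.

```python
import random

def split_indices(num_datapoints, frac_train=0.8, seed=0):
    """Randomly split range(num_datapoints) into train and held-out indices."""
    indices = list(range(num_datapoints))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_train = int(frac_train * num_datapoints)
    return sorted(indices[:num_train]), sorted(indices[num_train:])

train_ids, heldout_ids = split_indices(93087)
```

Computing the weights on the training indices only keeps label statistics of the held-out datapoints out of the training loss.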

HIV

class dgllife.data.HIV(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./hiv_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

HIV from MoleculeNet for the prediction of the ability to inhibit HIV replication

Quoting [1], “The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).”
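The label merging described in the quote can be sketched as a simple mapping. The function name activity_to_label is hypothetical. The mapping of CI to 0 and of CA and CM to 1 follows the quoted description and is consistent with the example below, where the raw result 'CI' comes with the label 0.

```python
def activity_to_label(activity):
    """Map a raw screening result to a binary label: CI -> 0, CA or CM -> 1."""
    mapping = {'CI': 0.0, 'CA': 1.0, 'CM': 1.0}
    return mapping[activity]

labels = [activity_to_label(a) for a in ['CI', 'CM', 'CA']]
```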

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to './hiv_dglgraph.bin'.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import HIV
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = HIV(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
41127
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)=[O+]2',
 Graph(num_nodes=19, num_edges=40,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([0.]),
 tensor([1.]))

The dataset instance also contains information about the original screening result.

>>> dataset.activity[i]

We can also get the screening result along with SMILES, DGLGraph, labels, and masks at once.

>>> dataset.load_full = True
>>> dataset[0]
('CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)=[O+]2',
 Graph(num_nodes=19, num_edges=40,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([0.]),
 tensor([1.]),
 'CI')

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(20000)
>>> dataset.task_pos_weights(train_ids)
tensor([33.1880])

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

  • str, optional – Raw screening result for the ith datapoint, which can be CI, CA, or CM, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)

BACE

class dgllife.data.BACE(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./bace_dglgraph.bin', n_jobs=1)

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

BACE from MoleculeNet for the prediction of quantitative and qualitative binding results for a set of inhibitors of human beta-secretase 1 (BACE-1)

The dataset contains experimental values reported in scientific literature over the past decade, some with detailed crystal structures available. The MoleculeNet benchmark merged a collection of 1522 compounds with their 2D structures and binary labels.

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs, default to ‘bace_dglgraph.bin’.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import BACE
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = BACE(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
1513
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C',
 Graph(num_nodes=32, num_edges=70,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([1.]),
 tensor([1.]))

The dataset instance also contains information about molecule ids.

>>> dataset.ids[0]  # id of the 0th molecule

We can also get the id along with SMILES, DGLGraph, labels, and masks at once.

>>> dataset.load_full = True
>>> dataset[0]
('O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C',
 Graph(num_nodes=32, num_edges=70,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([1.]),
 tensor([1.]),
 'BACE_1')

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(500)
>>> dataset.task_pos_weights(train_ids)
tensor([0.2594])

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

  • str, optional – Id for the ith datapoint, returned only when self.load_full is True.
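
Because the fifth element is only present when load_full is True, code that consumes datapoints may need to accept both shapes. The following plain-Python sketch shows one way to do that. The helper name unpack_datapoint is hypothetical, and the graph and tensors are replaced by placeholder values so that the snippet runs without DGL.

```python
def unpack_datapoint(datapoint):
    """Split a datapoint into its four fixed fields and an optional id."""
    smiles, graph, labels, masks, *extra = datapoint
    mol_id = extra[0] if extra else None   # None when load_full is False
    return smiles, graph, labels, masks, mol_id

# Placeholder values standing in for a real DGLGraph and label/mask tensors.
short = ('CCO', '<graph>', [1.0], [1.0])
full = ('CCO', '<graph>', [1.0], [1.0], 'BACE_1')

id_short = unpack_datapoint(short)[4]
id_full = unpack_datapoint(full)[4]
```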

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
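
One possible way to use these weights is as the pos_weight argument of torch.nn.BCEWithLogitsLoss, combined with the binary masks so that missing labels do not contribute to the loss. The sketch below assumes a model that outputs one logit per task. The function name masked_bce_loss and the toy tensors are illustrative only.

```python
import torch
from torch.nn import BCEWithLogitsLoss

def masked_bce_loss(logits, labels, masks, pos_weight):
    """Binary cross entropy that ignores entries whose mask is 0."""
    loss_fn = BCEWithLogitsLoss(pos_weight=pos_weight, reduction='none')
    per_entry = loss_fn(logits, labels) * masks      # zero out missing labels
    # Average over labeled entries only; clamp avoids division by zero.
    return per_entry.sum() / masks.sum().clamp(min=1.0)

logits = torch.zeros(2, 1)               # toy predictions, shape (N, T)
labels = torch.tensor([[1.], [0.]])
masks = torch.tensor([[1.], [0.]])       # second datapoint is unlabeled
pos_weight = torch.tensor([2.0])         # e.g. dataset.task_pos_weights(train_ids)
loss = masked_bce_loss(logits, labels, masks, pos_weight)
```

With a logit of 0 the unweighted loss of a positive sample is ln 2, so the weighted loss in this toy case is 2 ln 2.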

BBBP

class dgllife.data.BBBP(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./bbbp_dglgraph.bin', n_jobs=1)[source]

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

BBBP from MoleculeNet for the prediction of permeability properties

Quoting [1], “The Blood–brain barrier penetration (BBBP) dataset comes from a recent study on the modeling and prediction of the barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood–brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in development of drugs targeting central nervous system. This dataset includes binary labels for over 2000 compounds on their permeability properties.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] A Bayesian approach to in silico blood-brain barrier penetration modeling

  • [3] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs, default to ‘bbbp_dglgraph.bin’.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import BBBP
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = BBBP(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
2039
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('[Cl].CC(C)NCC(O)COc1cccc2ccccc12',
 Graph(num_nodes=20, num_edges=40,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([1.]),
 tensor([1.]))

The dataset instance also contains information about compound name.

>>> dataset.names[0]  # name of the 0th compound

We can also get the name along with SMILES, DGLGraph, labels, and masks at once.

>>> dataset.load_full = True
>>> dataset[0]
('[Cl].CC(C)NCC(O)COc1cccc2ccccc12',
 Graph(num_nodes=20, num_edges=40,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([1.]),
 tensor([1.]),
 'Propanolol')

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(500)
>>> dataset.task_pos_weights(train_ids)
tensor([0.7123])

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

  • str, optional – Name for the ith compound, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
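
The examples above use the first 500 datapoints as a stand-in for the training set. A random split is an alternative way to obtain the 1D LongTensor of training indices. The following is a minimal sketch. The function name random_split_indices, the 80/10/10 fractions and the seed are assumptions chosen for illustration.

```python
import torch

def random_split_indices(num_datapoints, frac_train=0.8, frac_val=0.1, seed=0):
    """Return disjoint 1D LongTensors of train, validation and test indices."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_datapoints, generator=generator)  # dtype int64
    num_train = int(frac_train * num_datapoints)
    num_val = int(frac_val * num_datapoints)
    train_ids = perm[:num_train]
    val_ids = perm[num_train:num_train + num_val]
    test_ids = perm[num_train + num_val:]
    return train_ids, val_ids, test_ids

# 2039 is the dataset size reported by len(dataset) in the BBBP example.
train_ids, val_ids, test_ids = random_split_indices(2039)
```

The resulting train_ids can then be passed to dataset.task_pos_weights(train_ids), so that the weights are computed from training data only.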

ToxCast

class dgllife.data.ToxCast(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./toxcast_dglgraph.bin', n_jobs=1)[source]

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

ToxCast from MoleculeNet for the prediction of toxicology data

The Toxicology in the 21st Century (https://tripod.nih.gov/tox21/challenge/) initiative created a data collection providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The dataset includes qualitative results of over 600 experiments on 8615 compounds.

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] ToxCast Chemical Landscape: Paving the Road to 21st Century Toxicology

  • [3] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs, default to ‘toxcast_dglgraph.bin’.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import ToxCast
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = ToxCast(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
8576
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('[O-][N+](=O)C1=CC=C(Cl)C=C1',
 Graph(num_nodes=10, num_edges=20,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([0., ..., 0.]),
 tensor([1., ..., 1.]))

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(500)
>>> dataset.task_pos_weights(train_ids)
tensor([4.0435e+00, ..., 1.7500e+01])

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.
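
Since each datapoint is a (SMILES, graph, labels, masks) tuple, mini-batch training needs a function that merges a list of such tuples. The sketch below shows one possible shape for it. The name collate_molgraphs_sketch is hypothetical, and the graph-merging function is passed in as an argument so that the snippet itself only depends on PyTorch.

```python
import torch

def collate_molgraphs_sketch(samples, batch_graphs):
    """Merge a list of (smiles, graph, labels, masks) datapoints into a batch.

    batch_graphs: callable that merges a list of graphs into one object,
    for example dgl.batch when DGL is installed.
    """
    smiles, graphs, labels, masks = map(list, zip(*samples))
    batched_graph = batch_graphs(graphs)
    labels = torch.stack(labels, dim=0)   # shape (B, T)
    masks = torch.stack(masks, dim=0)     # shape (B, T)
    return smiles, batched_graph, labels, masks

# Stand-in graphs (plain strings) so that the snippet runs without DGL.
samples = [
    ('CCO', 'g0', torch.tensor([0., 1.]), torch.tensor([1., 1.])),
    ('CCN', 'g1', torch.tensor([1., 0.]), torch.tensor([1., 0.])),
]
smiles, bg, labels, masks = collate_molgraphs_sketch(samples, batch_graphs=list)
```

With real datapoints, functools.partial(collate_molgraphs_sketch, batch_graphs=dgl.batch) could serve as the collate_fn of a torch.utils.data.DataLoader.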

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)

SIDER

class dgllife.data.SIDER(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./sider_dglgraph.bin', n_jobs=1)[source]

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

SIDER from MoleculeNet for the prediction of grouped drug side-effects

Quoting [1], “The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side-effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs (following previous usage).”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs, default to ‘sider_dglgraph.bin’.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import SIDER
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = SIDER(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
1427
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('C(CNCCNCCNCCN)N',
 Graph(num_nodes=13, num_edges=24,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0.,
         0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 1., 1., 0.]),
 tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(500)
>>> dataset.task_pos_weights(train_ids)
tensor([ 1.1368,  0.4793, 49.0000,  0.7123,  0.2626,  0.5015,  0.1211,  5.2500,
         0.4205,  1.0325,  3.1667,  0.1312,  3.9505,  5.9444,  0.3263,  0.7544,
         0.0823,  4.9524,  0.3889,  0.3812,  0.4706,  0.6447, 11.5000,  1.4272,
         0.5060,  0.1136,  0.5106])

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)

ClinTox

class dgllife.data.ClinTox(smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./clintox_dglgraph.bin', n_jobs=1)[source]

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

ClinTox from MoleculeNet for the prediction of clinical trial toxicity (or absence of toxicity) and FDA approval status

Quoting [1], “The ClinTox dataset, introduced as part of this work, compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. List of FDA-approved drugs are compiled from the SWEETLEAD database, and list of drugs that failed clinical trials for toxicity reasons are compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.”

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning.

  • [2] DeepChem

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs, default to ‘clintox_dglgraph.bin’.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

Examples

>>> import torch
>>> from dgllife.data import ClinTox
>>> from dgllife.utils import SMILESToBigraph, CanonicalAtomFeaturizer
>>> smiles_to_g = SMILESToBigraph(node_featurizer=CanonicalAtomFeaturizer())
>>> dataset = ClinTox(smiles_to_g)
>>> # Get size of the dataset
>>> len(dataset)
1478
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph, labels, and masks
>>> dataset[0]
('*C(=O)[C@H](CCCCNC(=O)OCCOC)NC(=O)OCCOC',
 Graph(num_nodes=24, num_edges=46,
       ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
       edata_schemes={}),
 tensor([1., 0.]),
 tensor([1., 1.]))

To address the imbalance between positive and negative samples, we can re-weight positive samples for each task based on the training datapoints.

>>> train_ids = torch.arange(500)
>>> dataset.task_pos_weights(train_ids)
tensor([ 0.0684, 10.9048])

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the ith datapoint for all tasks. T for the number of tasks.

  • Tensor of dtype float32 and shape (T) – Binary masks of the ith datapoint indicating the existence of labels for all tasks.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
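
As a worked example of how to read such weights, the values 0.0684 and 10.9048 from the ClinTox example can be converted back into class counts. The sketch assumes that all 500 training datapoints are labeled for both tasks, which the example does not state. The helper name counts_from_pos_weight is hypothetical.

```python
def counts_from_pos_weight(weight, num_labeled):
    """Recover (num_pos, num_neg) from weight = num_neg / num_pos."""
    # num_pos + num_neg = num_labeled, so num_pos = num_labeled / (1 + weight).
    num_pos = round(num_labeled / (1.0 + weight))
    return num_pos, num_labeled - num_pos

first_task_counts = counts_from_pos_weight(0.0684, 500)
second_task_counts = counts_from_pos_weight(10.9048, 500)
```

Under that assumption the first task has far more positives than negatives (weight below 1), while the second task has far fewer (weight above 1).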

Experimental solubility determined at AstraZeneca, extracted from ChEMBL

class dgllife.data.AstraZenecaChEMBLSolubility(smiles_to_graph=<function smiles_to_bigraph>, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, cache_file_path='./AstraZeneca_chembl_solubility_graph.bin', log_of_values=True, n_jobs=1)[source]

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

Experimental solubility determined at AstraZeneca, extracted from ChEMBL

The dataset provides the experimental solubility (in nM) of 1763 molecules, measured at pH 7.4 from solid starting material with the method described in [1].

References

  • [1] A Highly Automated Assay for Determining the Aqueous Equilibrium Solubility of Drug Discovery Compounds

  • [2] CHEMBL3301361

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. Default to dgllife.utils.smiles_to_bigraph().

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • cache_file_path (str) – Path to the cached DGLGraphs. Default to ‘AstraZeneca_chembl_solubility_graph.bin’.

  • log_of_values (bool) – Whether to take the natural logarithm of the solubility values. The raw values lie in the range [100, 1513600]; after taking the logarithm, they lie in the range [4.61, 14.23]. Default to True.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.
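
The ranges given for log_of_values are consistent with a natural logarithm, which the short check below reproduces. It also shows how a value on the log scale could be mapped back to nM. The helper name to_nanomolar is hypothetical.

```python
import math

raw_min, raw_max = 100.0, 1513600.0   # documented range of raw values in nM

log_min = math.log(raw_min)   # natural logarithm
log_max = math.log(raw_max)

def to_nanomolar(log_value):
    """Map a log-scale solubility (label or prediction) back to nM."""
    return math.exp(log_value)

restored = to_nanomolar(log_min)
```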

Examples

>>> from dgllife.data import AstraZenecaChEMBLSolubility
>>> from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer
>>> dataset = AstraZenecaChEMBLSolubility(smiles_to_bigraph, CanonicalAtomFeaturizer())
>>> # Get size of the dataset
>>> len(dataset)
1763
>>> # Get the 0th datapoint, consisting of SMILES, DGLGraph and solubility
>>> dataset[0]
('Cc1nc(C)c(-c2ccc([C@H]3CC[C@H](Cc4nnn[nH]4)CC3)cc2)nc1C(N)=O',
 DGLGraph(num_nodes=29, num_edges=64,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([12.5032]))

We also provide information for the ChEMBL id and molecular weight of the compound.

>>> dataset.chembl_ids[0]  # ChEMBL id of the 0th compound
>>> dataset.mol_weight[0]  # molecular weight of the 0th compound

We can also get the ChEMBL id and molecular weight along with SMILES, DGLGraph and solubility at once.

>>> dataset.load_full = True
>>> dataset[0]
('Cc1nc(C)c(-c2ccc([C@H]3CC[C@H](Cc4nnn[nH]4)CC3)cc2)nc1C(N)=O',
 DGLGraph(num_nodes=29, num_edges=64,
          ndata_schemes={'h': Scheme(shape=(74,), dtype=torch.float32)}
          edata_schemes={}),
 tensor([12.5032]),
 'CHEMBL2178940',
 391.48)

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (1) – Labels of the ith datapoint

  • str, optional – ChEMBL id of the ith datapoint, returned only when self.load_full is True.

  • float, optional – Molecular weight of the ith datapoint, returned only when self.load_full is True.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

Alchemy for Quantum Chemistry

class dgllife.data.TencentAlchemyDataset(mode='dev', mol_to_graph=<function mol_to_complete_graph>, node_featurizer=<function alchemy_nodes>, edge_featurizer=<function alchemy_edges>, load=True)[source]

Developed by the Tencent Quantum Lab, the dataset lists 12 quantum mechanical properties of 130,000+ organic molecules comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

For more details, check the paper.

Parameters
  • mode (str) – ‘dev’, ‘valid’ or ‘test’, for training, validation and test respectively. Default to ‘dev’. Note that ‘test’ is not available as the Alchemy contest is ongoing.

  • mol_to_graph (callable, rdkit.Chem.rdchem.Mol -> DGLGraph) – A function turning an RDKit molecule instance into a DGLGraph. Default to dgllife.utils.mol_to_complete_graph().

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we construct graphs where nodes represent atoms and node features represent atom features. We store the atomic numbers under the name "node_type" and store the atom features under the name "n_feat". The atom features include:

      • One hot encoding for atom types
      • Atomic number of atoms
      • Whether the atom is a donor
      • Whether the atom is an acceptor
      • Whether the atom is aromatic
      • One hot encoding for atom hybridization
      • Total number of Hs on the atom

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we construct edges between every pair of atoms, excluding the self loops. We store the distance between the end atoms under the name "distance" and store the edge features under the name "e_feat". The edge features represent one hot encoding of edge types (bond types and non-bond edges).

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to True.
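
The default edge construction described above (an edge between every pair of atoms, excluding self loops, with the distance between the end atoms stored per edge) can be pictured with a small plain-Python sketch. The function name, the toy 3D coordinates, the use of one directed edge per ordered pair and the Euclidean distance are assumptions made for illustration.

```python
import math
from itertools import permutations

def complete_graph_edges(coords):
    """Return (src, dst, distance) for every ordered pair of distinct atoms."""
    edges = []
    for src, dst in permutations(range(len(coords)), 2):   # no self loops
        edges.append((src, dst, math.dist(coords[src], coords[dst])))
    return edges

# Hypothetical 3D coordinates of three atoms.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
edges = complete_graph_edges(coords)
```

For n atoms this yields n * (n - 1) edges, so the number of edges grows quadratically with the size of the molecule.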

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the datapoint for all tasks.

__len__()[source]

Size for the dataset.

Returns

Size for the dataset.

Return type

int

set_mean_and_std(mean=None, std=None)[source]

Set mean and std or compute from labels for future normalization.

The mean and std can be fetched later with self.mean and self.std.

Parameters
  • mean (float32 tensor of shape (T)) – Mean of labels for all tasks.

  • std (float32 tensor of shape (T)) – Std of labels for all tasks.
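
The stored statistics are typically used to standardize regression labels during training and to map predictions back to the original scale afterwards. The following PyTorch sketch illustrates this idea on toy labels. It does not call the dataset, and which estimator of the standard deviation the library uses is not specified here, so labels.std (the unbiased estimator) is an assumption.

```python
import torch

# Toy labels for 3 molecules and T = 2 tasks.
labels = torch.tensor([[1., 10.], [2., 20.], [3., 30.]])

mean = labels.mean(dim=0)             # per-task mean, shape (T)
std = labels.std(dim=0)               # per-task std, shape (T)

normalized = (labels - mean) / std    # targets used for training
predictions = normalized              # stand-in for model outputs
restored = predictions * std + mean   # back to the original scale
```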

PubChem Aromaticity

class dgllife.data.PubChemBioAssayAromaticity(smiles_to_graph=<function smiles_to_bigraph>, node_featurizer=None, edge_featurizer=None, load=False, log_every=1000, n_jobs=1)[source]

Bases: dgllife.data.csv_dataset.MoleculeCSVDataset

Subset of PubChem BioAssay Dataset for aromaticity prediction.

The dataset was constructed in Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism and is accompanied by the task of predicting the number of aromatic atoms in molecules.

The dataset was constructed by sampling 3945 molecules with 0-40 aromatic atoms from the PubChem BioAssay dataset.

Parameters
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. Default to dgllife.utils.smiles_to_bigraph().

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to pre-process from scratch. Default to False.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

  • n_jobs (int) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

__getitem__(item)

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the datapoint for all tasks

  • Tensor of dtype float32 and shape (T), optional – Binary masks indicating the existence of labels for all tasks. This is only generated when init_mask is True in the initialization.

__len__()

Size for the dataset

Returns

Size for the dataset

Return type

int

Adapting to New Datasets with CSV

class dgllife.data.MoleculeCSVDataset(df, smiles_to_graph=None, node_featurizer=None, edge_featurizer=None, smiles_column=None, cache_file_path=None, task_names=None, load=False, log_every=1000, init_mask=True, n_jobs=1, error_log=None)[source]

This is a general class for loading molecular data from a pandas.DataFrame.

In data pre-processing, we construct a binary mask indicating the existence of labels.

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs can be saved for reloading so that we do not need to reconstruct them every time.

Parameters
  • df (pandas.DataFrame) – Dataframe including smiles and labels. Can be loaded by pandas.read_csv(file_path). One column includes smiles and some other columns include labels.

  • smiles_to_graph (callable, str -> DGLGraph) – A function turning a SMILES string into a DGLGraph. If None, it uses dgllife.utils.SMILESToBigraph() by default.

  • node_featurizer (None or callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph.

  • edge_featurizer (None or callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph.

  • smiles_column (str) – Column name for smiles in df.

  • cache_file_path (str) – Path to store the preprocessed DGLGraphs. For example, this can be 'dglgraph.bin'.

  • task_names (list of str or None, optional) – Columns in the data frame corresponding to real-valued labels. If None, we assume all columns except the smiles_column are labels. Default to None.

  • load (bool, optional) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to False.

  • log_every (int, optional) – Print a message every time log_every molecules are processed. It only takes effect when n_jobs is greater than 1. Default to 1000.

  • init_mask (bool, optional) – Whether to initialize a binary mask indicating the existence of labels. Default to True.

  • n_jobs (int, optional) – The maximum number of concurrently running jobs for graph construction and featurization, using joblib backend. Default to 1.

  • error_log (str, optional) – Path to a CSV file of molecules that RDKit failed to parse. If not specified, the molecules will not be recorded.
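The task_names default and the label mask can be illustrated without the library. The following is a minimal sketch using only the standard library, not the MoleculeCSVDataset implementation; the CSV content, the column names and the helper load_labels_and_masks are hypothetical, and storing a missing label as 0.0 is an assumption carried over from the Tox21 description.

```python
import csv
import io

# Hypothetical CSV content: one SMILES column and two label columns,
# with two missing labels (empty cells).
CSV_TEXT = """smiles,task_a,task_b
CCO,1,0
c1ccccc1,,1
CC(=O)O,0,
"""


def load_labels_and_masks(csv_text, smiles_column, task_names=None):
    """Return (smiles, task_names, labels, masks) parsed from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if task_names is None:
        # Documented default: every column except the SMILES column is a label.
        task_names = [c for c in rows[0].keys() if c != smiles_column]
    smiles, labels, masks = [], [], []
    for row in rows:
        smiles.append(row[smiles_column])
        present = [row[t] != "" for t in task_names]
        # Assumption: a missing label is stored as 0.0 and flagged by mask 0.0.
        labels.append(
            [float(row[t]) if p else 0.0 for t, p in zip(task_names, present)]
        )
        masks.append([1.0 if p else 0.0 for p in present])
    return smiles, task_names, labels, masks


smiles, task_names, labels, masks = load_labels_and_masks(CSV_TEXT, "smiles")
```

Each row of labels and masks has one entry per task, which corresponds to the shape (T) tensors returned for a datapoint.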

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

  • Tensor of dtype float32 and shape (T) – Labels of the datapoint for all tasks

  • Tensor of dtype float32 and shape (T), optional – Binary masks indicating the existence of labels for all tasks. This is only generated when init_mask is True in the initialization.
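The binary mask is typically used to exclude unlabeled entries from the loss. The sketch below shows one way to do that for a single datapoint with plain Python floats; masked_bce is a hypothetical helper, and averaging over the labeled entries is an assumption rather than something the dataset class prescribes.

```python
import math


def masked_bce(logits, labels, masks):
    """Mean binary cross-entropy over the labeled entries only."""
    total, count = 0.0, 0.0
    for z, y, m in zip(logits, labels, masks):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid of the raw score
        loss = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        total += m * loss  # entries with mask 0 contribute nothing
        count += m
    return total / count if count > 0 else 0.0


logits = [0.0, 2.0, -1.0]  # hypothetical model outputs for T = 3 tasks
labels = [1.0, 0.0, 0.0]
masks = [1.0, 0.0, 1.0]    # the second task has no label
loss = masked_bce(logits, labels, masks)
```

Because the second entry is masked out, changing its logit leaves the loss unchanged.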

__len__()[source]

Size for the dataset

Returns

Size for the dataset

Return type

int

task_pos_weights(indices)[source]

Get weights for positive samples on each task

This should only be used when all tasks are binary classification.

It’s quite common that the number of positive samples and the number of negative samples are significantly different for binary classification. To compensate for the class imbalance issue, we can weight each datapoint in loss computation.

In particular, for each task we will set the weight of negative samples to be 1 and the weight of positive samples to be the number of negative samples divided by the number of positive samples.

Parameters

indices (1D LongTensor) – The function will compute the weights on the data subset specified by the indices, e.g. the indices for the training set.

Returns

Weight of positive samples on all tasks

Return type

Tensor of dtype float32 and shape (T)
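The weighting rule described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library code: pos_weights is a hypothetical helper operating on nested lists, only labeled entries are counted, and falling back to a weight of 1.0 when a task has no positive samples is an assumption made to avoid division by zero.

```python
def pos_weights(labels, masks, indices):
    """Per-task weight = (# labeled negatives) / (# labeled positives)."""
    num_tasks = len(labels[0])
    weights = []
    for t in range(num_tasks):
        num_pos = sum(labels[i][t] * masks[i][t] for i in indices)
        num_labeled = sum(masks[i][t] for i in indices)
        num_neg = num_labeled - num_pos
        # Assumption: keep the weight at 1.0 if there are no positives.
        weights.append(num_neg / num_pos if num_pos > 0 else 1.0)
    return weights


# Hypothetical labels and masks for 4 datapoints and 2 tasks.
labels = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
masks = [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
weights = pos_weights(labels, masks, indices=[0, 1, 2, 3])
```

Weights of this kind are commonly passed as the pos_weight argument of a binary cross-entropy loss so that positive samples count more in imbalanced tasks.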

Adapting to New Datasets for Inference

class dgllife.data.UnlabeledSMILES(smiles_list, mol_to_graph=None, node_featurizer=None, edge_featurizer=None, log_every=1000)[source]

Construct a SMILES dataset without labels for inference.

We will (1) filter out invalid SMILES strings and record canonical SMILES strings for the valid ones, and (2) construct a DGLGraph for each valid one and featurize its nodes and edges.

Parameters
  • smiles_list (list of str) – List of SMILES strings

  • mol_to_graph (callable, rdkit.Chem.rdchem.Mol -> DGLGraph) – A function turning an RDKit molecule object into a DGLGraph. Default to dgllife.utils.mol_to_bigraph().

  • node_featurizer (None or callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.

  • edge_featurizer (None or callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.

  • log_every (int) – Print a message every time log_every molecules are processed. Default to 1000.

__getitem__(item)[source]

Get datapoint with index

Parameters

item (int) – Datapoint index

Returns

  • str – SMILES for the ith datapoint

  • DGLGraph – DGLGraph for the ith datapoint

__len__()[source]

Size for the dataset

Returns

Size for the dataset

Return type

int

Reaction Prediction

USPTO

class dgllife.data.USPTOCenter(subset, mol_to_graph=<function mol_to_bigraph>, node_featurizer=<dgllife.utils.featurizers.BaseAtomFeaturizer object>, edge_featurizer=<dgllife.utils.featurizers.BaseBondFeaturizer object>, atom_pair_featurizer=<function default_atom_pair_featurizer>, load=True, num_processes=1, cache=False)[source]

Bases: dgllife.data.uspto.WLNCenterDataset

USPTO dataset for reaction center prediction.

The dataset contains reactions from patents granted by the United States Patent and Trademark Office (USPTO), collected by Lowe [1]. Jin et al. [2] remove duplicates and erroneous reactions, obtaining a set of 480K reactions, which they divide into 400K, 40K, and 40K for training, validation, and test.

References

  • [1] Patent reaction extraction

  • [2] Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network

Parameters
  • subset (str) –

    Whether to use the training/validation/test set as in Jin et al.

    • 'train' for the training set

    • 'val' for the validation set

    • 'test' for the test set

  • mol_to_graph (callable, rdkit.Chem.rdchem.Mol -> DGLGraph) – A function turning RDKit molecule instances into DGLGraphs. Default to dgllife.utils.mol_to_bigraph().

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we consider descriptors including atom type, atom degree, atom explicit valence, atom implicit valence, aromaticity.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we consider descriptors including bond type, whether bond is conjugated and whether bond is in ring.

  • atom_pair_featurizer (callable, str -> dict) – Featurization for each pair of atoms in multiple reactants. The result will be used to update edata in the complete DGLGraphs. By default, the features include the bond type between the atoms (if any) and whether they belong to the same molecule.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to True.

  • num_processes (int) – Number of processes to use for data pre-processing. Default to 1.

  • cache (bool) – If True, construct and featurize all graphs at once.

__getitem__(item)

Get the i-th datapoint.

Returns

  • str – Reaction

  • str – Graph edits for the reaction

  • DGLGraph – DGLGraph for the ith molecular graph

  • DGLGraph – Complete DGLGraph, which will be needed for predicting scores between each pair of atoms

  • float32 tensor of shape (V^2, 10) – Features for each pair of atoms.

  • float32 tensor of shape (V^2, 5) – Labels for reaction center prediction, where V is the number of atoms in the reactants.

__len__()

Get the size for the dataset.

Returns

Number of reactions in the dataset.

Return type

int

class dgllife.data.USPTORank(subset, candidate_bond_path, size_cutoff=100, max_num_changes_per_reaction=5, num_candidate_bond_changes=16, max_num_change_combos_per_reaction=150, num_processes=1)[source]

Bases: dgllife.data.uspto.WLNRankDataset

USPTO dataset for ranking candidate products.

The dataset contains reactions from patents granted by the United States Patent and Trademark Office (USPTO), collected by Lowe [1]. Jin et al. [2] remove duplicates and erroneous reactions, obtaining a set of 480K reactions, which they divide into 400K, 40K, and 40K for training, validation, and test.

References

  • [1] Patent reaction extraction

  • [2] Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network

Parameters
  • subset (str) –

    Whether to use the training/validation/test set as in Jin et al.

    • 'train' for the training set

    • 'val' for the validation set

    • 'test' for the test set

  • candidate_bond_path (str) – Path to the candidate bond changes for product enumeration, where each line is candidate bond changes for a reaction by a WLN for reaction center prediction.

  • size_cutoff (int) – By calling .ignore_large(True), we can optionally ignore reactions whose reactants contain more than size_cutoff atoms. Default to 100.

  • max_num_changes_per_reaction (int) – Maximum number of bond changes per reaction. Default to 5.

  • num_candidate_bond_changes (int) – Number of candidate bond changes to consider for each ground truth reaction. Default to 16.

  • max_num_change_combos_per_reaction (int) – Number of bond change combos to consider for each reaction. Default to 150.

  • num_processes (int) – Number of processes to use for data pre-processing. Default to 1.

__getitem__(item)

Get the i-th datapoint.

Parameters

item (int) – Index for the datapoint.

Returns

  • list of B + 1 DGLGraph – The first entry in the list is the DGLGraph for the reactants and the rest are DGLGraphs for candidate products. Each DGLGraph has edge features in edata['he'] and node features in ndata['hv'].

  • candidate_scores (float32 tensor of shape (B, 1)) – The sum of scores for bond changes in each combo, where B is the number of combos.

  • labels (int64 tensor of shape (1, 1), optional) – Index for the true candidate product, which is always 0 with pre-processing. This is returned only when we are not in the training mode.

  • valid_candidate_combos (list, optional) – valid_candidate_combos[i] gives a list of tuples, which is the i-th valid combo of candidate bond changes for the reaction. Each tuple has the form (atom1, atom2, change_type, score), where atom1 and atom2 are the atom mapping numbers of the two end atoms minus 1. change_type can be 0, 1, 2, 3, or 1.5, respectively for losing a bond and for forming a single, double, triple, or aromatic bond.

  • reactant_mol (rdkit.Chem.rdchem.Mol) – RDKit molecule instance for the reactants

  • real_bond_changes (list of tuples) – Ground truth bond changes in a reaction. Each tuple has the form (atom1, atom2, change_type), where atom1 and atom2 are the atom mapping numbers of the two end atoms minus 1. change_type can be 0, 1, 2, 3, or 1.5, respectively for losing a bond and for forming a single, double, triple, or aromatic bond.

  • product_mol (rdkit.Chem.rdchem.Mol) – RDKit molecule instance for the product
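To make the relation between candidate bond changes, combos and candidate_scores concrete, here is a simplified sketch. It is not the procedure used by the dataset class: enumerate_combos is a hypothetical helper, the candidate tuples are made up, chemical validity checks are omitted, and sorting combos by score before truncating is an assumption.

```python
from itertools import combinations


def enumerate_combos(candidates, max_changes, max_combos):
    """List combos of candidate bond changes with their summed scores."""
    combos = []
    for k in range(1, max_changes + 1):
        for combo in combinations(candidates, k):
            # The score of a combo is the sum of its bond change scores.
            score = sum(change[3] for change in combo)
            combos.append((combo, score))
    # Assumption: keep the highest-scoring combos and drop the rest.
    combos.sort(key=lambda item: item[1], reverse=True)
    return combos[:max_combos]


# Hypothetical candidates of form (atom1, atom2, change_type, score).
candidates = [(0, 1, 1.0, 2.5), (1, 2, 0.0, 1.0), (2, 3, 2.0, -0.5)]
combos = enumerate_combos(candidates, max_changes=2, max_combos=4)
scores = [score for _, score in combos]
```

Here max_changes and max_combos play the roles of max_num_changes_per_reaction and max_num_change_combos_per_reaction, and len(combos) corresponds to B.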

__len__()

Get the size for the dataset.

Returns

Number of reactions in the dataset.

Return type

int

ignore_large(ignore=True)

Whether to ignore reactions where reactants contain too many atoms.

Parameters

ignore (bool) – If True, reactions where reactants contain too many atoms will be ignored.

Adapting to New Datasets for Weisfeiler-Lehman Networks

class dgllife.data.WLNCenterDataset(raw_file_path, mol_graph_path, mol_to_graph=<function mol_to_bigraph>, node_featurizer=<dgllife.utils.featurizers.BaseAtomFeaturizer object>, edge_featurizer=<dgllife.utils.featurizers.BaseBondFeaturizer object>, atom_pair_featurizer=<function default_atom_pair_featurizer>, load=True, num_processes=1, check_reaction_validity=True, reaction_validity_result_prefix='', cache=True, **kwargs)[source]

Dataset for reaction center prediction with WLN

Parameters
  • raw_file_path (str) – Path to the raw reaction file, where each line is the SMILES for a reaction. We will check if raw_file_path + '.proc' exists, where each line has the reaction SMILES and the corresponding graph edits. If not, we will preprocess the raw reaction file.

  • mol_graph_path (str) – Path to save/load DGLGraphs for molecules.

  • mol_to_graph (callable, rdkit.Chem.rdchem.Mol -> DGLGraph) – A function turning RDKit molecule instances into DGLGraphs. Default to dgllife.utils.mol_to_bigraph().

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we consider descriptors including atom type, atom degree, atom explicit valence, atom implicit valence, aromaticity.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we consider descriptors including bond type, whether bond is conjugated and whether bond is in ring.

  • atom_pair_featurizer (callable, str -> dict) – Featurization for each pair of atoms in multiple reactants. The result will be used to update edata in the complete DGLGraphs. By default, the features include the bond type between the atoms (if any) and whether they belong to the same molecule.

  • load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to True.

  • num_processes (int) – Number of processes to use for data pre-processing. Default to 1.

  • check_reaction_validity (bool) – Whether to check the validity of reactions before data pre-processing, which will introduce additional overhead. Default to True.

  • reaction_validity_result_prefix (str or None) – Prefix for saving results for checking validity of reactions. This argument only comes into effect if check_reaction_validity is True, in which case we will save valid reactions in reaction_validity_result_prefix + _valid_reactions.proc and invalid ones in reaction_validity_result_prefix + _invalid_reactions.proc. Default to ''.

  • cache (bool) – If True, construct and featurize all graphs at once.

__getitem__(item)[source]

Get the i-th datapoint.

Returns

  • str – Reaction

  • str – Graph edits for the reaction

  • DGLGraph – DGLGraph for the ith molecular graph

  • DGLGraph – Complete DGLGraph, which will be needed for predicting scores between each pair of atoms

  • float32 tensor of shape (V^2, 10) – Features for each pair of atoms.

  • float32 tensor of shape (V^2, 5) – Labels for reaction center prediction, where V is the number of atoms in the reactants.
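The (V^2, 5) label shape can be read as one row per ordered atom pair and one column per bond change type. The sketch below builds such labels as nested lists under several assumptions that are not stated in this reference: graph edits are written as 'atom1-atom2-change' entries separated by ';', atom numbers in the edits are 1-based atom mapping numbers, rows are laid out in row-major order (row i * V + j for pair (i, j)), and both orderings of a pair are labeled. pair_labels is a hypothetical helper.

```python
# Assumed column order for the five bond change types.
CHANGE_TO_CHANNEL = {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 1.5: 4}


def pair_labels(num_atoms, graph_edits):
    """Build a (V^2, 5) nested list of 0/1 labels from an edits string."""
    labels = [[0.0] * 5 for _ in range(num_atoms * num_atoms)]
    for edit in graph_edits.split(";"):
        a1, a2, change = edit.split("-")
        i, j = int(a1) - 1, int(a2) - 1  # convert to 0-based atom indices
        channel = CHANGE_TO_CHANNEL[float(change)]
        # Label both orderings of the atom pair.
        labels[i * num_atoms + j][channel] = 1.0
        labels[j * num_atoms + i][channel] = 1.0
    return labels


# Hypothetical edits: form a single bond between atoms 1 and 2,
# and lose the bond between atoms 2 and 3.
labels = pair_labels(3, "1-2-1.0;2-3-0.0")
```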

__len__()[source]

Get the size for the dataset.

Returns

Number of reactions in the dataset.

Return type

int

class dgllife.data.WLNRankDataset(path_to_reaction_file, candidate_bond_path, mode, node_featurizer=<dgllife.utils.featurizers.BaseAtomFeaturizer object>, edge_featurizer=<dgllife.utils.featurizers.BaseBondFeaturizer object>, size_cutoff=100, max_num_changes_per_reaction=5, num_candidate_bond_changes=16, max_num_change_combos_per_reaction=150, num_processes=1)[source]

Dataset for ranking candidate products with WLN

Parameters
  • path_to_reaction_file (str) – Path to the processed reaction files, where each line has the reaction SMILES and the corresponding graph edits.

  • candidate_bond_path (str) – Path to the candidate bond changes for product enumeration, where each line is candidate bond changes for a reaction by a WLN for reaction center prediction.

  • mode (str) – 'train', 'val', or 'test', indicating whether the dataset is used for training, validation or test.

  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we consider descriptors including atom type, atom formal charge, atom degree, atom explicit valence, atom implicit valence, aromaticity.

  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we consider descriptors including bond type and whether bond is in ring.

  • size_cutoff (int) – By calling .ignore_large(True), we can optionally ignore reactions whose reactants contain more than size_cutoff atoms. Default to 100.

  • max_num_changes_per_reaction (int) – Maximum number of bond changes per reaction. Default to 5.

  • num_candidate_bond_changes (int) – Number of candidate bond changes to consider for each ground truth reaction. Default to 16.

  • max_num_change_combos_per_reaction (int) – Number of bond change combos to consider for each reaction. Default to 150.

  • num_processes (int) – Number of processes to use for data pre-processing. Default to 1.

__getitem__(item)[source]

Get the i-th datapoint.

Parameters

item (int) – Index for the datapoint.

Returns

  • list of B + 1 DGLGraph – The first entry in the list is the DGLGraph for the reactants and the rest are DGLGraphs for candidate products. Each DGLGraph has edge features in edata['he'] and node features in ndata['hv'].

  • candidate_scores (float32 tensor of shape (B, 1)) – The sum of scores for bond changes in each combo, where B is the number of combos.

  • labels (int64 tensor of shape (1, 1), optional) – Index for the true candidate product, which is always 0 with pre-processing. This is returned only when we are not in the training mode.

  • valid_candidate_combos (list, optional) – valid_candidate_combos[i] gives a list of tuples, which is the i-th valid combo of candidate bond changes for the reaction. Each tuple has the form (atom1, atom2, change_type, score), where atom1 and atom2 are the atom mapping numbers of the two end atoms minus 1. change_type can be 0, 1, 2, 3, or 1.5, respectively for losing a bond and for forming a single, double, triple, or aromatic bond.

  • reactant_mol (rdkit.Chem.rdchem.Mol) – RDKit molecule instance for the reactants

  • real_bond_changes (list of tuples) – Ground truth bond changes in a reaction. Each tuple has the form (atom1, atom2, change_type), where atom1 and atom2 are the atom mapping numbers of the two end atoms minus 1. change_type can be 0, 1, 2, 3, or 1.5, respectively for losing a bond and for forming a single, double, triple, or aromatic bond.

  • product_mol (rdkit.Chem.rdchem.Mol) – RDKit molecule instance for the product
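The tuple formats described above can be decoded with a small lookup table. This is an illustrative sketch only; describe_change and the wording of the descriptions are hypothetical, and the example tuples are made up.

```python
# Meaning of each change_type value, as listed in the Returns section.
CHANGE_TYPES = {
    0.0: "lose bond",
    1.0: "form single bond",
    2.0: "form double bond",
    3.0: "form triple bond",
    1.5: "form aromatic bond",
}


def describe_change(change):
    """Describe a (atom1, atom2, change_type[, score]) tuple in words."""
    atom1, atom2, change_type = change[:3]
    # Stored indices are atom mapping numbers minus 1, so add 1 back.
    return "atom maps {}-{}: {}".format(
        atom1 + 1, atom2 + 1, CHANGE_TYPES[float(change_type)]
    )


real_bond_changes = [(0, 4, 1.5), (2, 3, 0)]  # hypothetical ground truth
descriptions = [describe_change(change) for change in real_bond_changes]
```

The same helper accepts the four-element tuples in valid_candidate_combos, since it only reads the first three fields.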

__len__()[source]

Get the size for the dataset.

Returns

Number of reactions in the dataset.

Return type

int

ignore_large(ignore=True)[source]

Whether to ignore reactions where reactants contain too many atoms.

Parameters

ignore (bool) – If True, reactions where reactants contain too many atoms will be ignored.

Protein-Ligand Binding Affinity Prediction

PDBBind

class dgllife.data.PDBBind(subset, pdb_version='v2015', load_binding_pocket=True, remove_coreset_from_refinedset=True, sanitize=False, calc_charges=False, remove_hs=False, use_conformation=True, construct_graph_and_featurize=<function ACNN_graph_construction_and_featurization>, zero_padding=True, num_processes=None, local_path=None)[source]

PDBBind dataset processed by MoleculeNet.

The description below is mainly based on [1]. The PDBBind database consists of experimentally measured binding affinities for bio-molecular complexes [2], [3]. It provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the protein-ligand complexes permits structure-based featurization that is aware of the protein-ligand binding geometry. The authors of [1] use the “refined” and “core” subsets of the database [4], more carefully processed for data artifacts, as additional benchmarking targets.

References

  • [1] MoleculeNet: A Benchmark for Molecular Machine Learning

  • [2] The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures

  • [3] The PDBbind database: methodologies and updates

  • [4] PDB-wide collection of binding data: current status of the PDBbind database

Parameters
  • subset (str) – In MoleculeNet, we can use either the “refined” subset or the “core” subset. We can retrieve them by setting subset to be 'refined' or 'core'. The size of the 'core' set is 195 and the size of the 'refined' set is 3706.

  • pdb_version (str) – The version of PDBBind dataset. Currently implemented: 'v2007', 'v2015'. Default to 'v2015'. User should not specify the version if using local PDBBind data.

  • load_binding_pocket (bool) – Whether to load binding pockets or full proteins. Default to True.

  • remove_coreset_from_refinedset (bool) – Whether to remove the core set from the refined set when training with the refined set and testing with the core set. Default to True.

  • sanitize (bool) – Whether sanitization is performed in initializing RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization. Default to False.

  • calc_charges (bool) – Whether to add Gasteiger charges via RDKit. Setting this to be True will enforce sanitize to be True. Default to False.

  • remove_hs (bool) – Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite slow for large molecules. Default to False.

  • use_conformation (bool) – Whether we need to extract molecular conformation from proteins and ligands. Default to True.

  • construct_graph_and_featurize (callable) – Construct a DGLGraph for the use of GNNs. Mapping self.ligand_mols[i], self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i] to a DGLGraph. Default to dgllife.utils.ACNN_graph_construction_and_featurization().

  • zero_padding (bool) – Whether to perform zero padding. While DGL does not necessarily require zero padding, pooling operations for variable length inputs can introduce stochastic behaviour, which is not desired for sensitive scenarios. Default to True.

  • num_processes (int or None) – Number of worker processes to use. If None, then we will use the number of CPUs in the system. Default to None.

  • local_path (str or None) – Local path of an existing PDBBind dataset. Default to None, in which case the PDBBind dataset will be downloaded from the DGL database. Set this argument to the local path of a customized dataset, which should follow the structure and the naming format of PDBBind v2015.
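The purpose of remove_coreset_from_refinedset is to keep test complexes out of the training data. A minimal sketch of that idea follows, assuming the two subsets can overlap; split_refined_and_core is a hypothetical helper and the PDB IDs are made up.

```python
def split_refined_and_core(refined_ids, core_ids, remove_core=True):
    """Return (train_ids, test_ids) for refined-set training, core-set testing."""
    core = set(core_ids)
    if remove_core:
        # Drop every refined-set entry that also appears in the core set.
        train_ids = [pdb_id for pdb_id in refined_ids if pdb_id not in core]
    else:
        train_ids = list(refined_ids)
    return train_ids, list(core_ids)


refined_ids = ["1abc", "2def", "3ghi", "4jkl"]  # hypothetical PDB IDs
core_ids = ["2def", "4jkl"]
train_ids, test_ids = split_refined_and_core(refined_ids, core_ids)
```

With remove_core=False the overlapping complexes stay in the training list, so evaluation on the core set would include complexes seen during training.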

__getitem__(item)[source]

Get the datapoint associated with the index.

Parameters

item (int) – Index for the datapoint.

Returns

  • int – Index for the datapoint.

  • rdkit.Chem.rdchem.Mol – RDKit molecule instance for the ligand molecule.

  • rdkit.Chem.rdchem.Mol – RDKit molecule instance for the protein molecule.

  • DGLGraph or tuple of DGLGraphs – Pre-processed DGLGraph with features extracted. For ACNN, a single DGLGraph; for PotentialNet, a tuple of DGLGraphs that consists of a molecular graph and a KNN graph of the complex.

  • Float32 tensor – Label for the datapoint.

__len__()[source]

Get the size of the dataset.

Returns

Number of valid ligand-protein pairs in the dataset.

Return type

int