dgllife.utils.analyze_mols

dgllife.utils.analyze_mols(smiles=None, mols=None, num_processes=1, path_to_export=None)

Analyze a collection of molecules.

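The analysis will 1) filter out invalid molecules and record the valid ones; 2) record the number of molecules having each particular descriptor/element (e.g. single bond). The descriptors/elements considered, as reflected in the summary keys in the Examples section, include atom type, degree, total degree, explicit valence, implicit valence, hybridization, total number of hydrogens, formal charge, number of radical electrons, aromaticity, chirality tag, bond type, bond conjugation, bond stereo configuration and bond direction.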

If path_to_export is not None, we will export the analysis results to the following files in path_to_export:

valid_canonical_smiles.txt: A file of canonical SMILES for valid molecules

summary.txt: A file of all analysis results; see the Examples section for more details. For each item in the summary, we either compute the mean/std of its values or count the number of molecules in which each value appears.
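
The two ways of summarizing values mentioned for summary.txt can be illustrated with a minimal pure-Python sketch. It does not call dgllife: the per-molecule atom counts are copied from the Examples section, the element sets are read off the two valid SMILES by hand, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used.

```python
from collections import Counter
from statistics import mean, pstdev

# Per-molecule atom counts, copied from the Examples section.
num_atoms = [3, 23]

# Mode 1: summarize a numeric per-molecule value by its mean and std.
# Assumption: population std; the library may use a different estimator.
atoms_mean = mean(num_atoms)
atoms_std = pstdev(num_atoms)

# Elements present in each valid molecule (derived by hand from the SMILES).
elements_per_mol = [
    {"C", "O"},            # CCO
    {"C", "N", "O", "S"},  # second SMILES in the Examples section
]

# Mode 2: count how many molecules contain each value.
# Using a set per molecule means each molecule counts at most once per value.
atom_type_frequency = Counter()
for elements in elements_per_mol:
    atom_type_frequency.update(elements)

print(atoms_mean, atoms_std)
print(dict(atom_type_frequency))
```

The resulting counts agree with the atom_type_frequency entry in the Examples section, which is why carbon is reported as 2 (two molecules) rather than as a total atom count.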

Parameters

smiles (list of str, optional) – SMILES strings for a collection of molecules. Can be omitted if mols is not None. (Default: None)
mols (list of rdkit.Chem.rdchem.Mol objects, optional) – RDKit molecule instances for a collection of molecules. Can be omitted if smiles is not None. (Default: None)
num_processes (int, optional) – Number of processes for data analysis. (Default: 1)
path_to_export (str, optional) – The directory to export analysis results. If not None, we will export the analysis results to local files in the specified directory. (Default: None)
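
To give a sense of what exporting to a directory involves, here is a minimal sketch that writes canonical SMILES to valid_canonical_smiles.txt inside a directory and reads them back. It does not call dgllife. The helper name export_canonical_smiles and the one-SMILES-per-line layout are assumptions, since the source only names the file and not its exact format.

```python
import tempfile
from pathlib import Path


def export_canonical_smiles(cano_smi, path_to_export):
    """Hypothetical helper: write one canonical SMILES per line (assumed layout)."""
    out_dir = Path(path_to_export)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "valid_canonical_smiles.txt"
    out_file.write_text("\n".join(cano_smi) + "\n")
    return out_file


# Canonical SMILES copied from the Examples section.
cano_smi = [
    "CCO",
    "CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@H]1C(=O)O",
]

# Write to a temporary directory and read the file back before cleanup.
with tempfile.TemporaryDirectory() as tmp_dir:
    exported = export_canonical_smiles(cano_smi, tmp_dir)
    reloaded = exported.read_text().splitlines()
```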

Returns

Summary of the analysis results. For more details, see the Examples section.

Return type

dict

Examples

>>> from dgllife.utils import analyze_mols

>>> smiles = ['CCO', 'CC1([C@@H](N2[C@H](S1)[C@@H](C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C', '1']
>>> # Analyze the molecules and save the results to the current directory
>>> results = analyze_mols(smiles, path_to_export='.')
>>> results
{'num_atoms': [3, 23],                    # Number of atoms in each molecule
 'num_bonds': [2, 25],                    # Number of bonds in each molecule
 'num_rings': [0, 3],                     # Number of rings in each molecule
 'num_input_mols': 3,                     # Number of input molecules
 'num_valid_mols': 2,                     # Number of valid molecules
 'valid_proportion': 0.6666666666666666,  # Proportion of valid molecules
 'cano_smi': ['CCO',                      # Canonical SMILES for valid molecules
 'CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@H]1C(=O)O'],
 # The following items give the number of molecules in which each descriptor value appears
 'atom_type_frequency': {'O': 2, 'C': 2, 'N': 1, 'S': 1},
 'degree_frequency': {1: 2, 2: 2, 3: 1, 4: 1},
 'total_degree_frequency': {2: 2, 4: 2, 1: 1, 3: 1},
 'explicit_valence_frequency': {1: 2, 2: 2, 3: 1, 4: 1},
 'implicit_valence_frequency': {1: 2, 2: 2, 3: 2, 0: 1},
 'hybridization_frequency': {'SP3': 2, 'SP2': 1},
 'total_num_h_frequency': {1: 2, 2: 2, 3: 2, 0: 1},
 'formal_charge_frequency': {0: 2},
 'num_radical_electrons_frequency': {0: 2},
 'aromatic_atom_frequency': {False: 2, True: 1},
 'chirality_tag_frequency': {'CHI_UNSPECIFIED': 2,
 'CHI_TETRAHEDRAL_CCW': 1,
 'CHI_TETRAHEDRAL_CW': 1},
 'bond_type_frequency': {'SINGLE': 2, 'DOUBLE': 1, 'AROMATIC': 1},
 'conjugated_bond_frequency': {False: 2, True: 1},
 'bond_stereo_configuration_frequency': {'STEREONONE': 2},
 'bond_direction_frequency': {'NONE': 2}}
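
The returned dict can be post-processed with plain Python. The sketch below hardcodes a few entries from the output above instead of calling analyze_mols, so it runs without dgllife or RDKit, and converts the molecule counts into fractions of valid molecules. The helper name to_fractions is hypothetical and not part of dgllife.

```python
# A subset of the results shown above, hardcoded so the sketch runs
# without dgllife or RDKit installed.
results = {
    "num_valid_mols": 2,
    "atom_type_frequency": {"O": 2, "C": 2, "N": 1, "S": 1},
    "bond_type_frequency": {"SINGLE": 2, "DOUBLE": 1, "AROMATIC": 1},
}


def to_fractions(frequency, num_valid_mols):
    """Hypothetical helper: convert molecule counts into fractions of valid molecules."""
    return {value: count / num_valid_mols for value, count in frequency.items()}


atom_fractions = to_fractions(
    results["atom_type_frequency"], results["num_valid_mols"]
)
bond_fractions = to_fractions(
    results["bond_type_frequency"], results["num_valid_mols"]
)

print(atom_fractions)
print(bond_fractions)
```

Dividing by num_valid_mols rather than num_input_mols keeps the fractions relative to the molecules that were actually analyzed.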