How to use CLEP
Sample Scoring
There are four main ways to score the patient-feature pairs:
Linear model fitting using Limma
ssGSEA
Z-Score
Radical Searching (eCDF based)
To carry out sample scoring, use:
$ clep sample-scoring radical-search --data <DATA_FILE> --design <DESIGN_FILE> \
--control Control --threshold 2.5 --control_based --ret_summary --out <OUTPUT_DIR>
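For intuition about the Z-Score option listed above, the following is a minimal sketch of scoring one feature against control samples with a cutoff. It is not CLEP's implementation: the function name `z_score_feature`, the toy values, and the convention of mapping each sample to 1, -1 or 0 are assumptions made purely for illustration.

```python
from statistics import mean, stdev


def z_score_feature(values, controls, threshold=2.5):
    """Score each sample of one feature against the control samples.

    Returns 1 (above the cutoff), -1 (below it) or 0 (within it) per sample.
    """
    mu = mean(controls)
    sigma = stdev(controls)  # sample standard deviation; needs >= 2 controls
    scores = {}
    for sample, value in values.items():
        z = (value - mu) / sigma
        if z > threshold:
            scores[sample] = 1
        elif z < -threshold:
            scores[sample] = -1
        else:
            scores[sample] = 0
    return scores


# Hypothetical expression values for a single feature (not taken from CLEP)
controls = [1.0, 1.2, 0.8, 1.1, 0.9]
patients = {"Sample_1": 3.0, "Sample_2": 1.05, "Sample_3": 0.2}
scores = z_score_feature(patients, controls)
print(scores)
```

Note that a control group with a single sample has no defined standard deviation, so a scheme like this needs at least two control samples.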
Data Format
The format of a standard data file should look like this:
| | Sample_1 | Sample_2 | Sample_3 |
|---|---|---|---|
| HGNC_ID_1 | 0.354 | 2.568 | 1.564 |
| HGNC_ID_2 | 1.255 | 1.232 | 0.26452 |
| HGNC_ID_3 | 3.256 | 1.5 | 1.5462 |
The format of a design file for the data given above should look like this:
| FileName | Target |
|---|---|
| Sample_1 | Abnormal |
| Sample_2 | Abnormal |
| Sample_3 | Control |
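Because the design file refers to the columns of the data file by name, it can help to check that the two agree before scoring. The sketch below writes the two example tables to disk and compares them using only the standard library. It assumes tab-separated files (the API example in this document reads its input with `pd.read_table`); the file names `data.tsv` and `design.tsv` and the helper functions are placeholders, not part of CLEP.

```python
import csv
import os
import tempfile

# Example tables from this section; the first data column holds the HGNC IDs.
data_rows = [
    ["", "Sample_1", "Sample_2", "Sample_3"],
    ["HGNC_ID_1", "0.354", "2.568", "1.564"],
    ["HGNC_ID_2", "1.255", "1.232", "0.26452"],
    ["HGNC_ID_3", "3.256", "1.5", "1.5462"],
]
design_rows = [
    ["FileName", "Target"],
    ["Sample_1", "Abnormal"],
    ["Sample_2", "Abnormal"],
    ["Sample_3", "Control"],
]


def write_tsv(path, rows):
    """Write rows to a tab-separated file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)


def read_tsv(path):
    """Read a tab-separated file into a list of rows."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


out_dir = tempfile.mkdtemp()
data_path = os.path.join(out_dir, "data.tsv")      # placeholder file name
design_path = os.path.join(out_dir, "design.tsv")  # placeholder file name
write_tsv(data_path, data_rows)
write_tsv(design_path, design_rows)

data_samples = read_tsv(data_path)[0][1:]   # header without the index column
design = dict(read_tsv(design_path)[1:])    # FileName -> Target
missing = [s for s in data_samples if s not in design]
labels = sorted(set(design.values()))
print("samples without a design entry:", missing)
print("targets:", labels)
```

The `Control` target found here is the value that would be passed to the `--control` option in the command above.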
Knowledge Graph Generation
A patient-feature knowledge graph (KG) can be generated using three methods:
Based on pathway overlaps (needs ssGSEA as the scoring function)
Based on a user-provided knowledge graph
Based on the overlap of multiple user-provided knowledge graphs (needs either ssGSEA, if each KG represents a distinct pathway, or any other appropriate third-party scoring function)
To carry out KG generation, use:
$ clep embedding generate-network --data <SCORED_DATA_FILE> --method interaction_network \
--ret_summary --out <OUTPUT_DIR>
Data Format
A knowledge graph file for the data given above should be a modified edgelist, as shown below:
| Source | Relation | Target |
|---|---|---|
| HGNC_ID_1 | association | HGNC_ID_2 |
| HGNC_ID_2 | decreases | HGNC_ID_3 |
| HGNC_ID_3 | increases | HGNC_ID_1 |
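An edgelist in this shape maps naturally onto (source, relation, target) triples. As a minimal sketch, the example rows can be parsed and grouped by source node with the standard library; the tab delimiter and the variable names are assumptions, and the snippet does not use CLEP itself.

```python
import csv
import io
from collections import defaultdict

# The example edgelist from above, assumed here to be tab-separated.
edgelist_text = (
    "Source\tRelation\tTarget\n"
    "HGNC_ID_1\tassociation\tHGNC_ID_2\n"
    "HGNC_ID_2\tdecreases\tHGNC_ID_3\n"
    "HGNC_ID_3\tincreases\tHGNC_ID_1\n"
)

reader = csv.DictReader(io.StringIO(edgelist_text), delimiter="\t")
triples = [(row["Source"], row["Relation"], row["Target"]) for row in reader]

# Group outgoing edges by their source node.
outgoing = defaultdict(list)
for source, relation, target in triples:
    outgoing[source].append((relation, target))

nodes = sorted({node for s, _, t in triples for node in (s, t)})
relations = sorted({r for _, r, _ in triples})
print(len(triples), "triples,", len(nodes), "nodes,", len(relations), "relation types")
```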
Knowledge Graph Embedding
To generate an embedding, use:
$ clep embedding kge --data <NETWORK_FILE> --design <DESIGN_FILE> \
--model_config <MODEL_CONFIG.json> --train_size 0.8 --validation_size 0.1 --out <OUTPUT_DIR>
Data Format
The config file for the KGE model must contain the model name and the other optimization parameters, as shown in the template below:
{
  "model": "RotatE",
  "model_kwargs": {
    "automatic_memory_optimization": true
  },
  "model_kwargs_ranges": {
    "embedding_dim": {
      "type": "int",
      "low": 6,
      "high": 9,
      "scale": "power_two"
    }
  },
  "training_loop": "slcwa",
  "optimizer": "adam",
  "optimizer_kwargs": {
    "weight_decay": 0.0
  },
  "optimizer_kwargs_ranges": {
    "lr": {
      "type": "float",
      "low": 0.0001,
      "high": 1.0,
      "scale": "log"
    }
  },
  "loss_function": "NSSALoss",
  "loss_kwargs": {},
  "loss_kwargs_ranges": {
    "margin": {
      "type": "float",
      "low": 1,
      "high": 30,
      "q": 2.0
    },
    "adversarial_temperature": {
      "type": "float",
      "low": 0.1,
      "high": 1.0,
      "q": 0.1
    }
  },
  "regularizer": "NoRegularizer",
  "regularizer_kwargs": {},
  "regularizer_kwargs_ranges": {},
  "negative_sampler": "BasicNegativeSampler",
  "negative_sampler_kwargs": {},
  "negative_sampler_kwargs_ranges": {
    "num_negs_per_pos": {
      "type": "int",
      "low": 1,
      "high": 50,
      "q": 1
    }
  },
  "create_inverse_triples": false,
  "evaluator": "RankBasedEvaluator",
  "evaluator_kwargs": {
    "filtered": true
  },
  "evaluation_kwargs": {
    "batch_size": null
  },
  "training_kwargs": {
    "num_epochs": 1000,
    "label_smoothing": 0.0
  },
  "training_kwargs_ranges": {
    "batch_size": {
      "type": "int",
      "low": 8,
      "high": 11,
      "scale": "power_two"
    }
  },
  "stopper": "early",
  "stopper_kwargs": {
    "frequency": 25,
    "patience": 4,
    "delta": 0.002
  },
  "n_trials": 100,
  "timeout": 129600,
  "metric": "hits@10",
  "direction": "maximize",
  "sampler": "random",
  "pruner": "nop"
}
For more details on the configuration, see the PyKEEN documentation.
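The `*_kwargs_ranges` entries in the template describe a search space for hyperparameter optimization rather than fixed values. As a rough reading aid, the sketch below expands two of them. It assumes that `"scale": "power_two"` means the value is 2 raised to an integer between `low` and `high` (inclusive), and that `timeout` is given in seconds; both assumptions should be checked against the PyKEEN documentation for the installed version. The helper `power_two_choices` is hypothetical and not part of PyKEEN or CLEP.

```python
import json

# A fragment of the template above, kept small for illustration.
config = json.loads("""
{
  "model_kwargs_ranges": {
    "embedding_dim": {"type": "int", "low": 6, "high": 9, "scale": "power_two"}
  },
  "training_kwargs_ranges": {
    "batch_size": {"type": "int", "low": 8, "high": 11, "scale": "power_two"}
  },
  "timeout": 129600
}
""")


def power_two_choices(spec):
    """Hypothetical helper: list 2**low .. 2**high for a power_two range."""
    if spec.get("scale") != "power_two":
        raise ValueError("expected a range with scale 'power_two'")
    return [2 ** exponent for exponent in range(spec["low"], spec["high"] + 1)]


embedding_dims = power_two_choices(config["model_kwargs_ranges"]["embedding_dim"])
batch_sizes = power_two_choices(config["training_kwargs_ranges"]["batch_size"])
timeout_hours = config["timeout"] / 3600  # assuming the timeout is in seconds

print(embedding_dims)  # [64, 128, 256, 512]
print(batch_sizes)     # [256, 512, 1024, 2048]
print(timeout_hours)   # 36.0
```

Under these assumptions, the template searches embedding dimensions from 64 to 512 and batch sizes from 256 to 2048, and stops the search after 100 trials or 36 hours.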
Classification
The provided data can be classified using any of five machine learning models:
Logistic regression with L2 regularization
Logistic regression with elastic net regularization
Support Vector Machines
Random forest
Gradient boosting
One of the following optimizers must also be specified for the classification:
Grid search
Random search
Bayesian search
To carry out the classification, use:
$ clep classify --data <EMBEDDING_FILE> --model elastic_net --optimizer grid_search \
--out <OUTPUT_DIR>
Data Format
The format of the input file for classification should look like this:
| | Component_1 | Component_2 | Component_3 | label |
|---|---|---|---|---|
| Sample_1 | 0.48687 | -1.5675 | 1.74140 | 0 |
| Sample_2 | -1.48840 | 5.26354 | -0.4435 | 1 |
| Sample_3 | -0.41461 | 4.6261 | 8.104 | 0 |
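Before running the classifier, it can be useful to confirm that the input has one row per sample, numeric components, and a `label` column. The following is a minimal sketch based on the example table. It assumes a tab-separated layout with the sample names in the first column and the labels in the last one, and it does not call CLEP.

```python
import csv
import io

# The example classification input from above, assumed here to be tab-separated.
table_text = (
    "\tComponent_1\tComponent_2\tComponent_3\tlabel\n"
    "Sample_1\t0.48687\t-1.5675\t1.74140\t0\n"
    "Sample_2\t-1.48840\t5.26354\t-0.4435\t1\n"
    "Sample_3\t-0.41461\t4.6261\t8.104\t0\n"
)

rows = list(csv.reader(io.StringIO(table_text), delimiter="\t"))
header, body = rows[0], rows[1:]
if header[-1] != "label":
    raise ValueError("the last column is expected to hold the labels")

# Separate the numeric components from the class label of each sample.
features = {row[0]: [float(value) for value in row[1:-1]] for row in body}
labels = {row[0]: int(row[-1]) for row in body}

# Count how many samples fall into each class.
class_counts = {}
for label in labels.values():
    class_counts[label] = class_counts.get(label, 0) + 1

print(features["Sample_2"])
print(class_counts)
```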
For more information on the command line interface, please refer to the Command Line Interface.
Programmatic Access
CLEP implements an API through which developers can use each module available in the CLEP framework. An example of how to use the API functions is shown below.
import os

import pandas as pd
from clep.classification import do_classification

data = "<EMBEDDING_FILE>"  # Path to the tab-separated input file (placeholder)
model = "elastic_net"  # Classification model
optimizer = "grid_search"  # Optimization function for the classification model
out = os.getcwd()  # Output directory
cv = 10  # Number of cross-validation folds
metrics = ['roc_auc', 'accuracy', 'f1_micro', 'f1_macro', 'f1']  # Metrics to be analysed in cross-validation
randomize = False  # Whether the labels in the data should be permuted

# Load the input table, using the first column (sample names) as the index
data_df = pd.read_table(data, index_col=0)
results = do_classification(data_df, model, optimizer, out, cv, metrics, randomize)
For more information on the available API functions, please refer to the Developmental Guide.