How to use CLEP

Sample Scoring

There are 4 main ways to score the patient-feature pairs:

  1. Linear model fitting using Limma

  2. ssGSEA

  3. Z-Score

  4. Radical Searching (eCDF-based)

To carry out sample scoring, use:

$ clep sample-scoring radical-search --data <DATA_FILE> --design <DESIGN_FILE> \
--control Control --threshold 2.5 --control_based --ret_summary --out <OUTPUT_DIR>
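
The --threshold and --control_based flags above tie each patient's score to the control cohort. As a rough, self-contained illustration of that idea (closest to method 3, the Z-Score; this is not CLEP's exact implementation, and it assumes a reasonably sized control cohort), the sketch below flags patient-feature pairs whose Z-score against the controls exceeds the threshold:

import numpy as np
import pandas as pd


def control_based_z_scores(data: pd.DataFrame, design: pd.DataFrame,
                           control_label: str = 'Control',
                           threshold: float = 2.5) -> pd.DataFrame:
    """Flag extreme patient-feature pairs relative to the control cohort."""
    # Split the samples into controls and patients using the design table.
    controls = design.loc[design['Target'] == control_label, 'FileName']
    patients = design.loc[design['Target'] != control_label, 'FileName']

    # Per-feature mean and standard deviation of the control samples.
    mu = data[controls].mean(axis=1)
    sigma = data[controls].std(axis=1)

    # Z-score every patient against the controls; keep only extreme
    # deviations (|z| >= threshold) as -1/0/+1 flags per feature.
    z = data[patients].sub(mu, axis=0).div(sigma, axis=0)
    return np.sign(z.where(z.abs() >= threshold, 0.0)).astype(int)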

Data Format

The format of a standard data file should look as follows:

            Sample_1   Sample_2   Sample_3
HGNC_ID_1   0.354      2.568      1.564
HGNC_ID_2   1.255      1.232      0.26452
HGNC_ID_3   3.256      1.5        1.5462

The format of a design file for the data given above should look as follows:

FileName   Target
Sample_1   Abnormal
Sample_2   Abnormal
Sample_3   Control
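
Both files are plain tables and can be produced with pandas. The snippet below recreates the two examples above; the file names are illustrative, and tab-separated output is assumed (matching the pd.read_table call under Programmatic Access):

import pandas as pd

# Expression matrix: one row per feature (HGNC ID), one column per sample.
data = pd.DataFrame(
    {
        'Sample_1': [0.354, 1.255, 3.256],
        'Sample_2': [2.568, 1.232, 1.5],
        'Sample_3': [1.564, 0.26452, 1.5462],
    },
    index=['HGNC_ID_1', 'HGNC_ID_2', 'HGNC_ID_3'],
)

# Design table: maps each sample to its group label.
design = pd.DataFrame({
    'FileName': ['Sample_1', 'Sample_2', 'Sample_3'],
    'Target': ['Abnormal', 'Abnormal', 'Control'],
})

data.to_csv('data.tsv', sep='\t')
design.to_csv('design.tsv', sep='\t', index=False)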

Knowledge Graph Generation

A patient-feature knowledge graph (KG) can be generated using 3 methods:

  1. Based on pathway overlaps (requires ssGSEA as the scoring function)

  2. Based on a user-provided knowledge graph

  3. Based on the overlap of multiple user-provided knowledge graphs (requires either ssGSEA, if each KG represents a distinct pathway, or another appropriate third-party scoring function)

To carry out KG generation, use:

$ clep embedding generate-network --data <SCORED_DATA_FILE> --method interaction_network \
--ret_summary --out <OUTPUT_DIR>

Data Format

The knowledge graph file for the data given above should be a modified version of an edge list, as shown below:

Source      Relation      Target
HGNC_ID_1   association   HGNC_ID_2
HGNC_ID_2   decreases     HGNC_ID_3
HGNC_ID_3   increases     HGNC_ID_1
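
The following sketch builds the example edge list with pandas (file name illustrative, tab-separated output assumed) and, purely for inspection, loads it into networkx as a directed, relation-labelled graph:

import networkx as nx
import pandas as pd

# The three example triples from the table above.
edges = pd.DataFrame(
    [
        ('HGNC_ID_1', 'association', 'HGNC_ID_2'),
        ('HGNC_ID_2', 'decreases', 'HGNC_ID_3'),
        ('HGNC_ID_3', 'increases', 'HGNC_ID_1'),
    ],
    columns=['Source', 'Relation', 'Target'],
)

# Write the modified edge list in the layout shown above.
edges.to_csv('network.tsv', sep='\t', index=False)

# Optional: inspect the triples as a directed, relation-labelled graph.
graph = nx.from_pandas_edgelist(
    edges, source='Source', target='Target',
    edge_attr='Relation', create_using=nx.DiGraph,
)
print(graph.edges(data=True))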

Knowledge Graph Embedding

To generate an embedding, use:

$ clep embedding kge --data <NETWORK_FILE> --design <DESIGN_FILE> \
--model_config <MODEL_CONFIG.json> --train_size 0.8 --validation_size 0.1 --out <OUTPUT_DIR>

Data Format

The config file for the KGE model must contain the model name and other optimization parameters, as shown in the template below:

{
  "model": "RotatE",
  "model_kwargs": {
    "automatic_memory_optimization": true
  },
  "model_kwargs_ranges": {
    "embedding_dim": {
      "type": "int",
      "low": 6,
      "high": 9,
      "scale": "power_two"
    }
  },
  "training_loop": "slcwa",
  "optimizer": "adam",
  "optimizer_kwargs": {
    "weight_decay": 0.0
  },
  "optimizer_kwargs_ranges": {
    "lr": {
      "type": "float",
      "low": 0.0001,
      "high": 1.0,
      "scale": "log"
    }
  },
  "loss_function": "NSSALoss",
  "loss_kwargs": {},
  "loss_kwargs_ranges": {
    "margin": {
      "type": "float",
      "low": 1,
      "high": 30,
      "q": 2.0
    },
    "adversarial_temperature": {
      "type": "float",
      "low": 0.1,
      "high": 1.0,
      "q": 0.1
    }
  },
  "regularizer": "NoRegularizer",
  "regularizer_kwargs": {},
  "regularizer_kwargs_ranges": {},
  "negative_sampler": "BasicNegativeSampler",
  "negative_sampler_kwargs": {},
  "negative_sampler_kwargs_ranges": {
    "num_negs_per_pos": {
      "type": "int",
      "low": 1,
      "high": 50,
      "q": 1
    }
  },
  "create_inverse_triples": false,
  "evaluator": "RankBasedEvaluator",
  "evaluator_kwargs": {
    "filtered": true
  },
  "evaluation_kwargs": {
    "batch_size": null
  },
  "training_kwargs": {
    "num_epochs": 1000,
    "label_smoothing": 0.0
  },
  "training_kwargs_ranges": {
    "batch_size": {
      "type": "int",
      "low": 8,
      "high": 11,
      "scale": "power_two"
    }
  },
  "stopper": "early",
  "stopper_kwargs": {
    "frequency": 25,
    "patience": 4,
    "delta": 0.002
  },
  "n_trials": 100,
  "timeout": 129600,
  "metric": "hits@10",
  "direction": "maximize",
  "sampler": "random",
  "pruner": "nop"
}

For more details on the configuration, please refer to the PyKEEN documentation.
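
Since the file passed via --model_config is plain JSON, it can also be generated or adapted programmatically. A minimal sketch, assuming the template above has been saved as model_config.json (file names illustrative; TransE is another model name PyKEEN accepts):

import json

# Load the template above and adjust only what needs to change.
with open('model_config.json') as config_file:
    config = json.load(config_file)

config['model'] = 'TransE'  # swap in a different PyKEEN model
config['n_trials'] = 20     # fewer HPO trials for a quick test run

with open('model_config_transe.json', 'w') as config_file:
    json.dump(config, config_file, indent=2)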

Classification

The classification of the provided data can be carried out using any of the following 5 machine learning models:

  1. Logistic regression with l2 regularization

  2. Logistic regression with elastic net regularization

  3. Support Vector Machines

  4. Random forest

  5. Gradient boosting

The classification also requires one of the following hyperparameter optimizers:

  1. Grid search

  2. Random search

  3. Bayesian search

To carry out the classification, use:

$ clep classify --data <EMBEDDING_FILE> --model elastic_net --optimizer grid_search \
--out <OUTPUT_DIR>

Data Format

The format of the input file for classification should look as follows:

           Component_1   Component_2   Component_3   label
Sample_1   0.48687       -1.5675       1.74140       0
Sample_2   -1.48840      5.26354       -0.4435       1
Sample_3   -0.41461      4.6261        8.104         0
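
Loaded with pandas (file name illustrative, tab-separated input assumed), the sample names become the index, the components the feature matrix, and label the class vector:

import pandas as pd

# Read the embedded samples; the first column holds the sample names.
embedding_df = pd.read_table('embedding.tsv', index_col=0)

features = embedding_df.drop(columns=['label'])  # Component_1 ... Component_n
labels = embedding_df['label']                   # class label per sample

print(features.shape, labels.tolist())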

For more information on the command line interface, please refer to the Command Line Interface section.

Programmatic Access

CLEP implements an API through which developers can utilise each module available in the CLEP framework. An example of the usage of the API functions is shown below.

import os
import pandas as pd
from clep.classification import do_classification

data = "<EMBEDDING_FILE>"  # Path to the input data file (TSV)
model = "elastic_net"  # Classification model
optimizer = "grid_search"  # Optimization function for the classification model
out = os.getcwd()  # Output directory
cv = 10  # Number of cross-validation folds
metrics = ['roc_auc', 'accuracy', 'f1_micro', 'f1_macro', 'f1']  # Metrics to be analysed in cross-validation
randomize = False  # Whether the labels in the data should be permuted

data_df = pd.read_table(data, index_col=0)

results = do_classification(data_df, model, optimizer, out, cv, metrics, randomize)

For more information on the available API functions, please refer to the Developmental Guide.