Developmental Guide

Core Module APIs

Sample Scoring

clep.sample_scoring.limma.do_limma()

Perform data manipulation before limma-based single sample (SS) scoring.

Parameters

data – Dataframe containing the gene expression values
design – Dataframe containing the design table for the data
alpha – Family-wise error rate
method – Method used for family-wise error correction
control – Label used to represent the control in the design table of the data

Returns

Dataframe containing the Single Sample scores from limma
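The alpha and method parameters control a family-wise error correction. As a rough illustration of what such a correction does, the sketch below implements two standard procedures, Bonferroni and Holm, in plain Python. It is not taken from the clep source: the function names are hypothetical, and the set of values that method actually accepts is not stated in this guide.

```python
def bonferroni_correction(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_correction(p_values, alpha=0.05):
    """Holm step-down procedure; returns True where the null is rejected."""
    m = len(p_values)
    # Visit the hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # With a 0-based rank, the cutoff for this step is alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # every remaining (larger) p-value is retained as well
    return rejected


# Illustrative p-values, not real data.
p_values = [0.001, 0.04, 0.03, 0.015]
print(bonferroni_correction(p_values, alpha=0.05))
print(holm_correction(p_values, alpha=0.05))
```

Holm is never more conservative than Bonferroni at the same alpha, which is why the choice of method can change how many features are reported as significant.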

clep.sample_scoring.ssgsea.do_ssgsea()

Run single sample GSEA (ssGSEA) on filtered gene expression data set.

Parameters

filtered_expression_data – filtered gene expression values for samples
gene_set – .gmt file containing gene sets
output_dir – output directory
processes – Number of processes
max_size – Maximum allowed number of genes shared between a gene set and the data set
min_size – Minimum allowed number of genes shared between a gene set and the data set

Returns

ssGSEA results in the respective output directory
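The min_size and max_size parameters restrict which gene sets are scored, based on how many of their genes also appear in the expression data. The sketch below shows one way such a filter could work on GMT-formatted lines (tab-separated: set name, description, then genes). The helper names are hypothetical, and treating both bounds as inclusive is an assumption rather than documented behaviour.

```python
def parse_gmt_lines(lines):
    """Parse GMT lines into {gene_set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines without any genes
        name, genes = fields[0], fields[2:]
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def filter_gene_sets(gene_sets, measured_genes, min_size, max_size):
    """Keep gene sets whose overlap with the data lies in [min_size, max_size]."""
    measured = set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        overlap = genes & measured
        # Assumption: both bounds are inclusive.
        if min_size <= len(overlap) <= max_size:
            kept[name] = sorted(overlap)
    return kept


# Toy input; the pathway names and gene lists are made up for illustration.
gmt_lines = [
    "PATHWAY_A\tna\tTP53\tBRCA1\tEGFR\n",
    "PATHWAY_B\tna\tTP53\n",
    "PATHWAY_C\tna\tMYC\tKRAS\tEGFR\tBRCA1\tTP53\n",
]
measured_genes = ["TP53", "BRCA1", "EGFR", "MYC"]
kept = filter_gene_sets(parse_gmt_lines(gmt_lines), measured_genes, min_size=2, max_size=3)
print(kept)
```

Here PATHWAY_B is dropped because only one of its genes is measured, and PATHWAY_C is dropped because four of its genes are measured, which exceeds max_size.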

clep.sample_scoring.z_score.do_z_score()

Carry out Z-score-based single sample differential expression (DE) analysis.

Parameters

data – Dataframe containing the gene expression values
design – Dataframe containing the design table for the data
control – label used for representing the control in the design table of the data
threshold – Threshold for choosing patients that are “extreme” relative to the controls

Returns

Dataframe containing the Single Sample scores using Z-scores
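To make the role of control and threshold concrete, here is a minimal sketch of Z-score-based single sample scoring. It assumes that each gene is standardised against the mean and standard deviation of the control samples, and that a sample is labelled 1, -1 or 0 depending on whether its Z-score exceeds the threshold in either direction. The function name, the dictionary-based data layout, and the decision to score only non-control samples are illustrative assumptions, not the clep implementation.

```python
from statistics import mean, stdev


def z_score_labels(expression, design, control="Control", threshold=2.0):
    """Label each non-control sample per gene as 1 (up), -1 (down) or 0.

    expression: {gene: {sample: value}}
    design:     {sample: group_label}
    """
    controls = [s for s, label in design.items() if label == control]
    patients = [s for s, label in design.items() if label != control]
    scores = {sample: {} for sample in patients}
    for gene, values in expression.items():
        # stdev needs at least two control samples.
        control_values = [values[s] for s in controls]
        mu = mean(control_values)
        sigma = stdev(control_values)
        for sample in patients:
            # A gene with no variation among the controls gets a neutral score.
            z = (values[sample] - mu) / sigma if sigma > 0 else 0.0
            if z > threshold:
                scores[sample][gene] = 1
            elif z < -threshold:
                scores[sample][gene] = -1
            else:
                scores[sample][gene] = 0
    return scores


# Toy data for illustration only.
design = {"c1": "Control", "c2": "Control", "c3": "Control", "p1": "Case", "p2": "Case"}
expression = {
    "GENE1": {"c1": 1.0, "c2": 2.0, "c3": 3.0, "p1": 10.0, "p2": 2.5},
    "GENE2": {"c1": 4.0, "c2": 5.0, "c3": 6.0, "p1": 1.0, "p2": 5.0},
}
print(z_score_labels(expression, design, control="Control", threshold=2.0))
```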

clep.sample_scoring.radical_search.do_radical_search()

Identify the samples with extreme feature values, based either on the entire dataset or on the control population.

Parameters

data – Dataframe containing the gene expression values
design – Dataframe containing the design table for the data
threshold – Threshold for choosing patients that are “extreme” relative to the controls
control – Label used to represent the control in the design table of the data
control_based – Flag to indicate if the scoring is based on the control population instead of the entire dataset

Returns

Dataframe containing the Single Sample scores using radical searching
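The difference between control-based and dataset-wide scoring can be sketched with an empirical cumulative distribution function (ECDF). The example below is an assumption-laden illustration: it treats threshold as a percentage, flags values in the lower or upper tail of the reference population, and uses a hypothetical function name and data layout. The actual clep scoring rule may differ.

```python
from bisect import bisect_right


def radical_labels(values, design, threshold=2.5, control="Control", control_based=True):
    """Label samples for one feature as -1 (low tail), 1 (high tail) or 0.

    values: {sample: value}, design: {sample: group_label}
    threshold is read as a percentage of the reference distribution (assumption).
    """
    if control_based:
        reference = [values[s] for s, label in design.items() if label == control]
    else:
        reference = list(values.values())
    ordered = sorted(reference)
    lower, upper = threshold / 100, 1 - threshold / 100
    labels = {}
    for sample, value in values.items():
        # ECDF: fraction of reference values less than or equal to this value.
        q = bisect_right(ordered, value) / len(ordered)
        if q < lower:
            labels[sample] = -1
        elif q > upper:
            labels[sample] = 1
        else:
            labels[sample] = 0
    return labels


# Toy data for illustration only.
design = {"c1": "Control", "c2": "Control", "c3": "Control", "c4": "Control",
          "c5": "Control", "p1": "Case", "p2": "Case", "p3": "Case"}
values = {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 4.0, "c5": 5.0,
          "p1": 0.5, "p2": 3.5, "p3": 9.0}
print(radical_labels(values, design, threshold=10, control_based=True))
print(radical_labels(values, design, threshold=10, control_based=False))
```

With control_based=True, sample p1 falls below every control and is flagged as low. With control_based=False the same value is compared against all eight samples and no longer lands in the lowest 10 percent, which shows how the choice of reference population changes the scores.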

KG Generation

clep.embedding.network_generator.do_graph_gen()

Generate a patient-feature network from the data using the chosen network generation method.

Parameters

data – Dataframe containing the patient-feature scores
network_gen_method – Method to generate the patient-feature network
gmt – Optional field for the path to the gmt file containing the pathway data
intersection_threshold – Threshold for creating edges in the Pathway Overlap method
kg_data – Optional field for the knowledge graph in edgelist format stored in a pandas dataframe
folder_path – Optional field for the path to a folder containing multiple knowledge graphs
jaccard_threshold – Threshold for creating edges in the Interaction Network Overlap method
summary – Flag to indicate if the summary of the patient-feature network must be returned

Returns

Dataframe containing the patient-feature network and, optionally, the summary of the patient-feature network
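As a rough picture of what a patient-feature network in edgelist form can look like, the sketch below converts per-patient feature scores into (source, relation, target) triples. It assumes scores of 1, -1 and 0, with the score itself used as the relation and no edge emitted for 0. The function name and data layout are hypothetical, and the sketch leaves out the pathway and interaction-network overlap methods governed by intersection_threshold and jaccard_threshold.

```python
def scores_to_edgelist(scores):
    """Convert {patient: {feature: score}} into (source, relation, target) triples."""
    edges = []
    for patient, features in scores.items():
        for feature, score in features.items():
            if score == 0:
                continue  # assumption: a neutral score creates no edge
            edges.append((patient, score, feature))
    return edges


# Toy scores for illustration only.
scores = {
    "p1": {"GENE1": 1, "GENE2": -1},
    "p2": {"GENE1": 0, "GENE2": 0},
}
print(scores_to_edgelist(scores))
```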

KG Embedding

clep.embedding.kge._weighted_splitter()

Split the given edgelist into training, validation and testing sets on the basis of the ratio of relations.

Parameters

edgelist – Edgelist in the form of (Source, Relation, Target)
train_size – Size of the training data
validation_size – Size of the validation data

Returns

Tuple containing the train, validation & test splits
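Splitting "on the basis of the ratio of relations" can be read as a stratified split, where each relation type is divided separately so that all three sets keep roughly the same mix of relations. The sketch below illustrates that reading. It is not the clep implementation: the function name, the seed parameter, and the rounding rule are assumptions.

```python
import random
from collections import defaultdict


def split_by_relation(edgelist, train_size=0.8, validation_size=0.1, seed=0):
    """Split (source, relation, target) triples per relation type."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for triple in edgelist:
        by_relation[triple[1]].append(triple)

    train, validation, test = [], [], []
    for triples in by_relation.values():
        rng.shuffle(triples)
        n_train = round(len(triples) * train_size)
        n_validation = round(len(triples) * validation_size)
        train.extend(triples[:n_train])
        validation.extend(triples[n_train:n_train + n_validation])
        # Whatever is left after train and validation becomes the test set.
        test.extend(triples[n_train + n_validation:])
    return train, validation, test


# Toy edgelist: 10 triples with relation 1 and 20 with relation -1.
edgelist = [(f"p{i}", 1, f"g{i}") for i in range(10)]
edgelist += [(f"p{i}", -1, f"h{i}") for i in range(20)]
train, validation, test = split_by_relation(edgelist, train_size=0.8, validation_size=0.1)
print(len(train), len(validation), len(test))
```

Because each relation is split on its own, the 1:2 ratio between the two relation types in the toy edgelist is preserved in the training, validation and test sets.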

clep.embedding.kge.do_kge()

Carry out knowledge graph embedding (KGE) on the given data.

Parameters

edgelist – Dataframe containing the patient-feature graph in edgelist format
design – Dataframe containing the design table for the data
out – Output folder for the results
model_config – Configuration file for the KGE models, in JSON format
return_patients – Flag to indicate if the final data should contain only the patients or also the features
train_size – Size of the training data for KGE, ranging from 0 to 1
validation_size – Size of the validation data for KGE, ranging from 0 to 1. It must be lower than the training size

Returns

Dataframe containing the embedding from the KGE

Classification

clep.classification.classify.do_classification()

Perform classification on the embeddings generated in the previous step.

Parameters

data – Dataframe containing the embeddings
model_name – Model that should be used for cross validation
optimizer_name – Optimizer used to optimize the classification
out_dir – Path to the output directory
validation_cv – Number of cross validation steps
scoring_metrics – Scoring metrics tested during cross validation
rand_labels – Boolean variable to indicate if labels must be randomized to check for ML stability
args – Custom arguments to the estimator model

Returns

Dictionary containing the cross validation results
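The interaction between validation_cv and rand_labels can be illustrated with a small, dependency-free cross validation loop. The sketch below reads validation_cv as the number of folds and uses a nearest-centroid classifier as a stand-in for whichever model is selected. The function names, the seed parameter, and the "test_accuracy" key are hypothetical and do not describe the dictionary that do_classification actually returns.

```python
import random


def nearest_centroid_predict(train_x, train_y, test_x):
    """Predict the class whose training centroid is closest to each test point."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda y: sq_dist(x, centroids[y])) for x in test_x]


def cross_validate(xs, ys, validation_cv=5, rand_labels=False, seed=0):
    """Return per-fold accuracies; optionally shuffle labels as a sanity check."""
    rng = random.Random(seed)
    ys = list(ys)
    if rand_labels:
        # Break the link between embeddings and labels on purpose.
        rng.shuffle(ys)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[i::validation_cv] for i in range(validation_cv)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        predictions = nearest_centroid_predict(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx], [xs[i] for i in fold]
        )
        correct = sum(pred == ys[i] for pred, i in zip(predictions, fold))
        accuracies.append(correct / len(fold))
    return {"test_accuracy": accuracies}


# Toy embeddings: two well-separated clusters of 10 points each.
xs = [(i * 0.1, i * 0.1) for i in range(10)] + [(10 + i * 0.1, 10 + i * 0.1) for i in range(10)]
ys = ["A"] * 10 + ["B"] * 10
print(cross_validate(xs, ys, validation_cv=5))
print(cross_validate(xs, ys, validation_cv=5, rand_labels=True))
```

If a model scores about as well on randomized labels as on the true ones, its apparent performance is probably not driven by real signal in the embeddings, which is the kind of stability check the rand_labels flag is meant to support.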