Help on module DeepTCR:

NAME
    DeepTCR


CLASSES
    builtins.object
        DeepTCR_base
            DeepTCR_S_base(DeepTCR_base, feature_analytics_class, vis_class)
                DeepTCR_SS
                DeepTCR_WF
            DeepTCR_U(DeepTCR_base, feature_analytics_class, vis_class)
        feature_analytics_class
        vis_class
    
    class DeepTCR_SS(DeepTCR_S_base)
     |  Method resolution order:
     |      DeepTCR_SS
     |      DeepTCR_S_base
     |      DeepTCR_base
     |      feature_analytics_class
     |      vis_class
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  Get_Train_Valid_Test(self, test_size=0.25, LOO=None, split_by_sample=False)
     |      Train/Valid/Test Splits.
     |      
     |      Divides data into train, valid, and test sets. The training set is used to
     |      train model parameters, the validation set is used for early stopping,
     |      and the test set acts as a black-box independent test set.
     |      
     |      Inputs
     |      ---------------------------------------
     |      test_size: float
     |          Fraction of sample to be used for valid and test set.
     |      
     |      LOO: int
     |          Number of samples to leave out for Leave-One-Out Cross-Validation. For example,
     |          when set to 2, 2 samples will be left out for the validation set and 2 samples will be left
     |          out for the test set.
     |      
     |      split_by_sample: bool
     |          In the case one wants to train the single-sequence classifier without mixing the train/test
     |          sets with sequences from different samples, set this parameter to True to do the train/test
     |          splits by sample.
     |      
     |      Returns
     |      ---------------------------------------
     |  
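A minimal numpy sketch (illustrative only, not DeepTCR internals) of what split_by_sample=True implies: whole samples, rather than individual sequences, are assigned to each partition.

```python
import numpy as np

# Hypothetical sample-level split: entire samples go to one partition,
# so no sample contributes sequences to both train and test.
rng = np.random.default_rng(0)
sample_id = np.array(['s1', 's1', 's2', 's2', 's3', 's3', 's4', 's4'])

samples = np.unique(sample_id)
rng.shuffle(samples)
n_test = max(1, int(round(0.25 * len(samples))))   # test_size=0.25 over samples
test_mask = np.isin(sample_id, samples[:n_test])
train_mask = ~test_mask
```

With four samples and test_size=0.25, exactly one sample (two sequences here) lands in the test set, and the partitions share no sample.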
     |  K_Fold_CrossVal(self, folds=None, epochs_min=10, batch_size=1000, stop_criterion=0.001, stop_criterion_window=10, kernel=5, trainable_embedding=True, weight_by_class=False, class_weights=None, num_fc_layers=0, units_fc=12, drop_out_rate=0.0, suppress_output=False, iterations=None, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, split_by_sample=False)
     |      K_Fold Cross-Validation for Single-Sequence Classifier
     |      
     |      If the number of sequences is small when training the single-sequence classifier, one
     |      can use K-fold cross-validation to train on all but one fold before assessing
     |      predictive performance. After this method is run, the AUC_Curve method can be run to
     |      assess the overall performance.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      folds: int
     |          Number of Folds
     |      
     |      epochs_min: int
     |          Minimum number of epochs for training neural network.
     |      
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease over the stop_criterion_window (below) required
     |          to continue training. Used as an early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          Size of convolutional kernel for first layer of convolutions.
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      weight_by_class: bool
     |          Option to weight loss by the inverse of the class frequency. Useful for
     |          unbalanced classes.
     |      
     |      class_weights: dict
     |          In order to specify custom weights for each class during training, one
     |          can provide a dictionary with these weights.
     |              e.g. {'A': 1.0, 'B': 2.0}
     |      
     |      num_fc_layers: int
     |          Number of fully connected layers following convolutional layer.
     |      
     |      units_fc: int
     |          Number of nodes per fully-connected layers following convolutional layer.
     |      
     |      drop_out_rate: float
     |          Drop-out rate for fully-connected layers.
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      iterations: int
     |          Option to specify how many iterations one wants to complete before
     |          terminating training. Useful for very large datasets.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True. This will turn off features from
     |          the sequences.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True. This will turn off features learned
     |          from gene usage.
     |      
     |      use_only_hla: bool
     |          To only use HLA features, set to True.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers for which the user can
     |          modify the number of neurons per layer. The user can either specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers,
     |                  e.g. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes.
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA.
     |      
     |      split_by_sample: bool
     |          In the case one wants to train the single-sequence classifier without mixing the train/test
     |          sets with sequences from different samples, set this parameter to True to do the train/test
     |          splits by sample.
     |      
     |      Returns
     |      ---------------------------------------
     |  
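The fold mechanics can be sketched with a hypothetical helper (not part of the DeepTCR API): each item appears in the test fold exactly once across the folds.

```python
import numpy as np

# Illustrative k-fold index generator: shuffle once, split into `folds`
# disjoint test chunks, and train on everything else each time.
def k_fold_indices(n, folds, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    for test in np.array_split(idx, folds):
        yield np.setdiff1d(idx, test), test   # (train_idx, test_idx)

splits = list(k_fold_indices(10, folds=5))
```

Because the test chunks partition the data, pooling the held-out predictions afterwards (as AUC_Curve does) covers every sequence exactly once.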
     |  Monte_Carlo_CrossVal(self, folds=5, test_size=0.25, LOO=None, epochs_min=10, batch_size=1000, stop_criterion=0.001, stop_criterion_window=10, kernel=5, trainable_embedding=True, weight_by_class=False, class_weights=None, num_fc_layers=0, units_fc=12, drop_out_rate=0.0, suppress_output=False, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, split_by_sample=False)
     |      Monte Carlo Cross-Validation for Single-Sequence Classifier
     |      
     |      If the number of sequences is small when training the single-sequence classifier, one
     |      can use Monte Carlo cross-validation to train a number of iterations before assessing
     |      predictive performance. After this method is run, the AUC_Curve method can be run to
     |      assess the overall performance.
     |      
     |      Inputs
     |      ---------------------------------------
     |      folds: int
     |          Number of iterations for Cross-Validation
     |      
     |      test_size: float
     |          Fraction of sample to be used for valid and test set.
     |      
     |      LOO: int
     |          Number of sequences to leave out for Leave-One-Out Cross-Validation.
     |      
     |      epochs_min: int
     |          Minimum number of epochs for training neural network.
     |      
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease over the stop_criterion_window (below) required
     |          to continue training. Used as an early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          Size of convolutional kernel for first layer of convolutions.
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      weight_by_class: bool
     |          Option to weight loss by the inverse of the class frequency. Useful for
     |          unbalanced classes.
     |      
     |      class_weights: dict
     |          In order to specify custom weights for each class during training, one
     |          can provide a dictionary with these weights.
     |              e.g. {'A': 1.0, 'B': 2.0}
     |      
     |      num_fc_layers: int
     |          Number of fully connected layers following convolutional layer.
     |      
     |      units_fc: int
     |          Number of nodes per fully-connected layers following convolutional layer.
     |      
     |      drop_out_rate: float
     |          Drop-out rate for fully-connected layers.
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True.
     |      
     |      use_only_hla: bool
     |          To only use HLA features, set to True.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers for which the user can
     |          modify the number of neurons per layer. The user can either specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers,
     |                  e.g. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes.
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA.
     |      
     |      split_by_sample: bool
     |          In the case one wants to train the single-sequence classifier without mixing the train/test
     |          sets with sequences from different samples, set this parameter to True to do the train/test
     |          splits by sample.
     |      
     |      
     |      Returns
     |      ---------------------------------------
     |  
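The Monte Carlo scheme, in contrast to k-fold, repeats a random train/test split `folds` times; a minimal sketch under that assumption (stand-in arrays, not DeepTCR code):

```python
import numpy as np

# Each iteration holds out a fresh random test set; predictions made while a
# sequence sits in the test set are what AUC_Curve later aggregates.
rng = np.random.default_rng(1)
n, folds, test_size = 20, 5, 0.25
times_in_test = np.zeros(n, dtype=int)

for _ in range(folds):
    perm = rng.permutation(n)
    n_test = int(round(test_size * n))   # 5 held-out sequences per iteration
    times_in_test[perm[:n_test]] += 1
```

Unlike k-fold, a given sequence may be held out several times or not at all, which is why more folds give a smoother performance estimate.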
     |  SRCC(self, s=10, kde=False)
     |      Spearman's Rank Correlation Coefficient Plot
     |      
     |      In the case one is doing a regression-based model for the sequence classifier,
     |      one can plot the predicted vs actual labeled value with this method. The method
     |      returns a plot for the regression and a value of the correlation coefficient.
     |      
     |      Inputs
     |      ---------------------------------------
     |      s: int
     |          size of points for scatterplot
     |      
     |      kde: bool
     |          To do a kernel density estimation per point and plot this as a color-scheme,
     |          set to True. Warning: this option will take longer to run.
     |      
     |      Returns
     |      ---------------------------------------
     |      corr: float
     |          Spearman's Rank Correlation Coefficient
     |  
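The statistic reported here can be sketched by hand: Spearman's rho is the Pearson correlation of the ranks (manual illustration; the helper name is mine):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks (distinct values)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# A perfectly monotone relationship has rho == 1 even though it is nonlinear.
rho = spearman_rho(np.array([1.0, 2.0, 3.0, 4.0]),
                   np.array([1.0, 4.0, 9.0, 16.0]))
```

This rank-based definition is why the method suits regression outputs: it rewards correct ordering of predicted vs actual values rather than a linear fit.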
     |  Train(self, batch_size=1000, epochs_min=10, stop_criterion=0.001, stop_criterion_window=10, kernel=5, trainable_embedding=True, weight_by_class=False, class_weights=None, num_fc_layers=0, units_fc=12, drop_out_rate=0.0, suppress_output=False, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12)
     |      Train Single-Sequence Classifier
     |      
     |      This method trains the network and saves features values at the
     |      end of training for motif analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      epochs_min: int
     |          Minimum number of epochs for training neural network.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease over the stop_criterion_window (below) required
     |          to continue training. Used as an early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          Size of convolutional kernel.
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      weight_by_class: bool
     |          Option to weight loss by the inverse of the class frequency. Useful for
     |          unbalanced classes.
     |      
     |      class_weights: dict
     |          In order to specify custom weights for each class during training, one
     |          can provide a dictionary with these weights.
     |              e.g. {'A': 1.0, 'B': 2.0}
     |      
     |      num_fc_layers: int
     |          Number of fully connected layers following convolutional layer.
     |      
     |      units_fc: int
     |          Number of nodes per fully-connected layers following convolutional layer.
     |      
     |      drop_out_rate: float
     |          Drop-out rate for fully-connected layers.
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True.
     |      
     |      use_only_hla: bool
     |          To only use HLA features, set to True.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers for which the user can
     |          modify the number of neurons per layer. The user can either specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers,
     |                  e.g. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes.
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA.
     |      
     |      
     |      Returns
     |      ---------------------------------------
     |  
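The weights implied by weight_by_class (inverse class frequency) can be sketched as follows; this illustrates the weighting idea only, not the DeepTCR implementation:

```python
import numpy as np

# Unbalanced labels: class A is 3x more frequent than class B, so B's loss
# contribution is up-weighted. Weights are scaled to average 1 over classes.
labels = np.array(['A', 'A', 'A', 'B'])
classes, counts = np.unique(labels, return_counts=True)

inv = 1.0 / counts
weights = dict(zip(classes.tolist(), (inv * len(classes) / inv.sum()).tolist()))
```

A dict in exactly this shape could alternatively be passed via the class_weights parameter to override the automatic weighting.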
     |  ----------------------------------------------------------------------
     |  Methods inherited from DeepTCR_S_base:
     |  
     |  AUC_Curve(self, by=None, filename='AUC.tif', title=None, plot=True)
     |      AUC Curve for both Sequence and Repertoire/Sample Classifiers
     |      
     |      Inputs
     |      ---------------------------------------
     |      by: str
     |          To show AUC curve for only one class, set this parameter
     |          to the name of the class label one wants to plot.
     |      
     |      filename: str
     |          Filename to save tif file of AUC curve.
     |      
     |      title: str
     |          Optional Title to put on ROC Curve.
     |      
     |      plot: bool
     |          To suppress plotting and just save the data/figure, set to False.
     |      
     |      Returns
     |      
     |      self.AUC_DF: Pandas Dataframe
     |          AUC scores are returned for each class.
     |      
     |      In addition to plotting the ROC Curve, the AUC's are saved
     |      to a csv file in the results directory called 'AUC.csv'
     |      
     |      ---------------------------------------
     |  
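The per-class AUC reported here can be illustrated via its rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative (hypothetical helper, not the library's code):

```python
import numpy as np

def auc_score(y_true, y_prob):
    """AUC as P(random positive outranks random negative), ties count half."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
a = auc_score(y, p)   # 3 of 4 positive/negative pairs are ordered correctly
```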
     |  Representative_Sequences(self, top_seq=10, motif_seq=5, unique=False)
     |      Identify most highly predicted sequences for each class and corresponding motifs.
     |      
     |      This method allows the user to query which sequences were most predicted to belong to a given class along
     |      with the motifs that were learned for these representative sequences.
     |      Of note, this method only reports sequences that were in the test set so as not to return highly predicted
     |      sequences that were over-fit in the training set. To obtain the highest predicted sequences in all the data,
     |      run a K-fold cross-validation or Monte-Carlo cross-validation before running this method. In this way,
     |      the predicted probability will have been assigned to a sequence only when it was in the independent test set.
     |      
     |      In the case of a regression task, the representative sequences for the 'high' and 'low' values for the regression
     |      model are returned in the Rep_Seq Dict.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      top_seq: int
     |          The number of top sequences to show for each class.
     |      
     |      motif_seq: int
     |          The number of sequences to use to generate each motif. The more sequences used, the noisier
     |          the seq_logo may be.
     |      
     |      unique: bool
     |          To only select uniquely enriched motifs for a given class, set this parameter to True.
     |          Otherwise, this method will return the magnitude of enriched motifs of one class vs all other classes.
     |          To learn more specific/uniquely defining motifs, set this parameter to True at the expense of returning fewer
     |          motifs.
     |      
     |      Returns
     |      
     |      self.Rep_Seq: dictionary of dataframes
     |          This dictionary of dataframes holds for each class the top sequences and their respective
     |          probabilities for all classes. These dataframes can also be found in the results folder under Rep_Sequences.
     |      
     |      self.Rep_Seq_Features_(alpha/beta): dictionary of dataframes
     |          This dictionary of dataframes holds information for which features were uniquely enriched
     |          for each class.
     |      
     |      Furthermore, the motifs are written in the results directory underneath the Motifs folder. To find the beta
     |      motifs for a given class, look under Motifs/beta/class_name/. These fasta files are labeled by the magnitude of
     |      enrichment of that given feature for that given class, followed by the feature number. These fasta files
     |      can then be visualized via weblogos at the following site: "https://weblogo.berkeley.edu/logo.cgi"
     |      
     |      ---------------------------------------
     |  
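The selection of top sequences per class amounts to sorting by predicted class probability; a small sketch with made-up sequences and probabilities:

```python
import numpy as np

# Hypothetical predictions: one row per sequence, one column per class.
seqs = np.array(['CASSL', 'CASSD', 'CASSE', 'CASSF'])
probs = np.array([[0.9, 0.1],      # P(class A), P(class B)
                  [0.2, 0.8],
                  [0.7, 0.3],
                  [0.4, 0.6]])
top_seq = 2

order_A = np.argsort(-probs[:, 0])[:top_seq]   # highest P(A) first
rep_A = seqs[order_A].tolist()
```

As the docstring notes, in DeepTCR these probabilities should come from held-out (test-set) predictions, which is why running a cross-validation first is recommended.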
     |  ----------------------------------------------------------------------
     |  Methods inherited from DeepTCR_base:
     |  
     |  Get_Data(self, directory, Load_Prev_Data=False, classes=None, type_of_data_cut='Fraction_Response', data_cut=1.0, n_jobs=40, aa_column_alpha=None, aa_column_beta=None, count_column=None, sep='\t', aggregate_by_aa=True, v_alpha_column=None, j_alpha_column=None, v_beta_column=None, j_beta_column=None, d_beta_column=None, p=None, hla=None)
     |      Get Data for DeepTCR
     |      
     |      Parse Data into appropriate inputs for neural network from directories where data is stored.
     |      
     |      Inputs
     |      ---------------------------------------
     |      directory: str
     |          Path to directory containing folders of tsv files for analysis.
     |          Folder names become labels for the files within them.
     |      
     |      Load_Prev_Data: bool
     |          Loads Previous Data.
     |      
     |      classes: list
     |          Optional selection of which sub-directories to use for analysis.
     |      
     |      
     |      type_of_data_cut: str
     |          Method by which one wants to sample from the TCRSeq File.
     |      
     |          Options are:
     |              Fraction_Response: A fraction (0 - 1) that samples the top fraction of the file by reads. For example,
     |              if one wants to sample the top 25% of reads, one would use this threshold with a data_cut = 0.25. The idea
     |              of this sampling is akin to sampling a fraction of cells from the file.
     |      
     |              Frequency_Cut: If one wants to select clones above a given frequency threshold, one would use this threshold.
     |              For example, if one wanted to only use clones above 1%, one would enter a data_cut value of 0.01.
     |      
     |              Num_Seq: If one wants to take the top N number of clones, one would use this threshold. For example,
     |              if one wanted to select the top 10 amino acid clones from each file, they would enter a data_cut value of 10.
     |      
     |              Read_Cut: If one wants to take amino acid clones with at least a certain number of reads, one would use
     |              this threshold. For example, if one wanted to only use clones with at least 10 reads, they would enter a data_cut value of 10.
     |      
     |              Read_Sum: If one wants to take a given number of reads from each file, one would use this threshold. For example,
     |              if one wants to use the sequences comprising the top 100 reads of the file, they would enter a data_cut value of 100.
     |      
     |      data_cut: float or int
     |          Value associated with the type_of_data_cut parameter.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallelized operations.
     |      
     |      aa_column_alpha: int
     |          Column where alpha chain amino acid data is stored. (0-indexed)
     |      
     |      aa_column_beta: int
     |          Column where beta chain amino acid data is stored.(0-indexed)
     |      
     |      count_column: int
     |          Column where counts are stored.
     |      
     |      sep: str
     |          Type of delimiter used in file with TCRSeq data.
     |      
     |      aggregate_by_aa: bool
     |          Choose to aggregate sequences by unique amino acid. Defaults to True. If set to False, will allow duplicates
     |          of the same amino acid sequence provided they come from different nucleotide clones.
     |      
     |      v_alpha_column: int
     |          Column where v_alpha gene information is stored.
     |      
     |      j_alpha_column: int
     |          Column where j_alpha gene information is stored.
     |      
     |      v_beta_column: int
     |          Column where v_beta gene information is stored.
     |      
     |      d_beta_column: int
     |          Column where d_beta gene information is stored.
     |      
     |      j_beta_column: int
     |          Column where j_beta gene information is stored.
     |      
     |      p: multiprocessing pool object
     |          For parallelized operations, one can pass a multiprocessing pool object
     |          to this method.
     |      
     |      hla: str
     |          In order to use HLA information as part of the TCR-seq representation, one can provide
     |          a csv file where the first column is the file name and the remaining columns hold HLA alleles
     |          for each file. By including HLA information for each repertoire being analyzed, one is able to
     |          find a representation of TCR-Seq that is more meaningful across repertoires with different HLA
     |          backgrounds.
     |      
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.class_id: ndarray
     |          array with sequence class labels
     |      
     |      self.sample_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
     |  
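The Fraction_Response cut described above can be sketched as follows (illustrative only; Get_Data's actual file handling is more involved):

```python
import numpy as np

# Keep the largest clones until they cover the top `data_cut` fraction of
# reads, akin to sampling that fraction of cells from the file.
reads = np.array([50, 30, 15, 5])   # clone read counts, any order
data_cut = 0.25                     # top 25% of reads

order = np.argsort(-reads)                        # largest clones first
cum_frac = np.cumsum(reads[order]) / reads.sum()  # cumulative read fraction
keep = order[: np.searchsorted(cum_frac, data_cut) + 1]
```

Here the 50-read clone alone covers 50% of reads, so it is the only clone kept at data_cut=0.25.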
     |  Load_Data(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, class_labels=None, sample_labels=None, freq=None, counts=None, Y=None, p=None, hla=None, w=None)
     |      Load Data programmatically into DeepTCR.
     |      
     |      DeepTCR allows direct user input of sequence data for DeepTCR analysis. By using this method,
     |      a user can load numpy arrays with relevant TCRSeq data for analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      class_labels: ndarray of strings
     |          A 1d array with class labels for the sequence (i.e. antigen-specificities)
     |      
     |      sample_labels: ndarray of strings
     |          A 1d array with sample labels for the sequence. (i.e. when loading data from different samples)
     |      
     |      counts: ndarray of ints
     |          A 1d array with the counts for each sequence, in the case they come from samples.
     |      
     |      freq: ndarray of float values
     |          A 1d array with the frequencies for each sequence, in the case they come from samples.
     |      
     |      Y: ndarray of float values
     |          In the case one wants to regress TCR sequences against a numerical label, one can provide
     |          these numerical values for this input. As of latest release, regression is only available
     |          for sequence classifier.
     |      
     |      hla: ndarray of tuples
     |          To input the HLA context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an (N,) ndarray where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to method for multiprocessing tasks.
     |      
     |      w: ndarray
     |          optional set of weights for training of autoencoder
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.label_id: ndarray
     |          array with sequence class labels
     |      
     |      self.file_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
     |  
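A minimal, hypothetical set of inputs in the shapes Load_Data expects, including the (N,) object array of HLA tuples (sequences, labels, and alleles here are made up):

```python
import numpy as np

# Parallel 1d arrays: one entry per sequence.
beta_sequences = np.array(['CASSLGTDTQYF', 'CASSPGQGNEQFF'])
class_labels = np.array(['antigen_X', 'antigen_Y'])
counts = np.array([12, 3])
freq = counts / counts.sum()        # per-sample frequencies

# HLA context as an (N,) object array whose entries are tuples of alleles.
hla = np.empty(len(beta_sequences), dtype=object)
hla[0] = ('A*01:01', 'B*08:01')
hla[1] = ('A*02:01', 'B*07:02')
```

The object-dtype array is needed because each sequence can carry a different number of alleles; a plain 2d string array would force a fixed width.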
     |  Sequence_Inference(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, batch_size=10000)
     |      Predicting outputs of sequence models on new data
     |      
     |      This method allows a user to take a pre-trained autoencoder/sequence classifier
     |      and generate outputs from the model on new data. For the autoencoder, this returns
     |      the features from the latent space. For the sequence classifier, it is the probability
     |      of belonging to each class.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an ndarray that is (N,) for each sequence where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to this method for multiprocessing tasks.
     |      
     |      batch_size: int
     |          Batch size for inference.
     |      
     |      Returns
     |      
     |      features: ndarray
     |          An n x latent_dim array of latent features (autoencoder) or an
     |          n x n_classes array of class probabilities (sequence classifier).
     |      
     |      ---------------------------------------
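As a sketch of how these inputs are shaped (the sequence and allele values below are illustrative, and the trained object name `DTCR` is hypothetical), the hla argument is an (N,) object array of allele tuples:

```python
import numpy as np

# Hypothetical CDR3 beta sequences to run inference on.
beta_sequences = np.array(['CASSLAPGATNEKLFF', 'CASSPGQGDYEQYF'])

# HLA context: an (N,) object array where each entry is a tuple of allele
# strings, as described in the docstring above.
hla = np.empty(len(beta_sequences), dtype=object)
hla[0] = ('A*01:01', 'B*08:01')
hla[1] = ('A*02:01', 'B*35:01')

# With a trained model, inference would then look like:
# features = DTCR.Sequence_Inference(beta_sequences=beta_sequences, hla=hla)
print(beta_sequences.shape, hla.shape)
```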
     |  
     |  __init__(self, Name, max_length=40, device='/device:GPU:0')
     |      Initialize Training Object.
     |      
     |      Initializes object and sets initial parameters.
     |      
     |      Inputs
     |      ---------------------------------------
     |      Name: str
     |          Name of the object.
     |      
     |      max_length: int
     |          maximum length of CDR3 sequence
     |      
     |      device: str
     |          In the case user is using tensorflow-gpu, one can
     |          specify the particular device to build the graphs on.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from DeepTCR_base:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from feature_analytics_class:
     |  
     |  Cluster(self, set='all', clustering_method='phenograph', t=None, criterion='distance', linkage_method='ward', write_to_sheets=False, sample=None, n_jobs=1, order_by_linkage=False)
     |      Clustering Sequences by Latent Features
     |      
     |      This method clusters all sequences by learned latent features from
     |      the variational autoencoder. Several clustering algorithms are included:
     |      Phenograph, DBSCAN, and hierarchical clustering. DBSCAN is implemented from the
     |      sklearn package. Hierarchical clustering is implemented from the scipy package.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      clustering_method: str
     |          Clustering algorithm to use to cluster TCR sequences. Options include
     |          phenograph, dbscan, or hierarchical. When using dbscan or hierarchical clustering
     |          and no t value is provided, a variety of thresholds are scanned to find an optimum
     |          silhouette score before a final clustering threshold is chosen.
     |      
     |      t: float
     |          If t is provided, this is used as a distance threshold for hierarchical clustering or the eps
     |          value for dbscan.
     |      
     |      criterion: str
     |          Clustering criterion as allowed by fcluster function
     |          in scipy.cluster.hierarchy module. (Used in hierarchical clustering).
     |      
     |      linkage_method: str
     |          method parameter for linkage as allowed by scipy.cluster.hierarchy.linkage
     |      
     |      write_to_sheets: bool
     |          To write clusters to separate csv files in folder named 'Clusters' under results folder, set to True.
     |          Additionally, if set to True, a csv file will be written in results directory that contains the frequency contribution
     |          of each cluster to each sample.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      order_by_linkage: bool
     |          To list sequences in the cluster dataframes by how they are related via Ward's linkage,
     |          set this value to True. Otherwise, each cluster dataframe will list the sequences in the order they
     |          were loaded into DeepTCR.
     |      
     |      Returns
     |      
     |      self.Cluster_DFs: list of Pandas dataframes
     |          Clusters by sequences/label
     |      
     |      self.var: list
     |          Variance of lengths in each cluster
     |      
     |      self.Cluster_Frequencies: Pandas dataframe
     |          A dataframe containing the frequency contribution of each cluster to each sample.
     |      
     |      self.Cluster_Assignments: ndarray
     |          Array with cluster assignments by number.
     |      
     |      ---------------------------------------
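The hierarchical path described above can be sketched with the scipy calls the docstring names; the toy latent features below are illustrative stand-ins for the learned ones:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for learned latent features: six sequences in a 3-d latent
# space forming two well-separated groups.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.1, (3, 3)),
                      rng.normal(5.0, 0.1, (3, 3))])

# Ward linkage, then fcluster with a distance criterion, mirroring the
# method's defaults (t plays the role of the distance threshold).
Z = linkage(features, method='ward')
assignments = fcluster(Z, t=2.0, criterion='distance')
print(assignments)  # first three sequences in one cluster, last three in the other
```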
     |  
     |  Motif_Identification(self, group, p_val_threshold=0.05, by_samples=False, top_seq=10)
     |      Motif Identification Supervised Classifiers
     |      
     |      This method looks for enriched features in the predetermined group
     |      and returns fasta files in directory to be used with "https://weblogo.berkeley.edu/logo.cgi"
     |      to produce seqlogos.
     |      
     |      Inputs
     |      ---------------------------------------
     |      group: string
     |          Class for analyzing enriched motifs.
     |      
     |      p_val_threshold: float
     |          Significance threshold of the Mann-Whitney U test for
     |          enriched features/motifs.
     |      
     |      by_samples: bool
     |          To run a motif identification that looks for enriched motifs at the sample
     |          level instead of the sequence level, set this parameter to True. Otherwise, the enrichment
     |          analysis will be done at the sequence level.
     |      
     |      top_seq: int
     |          The number of sequences from which to derive the learned motifs. The larger the number,
     |          the noisier the motif logo may be.
     |      
     |      Returns
     |      ---------------------------------------
     |      
     |      self.(alpha/beta)_group_features: Pandas Dataframe
     |          Sequences used to determine motifs in fasta files
     |          are stored in this dataframe where column names represent
     |          the feature number.
     |  
     |  Structural_Diversity(self, sample=None, n_jobs=1)
     |      Structural Diversity Measurements
     |      
     |      This method first clusters sequences via the phenograph algorithm before computing
     |      the number of clusters and entropy of the data over these clusters to obtain a measurement
     |      of the structural diversity within a repertoire.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      Returns
     |      
     |      self.Structural_Diversity_DF: Pandas dataframe
     |          A dataframe containing the number of clusters and entropy in each sample
     |      
     |      ---------------------------------------
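The diversity measure itself reduces to a cluster count plus Shannon entropy over cluster proportions; a minimal pure-Python sketch (toy assignments, not real phenograph output):

```python
import math

# Toy cluster assignments for one repertoire's sequences.
assignments = [0, 0, 0, 1, 1, 2]

# Proportion of the repertoire falling in each cluster.
n = len(assignments)
props = [assignments.count(c) / n for c in set(assignments)]

# Number of clusters and Shannon entropy over the cluster proportions:
# higher entropy means a more structurally diverse repertoire.
num_clusters = len(props)
entropy = -sum(p * math.log(p) for p in props)
print(num_clusters, round(entropy, 3))  # 3 clusters
```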
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from vis_class:
     |  
     |  HeatMap_Samples(self, set='all', filename='Heatmap_Samples.tif', Weight_by_Freq=True, color_dict=None, labels=True, font_scale=1.0)
     |      HeatMap of Samples
     |      
     |      This method creates a heatmap/clustermap for samples by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      Weight_by_Freq: bool
     |          Option to weight each sequence's contribution to the aggregate
     |          feature measure across a sample by its frequency.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      labels: bool
     |          Option to show names of samples on y-axis of heatmap.
     |      
     |      font_scale: float
     |          This parameter controls the font size of the row labels. If there are many rows, one can make this value
     |          smaller to get better labeling of the rows.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  HeatMap_Sequences(self, set='all', filename='Heatmap_Sequences.tif', sample_num=None, sample_num_per_class=None, color_dict=None)
     |      HeatMap of Sequences
     |      
     |      This method creates a heatmap/clustermap for sequences by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      sample_num: int
     |          Number of events to randomly sample for heatmap.
     |      
     |      sample_num_per_class: int
     |          Number of events to randomly sample per class for heatmap.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  Repertoire_Dendrogram(self, set='all', distance_metric='KL', sample=None, n_jobs=1, color_dict=None, dendrogram_radius=0.32, repertoire_radius=0.4, linkage_method='ward', gridsize=24, Load_Prev_Data=False, filename=None, sample_labels=False, gaussian_sigma=0.5, vmax=0.01, n_pad=5, lw=None, log_scale=False)
     |      Repertoire Dendrogram
     |      
     |      This method creates a visualization that shows and compares the distributions
     |      of the sample repertoires via UMAP and a provided distance metric. The underlying
     |      algorithm first applies phenograph clustering to determine the proportion of each sample
     |      within a given cluster. A distance metric is then used to compare how far apart two samples
     |      are based on their cluster proportions. Various metrics can be provided, such as KL-divergence,
     |      correlation, and Euclidean distance.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      distance_metric: str
     |          Provided distance metric to determine repertoire-level distance from cluster proportions.
     |          Options include = (KL,correlation,euclidean,wasserstein,JS).
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      dendrogram_radius: float
     |          The radius of the dendrogram in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      repertoire_radius: float
     |          The radius of the repertoire plots in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      linkage_method: str
     |          linkage method used by scipy's linkage function
     |      
     |      gridsize: int
     |          This parameter modifies the granularity of the hexbins for the repertoire density plots.
     |      
     |      Load_Prev_Data: bool
     |          If method has been run before, one can load previous data used to construct the figure for
     |          faster figure creation. This is helpful when trying to format the figure correctly and will require
     |          the user to run the method multiple times.
     |      
     |      filename: str
     |          To save dendrogram plot to results folder, enter a name for the file and the dendrogram
     |          will be saved to the results directory.
     |          i.e. dendrogram.png
     |      
     |      sample_labels: bool
     |          To show the sample labels on the dendrogram, set to True.
     |      
     |      gaussian_sigma: float
     |          The amount of blur to introduce in the plots.
     |      
     |      vmax: float
     |          Highest color density value. Color scales from 0 to vmax (i.e. larger vmax == dimmer plot)
     |      
     |      lw: float
     |          The width of the circle edge around each sample.
     |      
     |      log_scale: bool
     |          To plot the log of the counts for the UMAP density plot, set this value to True. This can be
     |          particularly helpful for visualization if the populations are very clonal.
     |      
     |      Returns
     |      
     |      self.pairwise_distances: Pandas dataframe
     |          Pairwise distances of all samples
     |      ---------------------------------------
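The sample-to-sample comparison described above can be sketched directly: given cluster-proportion vectors (illustrative values below), compute a pairwise distance matrix (Jensen-Shannon here, one of the listed options) and build the linkage from it:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import jensenshannon, squareform

# Illustrative cluster-proportion vectors for three samples (rows sum to 1).
props = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7]])

# Pairwise Jensen-Shannon distances between the samples.
n = len(props)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = jensenshannon(props[i], props[j])

# The dendrogram is then built from these pairwise distances.
Z = linkage(squareform(D), method='ward')
print(np.round(D, 3))
```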
     |  
     |  UMAP_Plot(self, set='all', by_class=False, by_cluster=False, by_sample=False, freq_weight=False, show_legend=True, scale=100, Load_Prev_Data=False, alpha=1.0, sample=None, filename=None, prob_plot=None)
     |      UMAP visualization of TCR Sequences
     |      
     |      This method displays the sequences in a 2-dimensional UMAP where the user can color code points by
     |      class label, sample label, or a previously computed clustering solution. Point size can also be made
     |      proportional to the frequency of a sequence within its sample.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      by_class: bool
     |          To color the points by their class label, set to True.
     |      
     |      by_sample: bool
     |          To color the points by their sample label, set to True.
     |      
     |      by_cluster: bool
     |          To color the points by a previously computed clustering solution, set to True.
     |      
     |      freq_weight: bool
     |          To scale size of points proportionally to their frequency, set to True.
     |      
     |      show_legend: bool
     |          To display legend, set to True.
     |      
     |      scale: float
     |          To change the size of points, change the scale parameter. This is particularly useful
     |          for finding a good display size when points are scaled by frequency.
     |      
     |      Load_Prev_Data: bool
     |          If method was run before, one can rerun this method with this parameter set
     |          to True to bypass recomputing the UMAP projection. Useful for generating
     |          different versions of the plot on the same UMAP representation.
     |      
     |      alpha: float
     |          Value between 0-1 that controls transparency of points.
     |      
     |      sample: int
     |          Number of events to sub-sample for visualization.
     |      
     |      filename: str
     |          To save umap plot to results folder, enter a name for the file and the umap
     |          will be saved to the results directory.
     |          i.e. umap.png
     |      
     |      prob_plot: str
     |          To plot the predicted probabilities for the sequences as an additional heatmap, specify
     |          the class probability one wants to visualize (i.e. if the class of interest is class A, input
     |          'A' as a string). Of note, only probabilities determined from the sequences in the test set are
     |          displayed so as not to show over-fit probabilities. Therefore, it is best to use this parameter
     |          when the set parameter is set to 'test'.
     |      
     |      
     |      Returns
     |      
     |      ---------------------------------------
    
    class DeepTCR_S_base(DeepTCR_base, feature_analytics_class, vis_class)
     |  Method resolution order:
     |      DeepTCR_S_base
     |      DeepTCR_base
     |      feature_analytics_class
     |      vis_class
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  AUC_Curve(self, by=None, filename='AUC.tif', title=None, plot=True)
     |      AUC Curve for both Sequence and Repertoire/Sample Classifiers
     |      
     |      Inputs
     |      ---------------------------------------
     |      by: str
     |          To show AUC curve for only one class, set this parameter
     |          to the name of the class label one wants to plot.
     |      
     |      filename: str
     |          Filename to save tif file of AUC curve.
     |      
     |      title: str
     |          Optional Title to put on ROC Curve.
     |      
     |      plot: bool
     |          To suppress plotting and just save the data/figure, set to False.
     |      
     |      Returns
     |      
     |      self.AUC_DF: Pandas Dataframe
     |          AUC scores are returned for each class.
     |      
     |      In addition to plotting the ROC Curve, the AUC's are saved
     |      to a csv file in the results directory called 'AUC.csv'
     |      
     |      ---------------------------------------
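The AUC reported per class is the usual rank statistic: the probability a randomly chosen positive outscores a randomly chosen negative. A minimal sketch with made-up scores, equivalent to sklearn's roc_auc_score for the binary case:

```python
# Illustrative labels and predicted scores for a binary classifier.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Count positive/negative pairs where the positive outscores the negative
# (ties count half); normalize by the number of pairs.
wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 8/9
```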
     |  
     |  Representative_Sequences(self, top_seq=10, motif_seq=5, unique=False)
     |      Identify most highly predicted sequences for each class and corresponding motifs.
     |      
     |      This method allows the user to query which sequences were most predicted to belong to a given class along
     |      with the motifs that were learned for these representative sequences.
     |      Of note, this method only reports sequences that were in the test set so as not to return highly predicted
     |      sequences that were over-fit in the training set. To obtain the highest predicted sequences in all the data,
     |      run a K-fold cross-validation or Monte-Carlo cross-validation before running this method. In this way,
     |      the predicted probability will have been assigned to a sequence only when it was in the independent test set.
     |      
     |      In the case of a regression task, the representative sequences for the 'high' and 'low' values of the regression
     |      model are returned in the Rep_Seq dict.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      top_seq: int
     |          The number of top sequences to show for each class.
     |      
     |      motif_seq: int
     |          The number of sequences to use to generate each motif. The more sequences used, the noisier
     |          the seq logo may be.
     |      
     |      unique: bool
     |          To only select for uniquely enriched motifs for a given class, set this parameter to True.
     |          Otherwise, this method will return the magnitude of enriched motifs of one class vs all other classes.
     |          To learn more specific/uniquely defining motifs, set this parameter to True at the expense of returning fewer
     |          motifs.
     |      
     |      Returns
     |      
     |      self.Rep_Seq: dictionary of dataframes
     |          This dictionary of dataframes holds for each class the top sequences and their respective
     |          probabilities for all classes. These dataframes can also be found in the results folder under Rep_Sequences.
     |      
     |      self.Rep_Seq_Features_(alpha/beta): dictionary of dataframes
     |          This dictionary of dataframes holds information for which features were uniquely enriched
     |          for each class.
     |      
     |      Furthermore, the motifs are written in the results directory underneath the Motifs folder. To find the beta
     |      motifs for a given class, look under Motifs/beta/class_name/. These fasta files are labeled by the magnitude
     |      of enrichment of that feature for that class, followed by the feature number. These fasta files
     |      can then be visualized via weblogos at the following site: "https://weblogo.berkeley.edu/logo.cgi"
     |      
     |      ---------------------------------------
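The selection itself is a per-class sort of test-set probabilities; a toy sketch (sequences and probabilities below are made up):

```python
# Illustrative test-set predictions: (sequence, P(class A)) pairs.
preds = [('CASSLAPGATNEKLFF', 0.91),
         ('CASSPGQGDYEQYF', 0.35),
         ('CASSQETQYF', 0.88),
         ('CASSLGQAYEQYF', 0.12)]

# Keep the top_seq most confidently predicted sequences for class A,
# mirroring what the Rep_Seq dataframes hold per class.
top_seq = 2
rep = sorted(preds, key=lambda x: x[1], reverse=True)[:top_seq]
print([s for s, _ in rep])  # ['CASSLAPGATNEKLFF', 'CASSQETQYF']
```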
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from DeepTCR_base:
     |  
     |  Get_Data(self, directory, Load_Prev_Data=False, classes=None, type_of_data_cut='Fraction_Response', data_cut=1.0, n_jobs=40, aa_column_alpha=None, aa_column_beta=None, count_column=None, sep='\t', aggregate_by_aa=True, v_alpha_column=None, j_alpha_column=None, v_beta_column=None, j_beta_column=None, d_beta_column=None, p=None, hla=None)
     |      Get Data for DeepTCR
     |      
     |      Parse Data into appropriate inputs for neural network from directories where data is stored.
     |      
     |      Inputs
     |      ---------------------------------------
     |      directory: str
     |          Path to a directory containing folders of tsv files for analysis.
     |          Folder names become labels for the files within them.
     |      
     |      Load_Prev_Data: bool
     |          Loads Previous Data.
     |      
     |      classes: list
     |          Optional selection of input of which sub-directories to use for analysis.
     |      
     |      
     |      type_of_data_cut: str
     |          Method by which one wants to sample from the TCRSeq File.
     |      
     |          Options are:
     |              Fraction_Response: A fraction (0 - 1) that samples the top fraction of the file by reads. For example,
     |              if one wants to sample the top 25% of reads, one would use this threshold with a data_cut = 0.25. The idea
     |              of this sampling is akin to sampling a fraction of cells from the file.
     |      
     |              Frequency_Cut: If one wants to select clones above a given frequency threshold, one would use this threshold.
     |              For example, if one wanted to only use clones above 1%, one would enter a data_cut value of 0.01.
     |      
     |              Num_Seq: If one wants to take the top N number of clones, one would use this threshold. For example,
     |              if one wanted to select the top 10 amino acid clones from each file, they would enter a data_cut value of 10.
     |      
     |              Read_Cut: If one wants to take amino acid clones with at least a certain number of reads, one would use
     |              this threshold. For example, if one wanted to only use clones with at least 10 reads, they would enter a data_cut value of 10.
     |      
     |              Read_Sum: If one wants to take a given number of reads from each file, one would use this threshold. For example,
     |              if one wants to use the sequences comprising the top 100 reads of the file, they would enter a data_cut value of 100.
     |      
     |      data_cut: float or int
     |          Value associated with the type_of_data_cut parameter.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallelized operations.
     |      
     |      aa_column_alpha: int
     |          Column where alpha chain amino acid data is stored. (0-indexed)
     |      
     |      aa_column_beta: int
     |          Column where beta chain amino acid data is stored.(0-indexed)
     |      
     |      count_column: int
     |          Column where counts are stored.
     |      
     |      sep: str
     |          Type of delimiter used in file with TCRSeq data.
     |      
     |      aggregate_by_aa: bool
     |          Choose to aggregate sequences by unique amino-acid. Defaults to True. If set to False, will allow duplicates
     |          of the same amino acid sequence given it comes from different nucleotide clones.
     |      
     |      v_alpha_column: int
     |          Column where v_alpha gene information is stored.
     |      
     |      j_alpha_column: int
     |          Column where j_alpha gene information is stored.
     |      
     |      v_beta_column: int
     |          Column where v_beta gene information is stored.
     |      
     |      d_beta_column: int
     |          Column where d_beta gene information is stored.
     |      
     |      j_beta_column: int
     |          Column where j_beta gene information is stored.
     |      
     |      p: multiprocessing pool object
     |          For parallelized operations, one can pass a multiprocessing pool object
     |          to this method.
     |      
     |      hla: str
     |          In order to use HLA information as part of the TCR-seq representation, one can provide
     |          a csv file where the first column is the file name and the remaining columns hold HLA alleles
     |          for each file. By including HLA information for each repertoire being analyzed, one is able to
     |          find a representation of TCR-Seq that is more meaningful across repertoires with different HLA
     |          backgrounds.
     |      
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.class_id: ndarray
     |          array with sequence class labels
     |      
     |      self.sample_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
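The expected layout can be sketched with a throwaway directory (labels, file names, and file contents below are hypothetical; the `DTCR` object name is an assumption): one sub-folder per class, one delimited file per sample.

```python
import pathlib
import tempfile

# Build the layout Get_Data expects: sub-folder names become class labels.
root = pathlib.Path(tempfile.mkdtemp())
for label in ('CMV_pos', 'CMV_neg'):
    sample_dir = root / label
    sample_dir.mkdir()
    # One tab-separated file per sample: here, beta CDR3 in column 0 and
    # counts in column 1 (columns are 0-indexed, matching aa_column_beta
    # and count_column).
    (sample_dir / 'sample_1.tsv').write_text('CASSLAPGATNEKLFF\t12\n')

# A hypothetical call would then be:
# DTCR.Get_Data(directory=str(root), aa_column_beta=0, count_column=1, sep='\t')
print(sorted(p.name for p in root.iterdir()))  # ['CMV_neg', 'CMV_pos']
```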
     |  
     |  Load_Data(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, class_labels=None, sample_labels=None, freq=None, counts=None, Y=None, p=None, hla=None, w=None)
     |      Load Data programmatically into DeepTCR.
     |      
     |      DeepTCR allows direct user input of sequence data for DeepTCR analysis. By using this method,
     |      a user can load numpy arrays with relevant TCRSeq data for analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      class_labels: ndarray of strings
     |          A 1d array with class labels for the sequence (i.e. antigen-specificities)
     |      
     |      sample_labels: ndarray of strings
     |          A 1d array with sample labels for the sequence. (i.e. when loading data from different samples)
     |      
     |      counts: ndarray of ints
     |          A 1d array with the counts for each sequence, in the case they come from samples.
     |      
     |      freq: ndarray of float values
     |          A 1d array with the frequencies for each sequence, in the case they come from samples.
     |      
     |      Y: ndarray of float values
     |          In the case one wants to regress TCR sequences against a numerical label, one can provide
     |          these numerical values for this input. As of latest release, regression is only available
     |          for sequence classifier.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an ndarray that is (N,) for each sequence where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to this method for multiprocessing tasks.
     |      
     |      w: ndarray
     |          optional set of weights for training of autoencoder
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.label_id: ndarray
     |          array with sequence class labels
     |      
     |      self.file_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
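The relationship between the counts and freq inputs above can be sketched with numpy alone; the sample labels and counts below are made up, and the assumption (consistent with the descriptions above) is that freq is the within-sample frequency of each sequence:

```python
import numpy as np

# Made-up sample labels and per-sequence counts for two samples.
sample_labels = np.array(['s1', 's1', 's1', 's2', 's2'])
counts = np.array([8, 1, 1, 3, 1])

# Frequency of each sequence within its own sample.
freq = np.empty(len(counts), dtype=float)
for s in np.unique(sample_labels):
    mask = sample_labels == s
    freq[mask] = counts[mask] / counts[mask].sum()
```

Within each sample the frequencies sum to 1, so either counts or freq can be supplied when sequences come from samples.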
     |  
     |  Sequence_Inference(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, batch_size=10000)
     |      Predicting outputs of sequence models on new data
     |      
     |      This method allows a user to take a pre-trained autoencoder/sequence classifier
     |      and generate outputs from the model on new data. For the autoencoder, this returns
     |      the features from the latent space. For the sequence classifier, it is the probability
     |      of belonging to each class.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an ndarray of shape (N,) where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence, e.g.:
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to method for multiprocessing tasks.
     |      
     |      batch_size: int
     |          Batch size for inference.
     |      
     |      Returns
     |      
     |      features: array
     |          An array of shape (n, latent_dim) containing features for all sequences.
     |      
     |      ---------------------------------------
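A minimal sketch of assembling inference inputs in the documented shapes; the sequences, gene calls, and alleles are made up, and the model directory 'my_model' is hypothetical, so the actual inference call is left commented out (it requires a previously trained model):

```python
import numpy as np

# Made-up CDR3-beta sequences and gene calls for inference.
beta_sequences = np.array(['CASSLGTDTQYF', 'CASSPGQGAYEQYF', 'CASSQDRGYGYTF'])
v_beta = np.array(['TRBV12-3', 'TRBV5-1', 'TRBV4-1'])
j_beta = np.array(['TRBJ2-3', 'TRBJ2-7', 'TRBJ1-2'])

# HLA context: an (N,) object array where each entry is a tuple of allele strings.
hla = np.empty(len(beta_sequences), dtype=object)
for i in range(len(hla)):
    hla[i] = ('A*01:01', 'B*35:01')

# With a previously trained model (directory name 'my_model' is hypothetical):
# from DeepTCR.DeepTCR import DeepTCR_SS
# DTCR = DeepTCR_SS('my_model')
# out = DTCR.Sequence_Inference(beta_sequences=beta_sequences, v_beta=v_beta,
#                               j_beta=j_beta, hla=hla, batch_size=10000)
```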
     |  
     |  __init__(self, Name, max_length=40, device='/device:GPU:0')
     |      Initialize Training Object.
     |      
     |      Initializes object and sets initial parameters.
     |      
     |      Inputs
     |      ---------------------------------------
     |      Name: str
     |          Name of the object.
     |      
     |      max_length: int
     |          maximum length of CDR3 sequence
     |      
     |      device: str
     |          In the case user is using tensorflow-gpu, one can
     |          specify the particular device to build the graphs on.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from DeepTCR_base:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from feature_analytics_class:
     |  
     |  Cluster(self, set='all', clustering_method='phenograph', t=None, criterion='distance', linkage_method='ward', write_to_sheets=False, sample=None, n_jobs=1, order_by_linkage=False)
     |      Clustering Sequences by Latent Features
     |      
     |      This method clusters all sequences by their learned latent features (e.g., from
     |      the variational autoencoder). Several clustering algorithms are included:
     |      Phenograph, DBSCAN, and hierarchical clustering. DBSCAN is implemented from the
     |      sklearn package. Hierarchical clustering is implemented from the scipy package.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      clustering_method: str
     |          Clustering algorithm to use to cluster TCR sequences. Options include
     |          phenograph, dbscan, or hierarchical. When using dbscan or hierarchical clustering
     |          and no t value is provided, a variety of thresholds are tried to find an optimum
     |          silhouette score before settling on a final clustering threshold.
     |      
     |      t: float
     |          If t is provided, this is used as a distance threshold for hierarchical clustering or the eps
     |          value for dbscan.
     |      
     |      criterion: str
     |          Clustering criterion as allowed by fcluster function
     |          in scipy.cluster.hierarchy module. (Used in hierarchical clustering).
     |      
     |      linkage_method: str
     |          method parameter for linkage as allowed by scipy.cluster.hierarchy.linkage
     |      
     |      write_to_sheets: bool
     |          To write clusters to separate csv files in folder named 'Clusters' under results folder, set to True.
     |          Additionally, if set to True, a csv file will be written in results directory that contains the frequency contribution
     |          of each cluster to each sample.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      order_by_linkage: bool
     |          To list sequences in the cluster dataframes by how they are related via Ward's linkage,
     |          set this value to True. Otherwise, each cluster dataframe will list the sequences by the order they
     |          were loaded into DeepTCR.
     |      
     |      Returns
     |      
     |      self.Cluster_DFs: list of Pandas dataframes
     |          Clusters by sequences/label
     |      
     |      self.var: list
     |          Variance of lengths in each cluster
     |      
     |      self.Cluster_Frequencies: Pandas dataframe
     |          A dataframe containing the frequency contribution of each cluster to each sample.
     |      
     |      self.Cluster_Assignments: ndarray
     |          Array with cluster assignments by number.
     |      
     |      ---------------------------------------
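The hierarchical option can be sketched outside DeepTCR with the same scipy functions the docstring references; the toy 2-D "latent features" and the choice of t=2 with criterion='maxclust' are for illustration only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy "latent features": two well-separated blobs of 20 points each.
features = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                      rng.normal(5.0, 0.1, (20, 2))])

Z = linkage(features, method='ward')            # linkage_method='ward'
assignments = fcluster(Z, t=2, criterion='maxclust')

n_clusters = len(np.unique(assignments))
```

With a real model one would cluster the learned latent features instead of these toy points, and use criterion='distance' with a distance threshold t as documented above.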
     |  
     |  Motif_Identification(self, group, p_val_threshold=0.05, by_samples=False, top_seq=10)
     |      Motif Identification Supervised Classifiers
     |      
     |      This method looks for enriched features in the predetermined group
     |      and returns fasta files in directory to be used with "https://weblogo.berkeley.edu/logo.cgi"
     |      to produce seqlogos.
     |      
     |      Inputs
     |      ---------------------------------------
     |      group: string
     |          Class for analyzing enriched motifs.
     |      
     |      p_val_threshold: float
     |          Significance threshold for enriched features/motifs in the
     |          Mann-Whitney U test.
     |      
     |      by_samples: bool
     |          To run a motif identification that looks for enriched motifs at the sample
     |          instead of the sequence level, set this parameter to True. Otherwise, the enrichment
     |          analysis will be done at the sequence level.
     |      
     |      top_seq: int
     |          The number of sequences from which to derive the learned motifs. The larger the number,
     |          the more noisy the motif logo may be.
     |      
     |      Returns
     |      ---------------------------------------
     |      
     |      self.(alpha/beta)_group_features: Pandas Dataframe
     |          Sequences used to determine motifs in fasta files
     |          are stored in this dataframe where column names represent
     |          the feature number.
     |  
     |  Structural_Diversity(self, sample=None, n_jobs=1)
     |      Structural Diversity Measurements
     |      
     |      This method first clusters sequences via the phenograph algorithm before computing
     |      the number of clusters and entropy of the data over these clusters to obtain a measurement
     |      of the structural diversity within a repertoire.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      Returns
     |      
     |      self.Structural_Diversity_DF: Pandas dataframe
     |          A dataframe containing the number of clusters and entropy in each sample
     |      
     |      ---------------------------------------
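The diversity measurement described above (number of clusters plus the entropy of the data over those clusters) can be sketched with numpy; structural_diversity is a hypothetical helper, and the cluster assignments are made up:

```python
import numpy as np

def structural_diversity(cluster_assignments):
    """Number of clusters and Shannon entropy of cluster proportions (in nats)."""
    _, counts = np.unique(cluster_assignments, return_counts=True)
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log(p))
    return len(counts), entropy

# A maximally even repertoire over 4 clusters has entropy ln(4).
n, h = structural_diversity([0, 1, 2, 3] * 25)
```

A repertoire concentrated in few clusters yields low entropy; an even spread over many clusters yields high entropy, which is the sense of "structural diversity" used here.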
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from vis_class:
     |  
     |  HeatMap_Samples(self, set='all', filename='Heatmap_Samples.tif', Weight_by_Freq=True, color_dict=None, labels=True, font_scale=1.0)
     |      HeatMap of Samples
     |      
     |      This method creates a heatmap/clustermap for samples by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      Weight_by_Freq: bool
     |          Option to weight each sequence used in aggregate measure
     |          of feature across sample by its frequency.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      labels: bool
     |          Option to show names of samples on y-axis of heatmap.
     |      
     |      font_scale: float
     |          This parameter controls the font size of the row labels. If there are many rows, one can make this value
     |          smaller to get better labeling of the rows.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  HeatMap_Sequences(self, set='all', filename='Heatmap_Sequences.tif', sample_num=None, sample_num_per_class=None, color_dict=None)
     |      HeatMap of Sequences
     |      
     |      This method creates a heatmap/clustermap for sequences by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      sample_num: int
     |          Number of events to randomly sample for heatmap.
     |      
     |      sample_num_per_class: int
     |          Number of events to randomly sample per class for heatmap.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  Repertoire_Dendrogram(self, set='all', distance_metric='KL', sample=None, n_jobs=1, color_dict=None, dendrogram_radius=0.32, repertoire_radius=0.4, linkage_method='ward', gridsize=24, Load_Prev_Data=False, filename=None, sample_labels=False, gaussian_sigma=0.5, vmax=0.01, n_pad=5, lw=None, log_scale=False)
     |      Repertoire Dendrogram
     |      
     |      This method creates a visualization that shows and compares the distribution
     |      of the sample repertoires via UMAP and a provided distance metric. The underlying
     |      algorithm first applies phenograph clustering to determine the proportions of each sample
     |      within a given cluster. Then a distance metric is used to compare how far two samples are
     |      based on their cluster proportions. Various metrics can be provided here such as KL-divergence,
     |      Correlation, and Euclidean.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      distance_metric: str
     |          Provided distance metric to determine repertoire-level distance from cluster proportions.
     |          Options include = (KL,correlation,euclidean,wasserstein,JS).
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      dendrogram_radius: float
     |          The radius of the dendrogram in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      repertoire_radius: float
     |          The radius of the repertoire plots in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      linkage_method: str
     |          linkage method used by scipy's linkage function
     |      
     |      gridsize: int
     |          This parameter modifies the granularity of the hexbins for the repertoire density plots.
     |      
     |      Load_Prev_Data: bool
     |          If the method has been run before, one can load the previous data used to construct the figure
     |          for faster figure creation. This is helpful when formatting the figure correctly, which typically
     |          requires running the method multiple times.
     |      
     |      filename: str
     |          To save dendrogram plot to results folder, enter a name for the file and the dendrogram
     |          will be saved to the results directory.
     |          i.e. dendrogram.png
     |      
     |      sample_labels: bool
     |          To show the sample labels on the dendrogram, set to True.
     |      
     |      gaussian_sigma: float
     |          The amount of blur to introduce in the plots.
     |      
     |      vmax: float
     |          Highest color density value. Color scales from 0 to vmax (i.e. larger vmax == dimmer plot)
     |      
     |      lw: float
     |          The width of the circle edge around each sample.
     |      
     |      log_scale: bool
     |          To plot the log of the counts for the UMAP density plot, set this value to True. This can be
     |          particularly helpful for visualization if the populations are very clonal.
     |      
     |      Returns
     |      
     |      self.pairwise_distances: Pandas dataframe
     |          Pairwise distances of all samples
     |      ---------------------------------------
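The distance step can be sketched with numpy: each sample is reduced to a vector of cluster proportions and compared with the chosen metric. KL-divergence is shown below; the small epsilon guarding empty clusters is an assumption for the sketch, not necessarily DeepTCR's exact smoothing:

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL-divergence between two cluster-proportion vectors (eps avoids log(0))."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Cluster proportions for three hypothetical samples over 4 clusters.
s1 = [0.70, 0.20, 0.10, 0.00]
s2 = [0.65, 0.25, 0.10, 0.00]
s3 = [0.05, 0.05, 0.10, 0.80]

d_close = kl(s1, s2)   # similar repertoires -> small distance
d_far = kl(s1, s3)     # dissimilar repertoires -> large distance
```

The resulting pairwise distance matrix is what the dendrogram (via scipy's linkage) is built from; swapping kl for correlation, euclidean, wasserstein, or JS distances gives the other documented options.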
     |  
     |  UMAP_Plot(self, set='all', by_class=False, by_cluster=False, by_sample=False, freq_weight=False, show_legend=True, scale=100, Load_Prev_Data=False, alpha=1.0, sample=None, filename=None, prob_plot=None)
     |      UMAP visualization of TCR Sequences
     |      
     |      This method displays the sequences in a 2-dimensional UMAP where the user can color code points by
     |      class label, sample label, or a prior computed clustering solution. The size of points can also be
     |      made proportional to the frequency of the sequence within its sample.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      by_class: bool
     |          To color the points by their class label, set to True.
     |      
     |      by_sample: bool
     |          To color the points by their sample label, set to True.
     |      
     |      by_cluster: bool
     |          To color the points by the prior computed clustering solution, set to True.
     |      
     |      freq_weight: bool
     |          To scale size of points proportionally to their frequency, set to True.
     |      
     |      show_legend: bool
     |          To display legend, set to True.
     |      
     |      scale: float
     |          To change size of points, change scale parameter. Is particularly useful
     |          when finding good display size when points are scaled by frequency.
     |      
     |      Load_Prev_Data: bool
     |          If method was run before, one can rerun this method with this parameter set
     |          to True to bypass recomputing the UMAP projection. Useful for generating
     |          different versions of the plot on the same UMAP representation.
     |      
     |      alpha: float
     |          Value between 0-1 that controls transparency of points.
     |      
     |      sample: int
     |          Number of events to sub-sample for visualization.
     |      
     |      filename: str
     |          To save umap plot to results folder, enter a name for the file and the umap
     |          will be saved to the results directory.
     |          i.e. umap.png
     |      
     |      prob_plot: str
     |          To plot the predicted probabilities for the sequences as an additional heatmap, specify
     |          the class probability one wants to visualize (i.e. if the class of interest is class A, input
     |          'A' as a string). Of note, only probabilities determined from the sequences in the test set are
     |          displayed as a means of not showing over-fit probabilities. Therefore, it is best to use this parameter
     |          when the set parameter is set to 'test'.
     |      
     |      
     |      Returns
     |      
     |      ---------------------------------------
    
    class DeepTCR_U(DeepTCR_base, feature_analytics_class, vis_class)
     |  Method resolution order:
     |      DeepTCR_U
     |      DeepTCR_base
     |      feature_analytics_class
     |      vis_class
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  KNN_Repertoire_Classifier(self, folds=5, distance_metric='KL', sample=None, n_jobs=1, plot_metrics=False, plot_type='violin', by_class=False, Load_Prev_Data=False, metrics=['Recall', 'Precision', 'F1_Score', 'AUC'])
     |      K-Nearest Neighbor Repertoire Classifier
     |      
     |      This method uses a K-Nearest Neighbor Classifier to assess the ability to predict a repertoire
     |      label given the structural distribution of the repertoire. The method returns AUC, Precision, Recall, and
     |      F1 Scores for all classes.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      folds: int
     |          Number of folds to train/test K-Nearest Classifier.
     |      
     |      distance_metric: str
     |          Provided distance metric to determine repertoire-level distance from cluster proportions.
     |          Options include = (KL,correlation,euclidean,wasserstein,JS).
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      plot_metrics: bool
     |          Toggle to show the performance metrics
     |      
     |      plot_type: str
     |          Type of plot as taken by seaborn.catplot for kind parameter:
     |          options include (strip,swarm,box,violin,boxen,point,bar,count)
     |      
     |      by_class: bool
     |          Toggle to show the performance metrics by class.
     |      
     |      Load_Prev_Data: bool
     |          If the method has been run before, one can load previous data from the clustering step to move to the
     |          KNN step faster. This can be useful when trying different distance metrics to find the optimal one
     |          for a given dataset.
     |      
     |      metrics: list
     |          List of performance measures one wants to compute.
     |          options include AUC, Precision, Recall, F1_Score
     |      
     |      Returns
     |      
     |      self.KNN_Repertoire_DF: Pandas dataframe
     |          Dataframe with all metrics of performance organized by the class label,
     |          metric (i.e. AUC), k-value (from k-nearest neighbors), and the value of the
     |          performance metric.
     |      ---------------------------------------
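The classification step amounts to k-nearest-neighbor voting over the pairwise repertoire distances. A minimal leave-one-out sketch with numpy; the distance matrix and labels are made up, and knn_predict_loo is a hypothetical helper (the real method uses cross-validation folds and reports AUC/Precision/Recall/F1):

```python
import numpy as np

def knn_predict_loo(D, labels, k=3):
    """Leave-one-out KNN prediction from a precomputed distance matrix."""
    D = np.asarray(D, float)
    labels = np.asarray(labels)
    preds = []
    for i in range(len(labels)):
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]  # exclude the sample itself
        vals, counts = np.unique(labels[neighbors], return_counts=True)
        preds.append(vals[np.argmax(counts)])         # majority vote
    return np.array(preds)

# Toy 1-D "repertoire distances": two tight groups of four samples each.
x = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
D = np.abs(x[:, None] - x[None, :])
y = np.array(['A'] * 4 + ['B'] * 4)
preds = knn_predict_loo(D, y, k=3)
```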
     |  
     |  KNN_Sequence_Classifier(self, folds=5, k_values=[1, 26, 51, 76, 101, 126, 151, 176, 201, 226, 251, 276, 301, 326, 351, 376, 401, 426, 451, 476], rep=5, plot_metrics=False, by_class=False, plot_type='violin', metrics=['Recall', 'Precision', 'F1_Score', 'AUC'], n_jobs=1, Load_Prev_Data=False)
     |      K-Nearest Neighbor Sequence Classifier
     |      
     |      This method uses a K-Nearest Neighbor Classifier to assess the ability to predict a sequence
     |      label given its sequence features. The method returns AUC, Precision, Recall, and
     |      F1 Scores for all classes.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      folds: int
     |          Number of folds to train/test K-Nearest Classifier.
     |      
     |      k_values: list
     |          List of k for KNN algorithm to assess performance metrics across.
     |      
     |      rep: int
     |          Number of iterations to train KNN classifier for each k-value.
     |      
     |      plot_metrics: bool
     |          Toggle to show the performance metrics
     |      
     |      plot_type: str
     |          Type of plot as taken by seaborn.catplot for kind parameter:
     |          options include (strip,swarm,box,violin,boxen,point,bar,count)
     |      
     |      by_class: bool
     |          Toggle to show the performance metrics by class.
     |      
     |      metrics: list
     |          List of performance measures one wants to compute.
     |          options include AUC, Precision, Recall, F1_Score
     |      
     |      n_jobs: int
     |          Number of workers to set for KNeighborsClassifier.
     |      
     |      Load_Prev_Data: bool
     |          To make new figures from old previously run analysis, set this value to True
     |          after running the method for the first time. This will load previous performance
     |          metrics from previous run.
     |      
     |      Returns
     |      
     |      self.KNN_Sequence_DF: Pandas dataframe
     |          Dataframe with all metrics of performance organized by the class label,
     |          metric (i.e. AUC), k-value (from k-nearest neighbors), and the value of the
     |          performance metric.
     |      ---------------------------------------
     |  
     |  Train_VAE(self, latent_dim=256, batch_size=10000, accuracy_min=None, Load_Prev_Data=False, suppress_output=False, trainable_embedding=True, use_only_gene=False, use_only_seq=False, use_only_hla=False, epochs_min=10, stop_criterion=0.01, stop_criterion_window=30, kernel=3, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12)
     |      Train Variational Autoencoder (VAE)
     |      
     |      This method trains the network and saves features values for sequences
     |      to create heatmaps.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      latent_dim: int
     |          Number of latent dimensions for VAE.
     |      
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      accuracy_min: float
     |          Minimum reconstruction accuracy before terminating training.
     |      
     |      Load_Prev_Data: bool
     |          Load previous feature data from prior training.
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True.
     |      
     |      use_only_hla: bool
     |          To only use hla features, set to True.
     |      
     |      epochs_min: int
     |          The minimum number of epochs to train the autoencoder.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease in determined interval (below) to continue
     |          training. Used as early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          To specify the motif k-mer of the first layer of the autoencoder, change this
     |          parameter.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers, for which the user can
     |          modify the number of neurons per layer. The user can specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers
     |                  i.e. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA
     |      
     |      Returns
     |      
     |      self.vae_features: array
     |          An array of shape (n, latent_dim) containing features for all sequences.
     |      
     |      ---------------------------------------
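The size_of_net presets above map onto per-layer widths as follows; resolve_size_of_net is a hypothetical helper for illustration, not a DeepTCR function:

```python
# Presets documented for size_of_net; 'custom' means the user passes a list.
SIZE_PRESETS = {
    'small':  [12, 32, 64],
    'medium': [32, 64, 128],
    'large':  [64, 128, 256],
}

def resolve_size_of_net(size_of_net):
    """Return the per-layer neuron counts for the 3 convolutional layers."""
    if isinstance(size_of_net, str):
        return SIZE_PRESETS[size_of_net]
    assert len(size_of_net) == 3, "custom size_of_net needs 3 layer widths"
    return list(size_of_net)
```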
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from DeepTCR_base:
     |  
     |  Get_Data(self, directory, Load_Prev_Data=False, classes=None, type_of_data_cut='Fraction_Response', data_cut=1.0, n_jobs=40, aa_column_alpha=None, aa_column_beta=None, count_column=None, sep='\t', aggregate_by_aa=True, v_alpha_column=None, j_alpha_column=None, v_beta_column=None, j_beta_column=None, d_beta_column=None, p=None, hla=None)
     |      Get Data for DeepTCR
     |      
     |      Parse Data into appropriate inputs for neural network from directories where data is stored.
     |      
     |      Inputs
     |      ---------------------------------------
     |      directory: str
     |          Path to a directory containing folders in which tsv files are present
     |          for analysis. Folder names become labels for the files within them.
     |      
     |      Load_Prev_Data: bool
     |          Loads Previous Data.
     |      
     |      classes: list
     |          Optional selection of input of which sub-directories to use for analysis.
     |      
     |      
     |      type_of_data_cut: str
     |          Method by which one wants to sample from the TCRSeq File.
     |      
     |          Options are:
     |              Fraction_Response: A fraction (0 - 1) that samples the top fraction of the file by reads. For example,
     |              if one wants to sample the top 25% of reads, one would use this threshold with a data_cut = 0.25. The idea
     |              of this sampling is akin to sampling a fraction of cells from the file.
     |      
     |              Frequency_Cut: If one wants to select clones above a given frequency threshold, one would use this threshold.
     |              For example, if one wanted to only use clones above 1%, one would enter a data_cut value of 0.01.
     |      
     |              Num_Seq: If one wants to take the top N number of clones, one would use this threshold. For example,
     |              if one wanted to select the top 10 amino acid clones from each file, they would enter a data_cut value of 10.
     |      
     |              Read_Cut: If one wants to take amino acid clones with at least a certain number of reads, one would use
     |              this threshold. For example, if one wanted to only use clones with at least 10 reads, they would enter a data_cut value of 10.
     |      
     |              Read_Sum: If one wants to take a given number of reads from each file, one would use this threshold. For example,
     |              if one wants to use the sequences comprising the top 100 reads of the file, they would enter a data_cut value of 100.
     |      
     |      data_cut: float or int
     |          Value associated with the type_of_data_cut parameter.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallelized operations.
     |      
     |      aa_column_alpha: int
     |          Column where alpha chain amino acid data is stored. (0-indexed)
     |      
     |      aa_column_beta: int
     |          Column where beta chain amino acid data is stored.(0-indexed)
     |      
     |      count_column: int
     |          Column where counts are stored.
     |      
     |      sep: str
     |          Type of delimiter used in file with TCRSeq data.
     |      
     |      aggregate_by_aa: bool
     |          Choose to aggregate sequences by unique amino-acid. Defaults to True. If set to False, will allow duplicates
     |          of the same amino acid sequence provided they come from different nucleotide clones.
     |      
     |      v_alpha_column: int
     |          Column where v_alpha gene information is stored.
     |      
     |      j_alpha_column: int
     |          Column where j_alpha gene information is stored.
     |      
     |      v_beta_column: int
     |          Column where v_beta gene information is stored.
     |      
     |      d_beta_column: int
     |          Column where d_beta gene information is stored.
     |      
     |      j_beta_column: int
     |          Column where j_beta gene information is stored.
     |      
     |      p: multiprocessing pool object
     |          For parallelized operations, one can pass a multiprocessing pool object
     |          to this method.
     |      
     |      hla: str
     |          In order to use HLA information as part of the TCR-seq representation, one can provide
     |          a csv file where the first column is the file name and the remaining columns hold HLA alleles
     |          for each file. By including HLA information for each repertoire being analyzed, one is able to
     |          find a representation of TCR-Seq that is more meaningful across repertoires with different HLA
     |          backgrounds.
     |      
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.class_id: ndarray
     |          array with sequence class labels
     |      
     |      self.sample_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
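As an illustration only, the five data-cut options above can be sketched in NumPy. DeepTCR applies these cuts internally when reading files; the clone counts here are hypothetical, and the exact interpretation of Fraction_Response (cumulative reads) is an assumption:

```python
import numpy as np

# Hypothetical clone table: read counts per amino-acid clone, sorted descending.
counts = np.array([50, 30, 10, 5, 3, 2])
freq = counts / counts.sum()  # clone frequencies (sum to 1)

# Fraction_Response, data_cut=0.5: clones comprising the top 50% of reads.
frac = counts[np.cumsum(counts) <= 0.5 * counts.sum()]

# Frequency_Cut, data_cut=0.05: clones above 5% frequency.
freq_cut = counts[freq > 0.05]

# Num_Seq, data_cut=3: the top 3 clones.
num_seq = counts[:3]

# Read_Cut, data_cut=10: clones with at least 10 reads.
read_cut = counts[counts >= 10]

# Read_Sum, data_cut=90: clones comprising the top 90 reads.
read_sum = counts[np.cumsum(counts) <= 90]
```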
     |  
     |  Load_Data(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, class_labels=None, sample_labels=None, freq=None, counts=None, Y=None, p=None, hla=None, w=None)
     |      Load Data programmatically into DeepTCR.
     |      
     |      DeepTCR allows direct user input of sequence data for DeepTCR analysis. By using this method,
     |      a user can load numpy arrays with relevant TCRSeq data for analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      class_labels: ndarray of strings
     |          A 1d array with class labels for the sequence (i.e. antigen-specificities)
     |      
     |      sample_labels: ndarray of strings
     |          A 1d array with sample labels for the sequence. (i.e. when loading data from different samples)
     |      
     |      counts: ndarray of ints
     |          A 1d array with the counts for each sequence, in the case they come from samples.
     |      
     |      freq: ndarray of float values
     |          A 1d array with the frequencies for each sequence, in the case they come from samples.
     |      
     |      Y: ndarray of float values
     |          In the case one wants to regress TCR sequences against a numerical label, one can provide
     |          these numerical values for this input. As of latest release, regression is only available
     |          for sequence classifier.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an ndarray that is (N,) for each sequence where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to this method for multiprocessing tasks.
     |      
     |      w: ndarray
     |          optional set of weights for training of autoencoder
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.label_id: ndarray
     |          array with sequence class labels
     |      
     |      self.file_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
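A minimal sketch of preparing inputs for Load_Data. The sequences, genes, labels, and alleles below are made-up examples, and the DeepTCR calls themselves are shown as comments since they assume the package is installed:

```python
import numpy as np

# Three made-up CDR3 beta sequences with class labels and counts.
beta_sequences = np.array(['CASSLAPGATNEKLFF', 'CASSQDRGYGYTF', 'CSARDRTGNGYTF'])
v_beta = np.array(['TRBV5-1', 'TRBV4-1', 'TRBV20-1'])
class_labels = np.array(['CMV', 'CMV', 'EBV'])
counts = np.array([12, 5, 3])

# HLA context: one tuple of allele strings per sequence, stored as an
# (N,) object array so tuples of different lengths are allowed.
hla = np.empty(len(beta_sequences), dtype=object)
hla[:] = [('A*01:01', 'B*08:01'),
          ('A*01:01', 'B*08:01'),
          ('A*02:01',)]

# With DeepTCR installed, the arrays would then be loaded like so:
# from DeepTCR.DeepTCR import DeepTCR_SS
# DTCR = DeepTCR_SS('example')
# DTCR.Load_Data(beta_sequences=beta_sequences, v_beta=v_beta,
#                class_labels=class_labels, counts=counts, hla=hla)
```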
     |  
     |  Sequence_Inference(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, batch_size=10000)
     |      Predicting outputs of sequence models on new data
     |      
     |      This method allows a user to take a pre-trained autoencoder/sequence classifier
     |      and generate outputs from the model on new data. For the autoencoder, this returns
     |      the features from the latent space. For the sequence classifier, it is the probability
     |      of belonging to each class.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an ndarray that is (N,) for each sequence where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to this method for multiprocessing tasks.
     |      
     |      batch_size: int
     |          Batch size for inference.
     |      
     |      Returns
     |      
     |      features: array
     |          An n x latent_dim array containing features for all sequences
     |      
     |      ---------------------------------------
     |  
     |  __init__(self, Name, max_length=40, device='/device:GPU:0')
     |      Initialize Training Object.
     |      
     |      Initializes object and sets initial parameters.
     |      
     |      Inputs
     |      ---------------------------------------
     |      Name: str
     |          Name of the object.
     |      
     |      max_length: int
     |          maximum length of CDR3 sequence
     |      
     |      device: str
     |          In the case user is using tensorflow-gpu, one can
     |          specify the particular device to build the graphs on.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from DeepTCR_base:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from feature_analytics_class:
     |  
     |  Cluster(self, set='all', clustering_method='phenograph', t=None, criterion='distance', linkage_method='ward', write_to_sheets=False, sample=None, n_jobs=1, order_by_linkage=False)
     |      Clustering Sequences by Latent Features
     |      
     |      This method clusters all sequences by learned latent features from
     |      the model (e.g. the variational autoencoder). Several clustering algorithms are included:
     |      Phenograph, DBSCAN, and hierarchical clustering. DBSCAN is implemented from the
     |      sklearn package. Hierarchical clustering is implemented from the scipy package.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      clustering_method: str
     |          Clustering algorithm to use to cluster TCR sequences. Options include
     |          phenograph, dbscan, or hierarchical. When using dbscan or hierarchical clustering,
     |          a variety of thresholds are used to find an optimum silhouette score before using a final
     |          clustering threshold when t value is not provided.
     |      
     |      t: float
     |          If t is provided, this is used as a distance threshold for hierarchical clustering or the eps
     |          value for dbscan.
     |      
     |      criterion: str
     |          Clustering criterion as allowed by fcluster function
     |          in scipy.cluster.hierarchy module. (Used in hierarchical clustering).
     |      
     |      linkage_method: str
     |          method parameter for linkage as allowed by scipy.cluster.hierarchy.linkage
     |      
     |      write_to_sheets: bool
     |          To write clusters to separate csv files in folder named 'Clusters' under results folder, set to True.
     |          Additionally, if set to True, a csv file will be written in results directory that contains the frequency contribution
     |          of each cluster to each sample.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      order_by_linkage: bool
     |          To list sequences in the cluster dataframes by how they are related via ward's linkage,
     |          set this value to True. Otherwise, each cluster dataframe will list the sequences by the order they
     |          were loaded into DeepTCR.
     |      
     |      Returns
     |      
     |      self.Cluster_DFs: list of Pandas dataframes
     |          Clusters by sequences/label
     |      
     |      self.var: list
     |          Variance of lengths in each cluster
     |      
     |      self.Cluster_Frequencies: Pandas dataframe
     |          A dataframe containing the frequency contribution of each cluster to each sample.
     |      
     |      self.Cluster_Assignments: ndarray
     |          Array with cluster assignments by number.
     |      
     |      ---------------------------------------
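Since the docstring states hierarchical clustering is implemented from scipy, the distance-threshold mode (a user-supplied t with criterion='distance') can be sketched on toy latent features; the feature matrix is synthetic, not DeepTCR output:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy "latent features": two well-separated groups of 5 sequences each.
features = np.vstack([rng.normal(0.0, 0.1, (5, 8)),
                      rng.normal(5.0, 0.1, (5, 8))])

Z = linkage(features, method='ward')                     # linkage_method='ward'
assignments = fcluster(Z, t=10.0, criterion='distance')  # t as distance threshold
```

With the groups this far apart, cutting the dendrogram at t=10.0 recovers the two groups.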
     |  
     |  Motif_Identification(self, group, p_val_threshold=0.05, by_samples=False, top_seq=10)
     |      Motif Identification for Supervised Classifiers
     |      
     |      This method looks for enriched features in the predetermined group
     |      and returns fasta files in directory to be used with "https://weblogo.berkeley.edu/logo.cgi"
     |      to produce seqlogos.
     |      
     |      Inputs
     |      ---------------------------------------
     |      group: string
     |          Class for analyzing enriched motifs.
     |      
     |      p_val_threshold: float
     |          Significance threshold for enriched features/motifs for
     |          Mann-Whitney UTest.
     |      
     |      by_samples: bool
     |          To run a motif identification that looks for enriched motifs at the sample
     |          instead of the sequence level, set this parameter to True. Otherwise, the enrichment
     |          analysis will be done at the sequence level.
     |      
     |      top_seq: int
     |          The number of sequences from which to derive the learned motifs. The larger the number,
     |          the more noisy the motif logo may be.
     |      
     |      Returns
     |      ---------------------------------------
     |      
     |      self.(alpha/beta)_group_features: Pandas Dataframe
     |          Sequences used to determine motifs in fasta files
     |          are stored in this dataframe where column names represent
     |          the feature number.
     |  
     |  Structural_Diversity(self, sample=None, n_jobs=1)
     |      Structural Diversity Measurements
     |      
     |      This method first clusters sequences via the phenograph algorithm before computing
     |      the number of clusters and entropy of the data over these clusters to obtain a measurement
     |      of the structural diversity within a repertoire.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      Returns
     |      
     |      self.Structural_Diversity_DF: Pandas dataframe
     |          A dataframe containing the number of clusters and entropy in each sample
     |      
     |      ---------------------------------------
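The diversity measure described above (cluster count plus entropy over cluster proportions) can be sketched as follows; the cluster assignments here are made up rather than produced by phenograph:

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical cluster assignments for one repertoire's sequences.
assignments = np.array([0, 0, 0, 1, 1, 2])
_, cluster_counts = np.unique(assignments, return_counts=True)

num_clusters = len(cluster_counts)                          # number of clusters
diversity = entropy(cluster_counts / cluster_counts.sum())  # Shannon entropy
```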
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from vis_class:
     |  
     |  HeatMap_Samples(self, set='all', filename='Heatmap_Samples.tif', Weight_by_Freq=True, color_dict=None, labels=True, font_scale=1.0)
     |      HeatMap of Samples
     |      
     |      This method creates a heatmap/clustermap for samples by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      Weight_by_Freq: bool
     |          Option to weight each sequence used in aggregate measure
     |          of feature across sample by its frequency.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      labels: bool
     |          Option to show names of samples on y-axis of heatmap.
     |      
     |      font_scale: float
     |          This parameter controls the font size of the row labels. If there are many rows, one can make this value
     |          smaller to get better labeling of the rows.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  HeatMap_Sequences(self, set='all', filename='Heatmap_Sequences.tif', sample_num=None, sample_num_per_class=None, color_dict=None)
     |      HeatMap of Sequences
     |      
     |      This method creates a heatmap/clustermap for sequences by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      sample_num: int
     |          Number of events to randomly sample for heatmap.
     |      
     |      sample_num_per_class: int
     |          Number of events to randomly sample per class for heatmap.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  Repertoire_Dendrogram(self, set='all', distance_metric='KL', sample=None, n_jobs=1, color_dict=None, dendrogram_radius=0.32, repertoire_radius=0.4, linkage_method='ward', gridsize=24, Load_Prev_Data=False, filename=None, sample_labels=False, gaussian_sigma=0.5, vmax=0.01, n_pad=5, lw=None, log_scale=False)
     |      Repertoire Dendrogram
     |      
     |      This method creates a visualization that shows and compares the distribution
     |      of the sample repertoires via UMAP and provided distance metric. The underlying
     |      algorithm first applies phenograph clustering to determine the proportions of the sample
     |      within a given cluster. Then a distance metric is used to compare how far two samples are
     |      based on their cluster proportions. Various metrics can be provided here such as KL-divergence,
     |      Correlation, and Euclidean.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      distance_metric: str
     |          Provided distance metric to determine repertoire-level distance from cluster proportions.
     |          Options include KL, correlation, euclidean, wasserstein, and JS.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      dendrogram_radius: float
     |          The radius of the dendrogram in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      repertoire_radius: float
     |          The radius of the repertoire plots in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      linkage_method: str
     |          linkage method used by scipy's linkage function
     |      
     |      gridsize: int
     |          This parameter modifies the granularity of the hexbins for the repertoire density plots.
     |      
     |      Load_Prev_Data: bool
     |          If method has been run before, one can load previous data used to construct the figure for
     |          faster figure creation. This is helpful when trying to format the figure correctly and will require
     |          the user to run the method multiple times.
     |      
     |      filename: str
     |          To save dendrogram plot to results folder, enter a name for the file and the dendrogram
     |          will be saved to the results directory.
     |          i.e. dendrogram.png
     |      
     |      sample_labels: bool
     |          To show the sample labels on the dendrogram, set to True.
     |      
     |      gaussian_sigma: float
     |          The amount of blur to introduce in the plots.
     |      
     |      vmax: float
     |          Highest color density value. Color scales from 0 to vmax (i.e. larger vmax == dimmer plot)
     |      
     |      lw: float
     |          The width of the circle edge around each sample.
     |      
     |      log_scale: bool
     |          To plot the log of the counts for the UMAP density plot, set this value to True. This can be
     |          particularly helpful for visualization if the populations are very clonal.
     |      
     |      Returns
     |      
     |      self.pairwise_distances: Pandas dataframe
     |          Pairwise distances of all samples
     |      ---------------------------------------
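The repertoire-level distance metrics named above can be sketched on two hypothetical cluster-proportion vectors using scipy (this mirrors the idea of the method, not its internals):

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance
from scipy.spatial.distance import jensenshannon

# Cluster proportions for two hypothetical repertoires.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

kl = entropy(p, q)                                           # KL divergence (asymmetric)
js = jensenshannon(p, q)                                     # Jensen-Shannon distance
eu = np.linalg.norm(p - q)                                   # Euclidean distance
wd = wasserstein_distance(np.arange(3), np.arange(3), p, q)  # 1d Wasserstein
```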
     |  
     |  UMAP_Plot(self, set='all', by_class=False, by_cluster=False, by_sample=False, freq_weight=False, show_legend=True, scale=100, Load_Prev_Data=False, alpha=1.0, sample=None, filename=None, prob_plot=None)
     |      UMAP visualization of TCR Sequences
     |      
     |      This method displays the sequences in a 2-dimensional UMAP where the user can color code points by
     |      class label, sample label, or a prior computed clustering solution. Size of points can also be made proportional
     |      to frequency of sequence within sample.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      by_class: bool
     |          To color the points by their class label, set to True.
     |      
     |      by_sample: bool
     |          To color the points by their sample label, set to True.
     |      
     |      by_cluster: bool
     |          To color the points by the prior computed clustering solution, set to True.
     |      
     |      freq_weight: bool
     |          To scale size of points proportionally to their frequency, set to True.
     |      
     |      show_legend: bool
     |          To display legend, set to True.
     |      
     |      scale: float
     |          To change size of points, change scale parameter. Is particularly useful
     |          when finding good display size when points are scaled by frequency.
     |      
     |      Load_Prev_Data: bool
     |          If method was run before, one can rerun this method with this parameter set
     |          to True to bypass recomputing the UMAP projection. Useful for generating
     |          different versions of the plot on the same UMAP representation.
     |      
     |      alpha: float
     |          Value between 0-1 that controls transparency of points.
     |      
     |      sample: int
     |          Number of events to sub-sample for visualization.
     |      
     |      filename: str
     |          To save umap plot to results folder, enter a name for the file and the umap
     |          will be saved to the results directory.
     |          i.e. umap.png
     |      
     |      prob_plot: str
     |          To plot the predicted probabilities for the sequences as an additional heatmap, specify
     |          the class probability one wants to visualize (i.e. if the class of interest is class A, input
     |          'A' as a string). Of note, only probabilities determined from the sequences in the test set are
     |          displayed as a means of not showing over-fit probabilities. Therefore, it is best to use this parameter
     |          when the set parameter is set to 'test'.
     |      
     |      
     |      Returns
     |      
     |      ---------------------------------------
    
    class DeepTCR_WF(DeepTCR_S_base)
     |  Method resolution order:
     |      DeepTCR_WF
     |      DeepTCR_S_base
     |      DeepTCR_base
     |      feature_analytics_class
     |      vis_class
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  Get_Train_Valid_Test(self, test_size=0.25, LOO=None, combine_train_valid=False, random_perm=False)
     |      Train/Valid/Test Splits.
     |      
     |      Divide data for train, valid, test set. Training is used to
     |      train model parameters, validation is used to set early stopping,
     |      and test acts as blackbox independent test set. In the case that
     |      Leave-One-Out (LOO) is set to a value, the valid and test sets
     |      have the same data and early stopping is based on the training loss.
     |      
     |      Inputs
     |      ---------------------------------------
     |      test_size: float
     |          Fraction of sample to be used for valid and test set.
     |      
     |      LOO: int
     |          Number of samples to leave-out in Leave-One-Out Cross-Validation. For example,
     |          when set to 2, 2 samples will be left out for the validation set and 2 samples will be left
     |          out for the test set.
     |      
     |      combine_train_valid: bool
     |          To combine the training and validation partitions into one which will be used for training
     |          and updating the model parameters, set this to True. This will also set the validation partition
     |          to the test partition. Therefore, if setting this parameter to True, change one of the training parameters
     |          to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set.
     |      
     |      Returns
     |      ---------------------------------------
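One plausible reading of the test_size parameter, sketched with NumPy. This is illustrative only: DeepTCR_WF splits at the sample level, and the exact partition sizes here are an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.array(['s%d' % i for i in range(8)])
idx = rng.permutation(len(samples))

# test_size=0.25: hold out a quarter of the samples for valid and a
# quarter for test (hypothetical partitioning for illustration).
n_hold = int(0.25 * len(samples))
test = samples[idx[:n_hold]]
valid = samples[idx[n_hold:2 * n_hold]]
train = samples[idx[2 * n_hold:]]
```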
     |  
     |  K_Fold_CrossVal(self, folds=None, epochs_min=25, batch_size=25, batch_size_update=None, stop_criterion=0.25, stop_criterion_window=10, kernel=5, num_concepts=12, weight_by_class=False, class_weights=None, iterations=None, trainable_embedding=True, accuracy_min=None, train_loss_min=None, combine_train_valid=False, num_fc_layers=0, units_fc=12, drop_out_rate=0.0, suppress_output=False, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, hinge_loss_t=0.0, convergence='validation')
     |      K_Fold Cross-Validation for Whole Sample Classifier
     |      
     |      If the number of samples is small when training the whole sample classifier, one
     |      can use K-Fold Cross-Validation to train on all but one fold before assessing
     |      predictive performance. After this method is run, the AUC_Curve method can be run to
     |      assess the overall performance.
     |      
     |      Inputs
     |      ---------------------------------------
     |      folds: int
     |          Number of Folds
     |      
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      batch_size_update: int
     |          In the case that the size of the samples are very large, one may not want to update
     |          the weights of the network as often as batches are put onto the gpu. Therefore, if
     |          one wants to update the weights less often than how often the batches of data are put onto the
     |          gpu, one can set this parameter to something other than None. An example would be if batch_size is set to 5
     |          and batch_size_update is set to 30, while only 5 samples will be put on the gpu at a time, the weights will
     |          only be updated after 30 samples have been put on the gpu. This parameter is only relevant when using
     |          GPUs for training and there are memory constraints from very large samples.
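The batch_size/batch_size_update interaction amounts to gradient accumulation; a framework-agnostic NumPy sketch with illustrative numbers (not DeepTCR internals):

```python
import numpy as np

batch_size, batch_size_update = 5, 30
accum_every = batch_size_update // batch_size  # update weights every 6 batches

w = np.zeros(3)                  # toy "weights"
grad_accum = np.zeros_like(w)
updates = 0

for step in range(1, 13):        # 12 batches of 5 samples each
    grad = np.ones(3)            # stand-in for a per-batch gradient
    grad_accum += grad
    if step % accum_every == 0:  # weights change only every 30 samples
        w -= 0.1 * grad_accum / accum_every  # averaged optimizer step
        grad_accum[:] = 0
        updates += 1
```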
     |      
     |      epochs_min: int
     |          Minimum number of epochs for training neural network.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease in determined interval (below) to continue
     |          training. Used as early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          Size of convolutional kernel for first layer of convolutions.
     |      
     |      num_concepts: int
     |          Number of concepts for multi-head attention mechanism. Depending on the expected heterogeneity of the
     |          repertoires being analyzed, one can adjust this hyperparameter.
     |      
     |      weight_by_class: bool
     |          Option to weight loss by the inverse of the class frequency. Useful for
     |          unbalanced classes.
     |      
     |      class_weights: dict
     |          In order to specify custom weights for each class during training, one
     |          can provide a dictionary with these weights.
     |              i.e. {'A':1.0,'B':2.0}
     |      
     |      iterations: int
     |          Option to specify how many iterations one wants to complete before
     |          terminating training. Useful for very large datasets.
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      accuracy_min: float
     |          Optional parameter to allow alternative training strategy until minimum
     |          training accuracy is achieved, at which point, training ceases.
     |      
     |      train_loss_min: float
     |          Optional parameter to allow alternative training strategy until minimum
     |          training loss is achieved, at which point, training ceases.
     |      
     |      hinge_loss_t: float
     |          The per sample loss minimum at which the loss of that sample is not used
     |          to penalize the model anymore. In other words, once a per sample loss has hit
     |          this value, it gets set to 0.0.
     |      
     |      convergence: str
     |          This parameter determines which loss to assess the convergence criteria on.
     |          Options are 'validation' or 'training'. This is useful in the case one wants
     |          to change the convergence criteria on the training data when the training and validation
     |          partitions have been combined and used to train the model.
     |      
     |      combine_train_valid: bool
     |          To combine the training and validation partitions into one which will be used for training
     |          and updating the model parameters, set this to True. This will also set the validation partition
     |          to the test partition. Therefore, if setting this parameter to True, change one of the training parameters
     |          to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set.
     |      
     |      num_fc_layers: int
     |          Number of fully connected layers following convolutional layer.
     |      
     |      units_fc: int
     |          Number of nodes per fully-connected layers following convolutional layer.
     |      
     |      drop_out_rate: float
     |          drop out rate for fully connected layers
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True. This will turn off features from
     |          the sequences.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True. This will turn off features learned
     |          from gene usage.
     |      
     |      use_only_hla: bool
     |          To only use HLA features, set to True.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers for which the user can
     |          modify the number of neurons per layer. The user can either specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers
     |                  i.e. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA
     |      
     |      
     |      Returns
     |      ---------------------------------------
     |  
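The hinge_loss_t behavior described above can be illustrated in isolation. This is a minimal numpy sketch of the documented thresholding (made-up loss values, not DeepTCR's internal implementation):

```python
import numpy as np

# Hypothetical per-sample losses from one training step.
per_sample_loss = np.array([0.8, 0.05, 0.3, 0.02])

# With hinge_loss_t = 0.1, any per-sample loss that has hit the
# threshold is set to 0.0 so it no longer penalizes the model.
hinge_loss_t = 0.1
clipped = np.where(per_sample_loss <= hinge_loss_t, 0.0, per_sample_loss)

print(clipped)  # the 0.05 and 0.02 losses are zeroed out
```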
     |  Monte_Carlo_CrossVal(self, folds=5, test_size=0.25, epochs_min=25, batch_size=25, batch_size_update=None, LOO=None, stop_criterion=0.25, stop_criterion_window=10, kernel=5, num_concepts=12, weight_by_class=False, class_weights=None, trainable_embedding=True, accuracy_min=None, combine_train_valid=False, train_loss_min=None, num_fc_layers=0, units_fc=12, drop_out_rate=0.0, suppress_output=False, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, hinge_loss_t=0.0, convergence='validation', seeds=None, graph_seed=None, batch_seed=None, random_perm=False)
     |      Monte Carlo Cross-Validation for Whole Sample Classifier
     |      
     |      If the number of samples is small when training the whole-sample classifier, one
     |      can use Monte Carlo cross-validation to train over a number of iterations before assessing
     |      predictive performance. After this method is run, the AUC_Curve method can be run to
     |      assess the overall performance.
     |      
     |      Inputs
     |      ---------------------------------------
     |      folds: int
     |          Number of iterations for Cross-Validation
     |      
     |      test_size: float
     |          Fraction of samples to be used for the valid and test sets.
     |      
     |      LOO: int
     |          Number of samples to leave-out in Leave-One-Out Cross-Validation
     |      
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      batch_size_update: int
     |          In the case that the samples are very large, one may not want to update the weights
     |          of the network as often as batches are put onto the GPU. To update the weights less
     |          often than batches of data are loaded onto the GPU, set this parameter to a value other
     |          than None. For example, if batch_size is set to 5 and batch_size_update is set to 30,
     |          only 5 samples will be put on the GPU at a time, but the weights will only be updated
     |          after 30 samples have been put on the GPU. This parameter is only relevant when using
     |          GPUs for training and there are memory constraints from very large samples.
     |      
     |      epochs_min: int
     |          Minimum number of epochs for training neural network.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease in determined interval (below) to continue
     |          training. Used as early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          Size of convolutional kernel for first layer of convolutions.
     |      
     |      num_concepts: int
     |          Number of concepts for multi-head attention mechanism. Depending on the expected heterogeneity of the
     |          repertoires being analyzed, one can adjust this hyperparameter.
     |      
     |      weight_by_class: bool
     |          Option to weight loss by the inverse of the class frequency. Useful for
     |          unbalanced classes.
     |      
     |      class_weights: dict
     |          In order to specify custom weights for each class during training, one
     |          can provide a dictionary with these weights.
     |              i.e. {'A':1.0,'B':2.0}
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      accuracy_min: float
     |          Optional parameter to allow alternative training strategy until minimum
     |          training accuracy is achieved, at which point, training ceases.
     |      
     |      train_loss_min: float
     |          Optional parameter to allow alternative training strategy until minimum
     |          training loss is achieved, at which point, training ceases.
     |      
     |      hinge_loss_t: float
     |          The per sample loss minimum at which the loss of that sample is not used
     |          to penalize the model anymore. In other words, once a per sample loss has hit
     |          this value, it gets set to 0.0.
     |      
     |      convergence: str
     |          This parameter determines which loss to assess the convergence criteria on.
     |          Options are 'validation' or 'training'. This is useful in the case one wants
     |          to change the convergence criteria on the training data when the training and validation
     |          partitions have been combined and used to train the model.
     |      
     |      combine_train_valid: bool
     |          To combine the training and validation partitions into one which will be used for training
     |          and updating the model parameters, set this to True. This will also set the validation partition
     |          to the test partition. Therefore, if setting this parameter to True, change one of the training parameters
     |          to set the stop training criterion (i.e. train_loss_min) to stop training based on the train set.
     |      
     |      num_fc_layers: int
     |          Number of fully connected layers following convolutional layer.
     |      
     |      units_fc: int
     |          Number of nodes per fully-connected layers following convolutional layer.
     |      
     |      drop_out_rate: float
     |          drop out rate for fully connected layers
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True. This will turn off features from
     |          the sequences.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True. This will turn off features learned
     |          from gene usage.
     |      
     |      use_only_hla: bool
     |          To only use HLA features, set to True.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers for which the user can
     |          modify the number of neurons per layer. The user can either specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers
     |                  i.e. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA
     |      
     |      
     |      Returns
     |      
     |      self.DFs_pred: dict of dataframes
     |          This method returns the samples in the test sets of the Monte-Carlo and their
     |          predicted probabilities for each class.
     |      ---------------------------------------
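Conceptually, each Monte Carlo fold draws a fresh random train/test partition of the samples. A minimal numpy sketch of that repeated splitting (hypothetical sample names; a simplified illustration, not the library's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.array([f"sample_{i}" for i in range(20)])
folds, test_size = 5, 0.25

splits = []
for _ in range(folds):
    # Each fold shuffles independently, so over many folds every
    # sample tends to appear in at least one test partition.
    shuffled = rng.permutation(samples)
    n_test = int(len(samples) * test_size)
    splits.append((shuffled[n_test:], shuffled[:n_test]))

train, test = splits[0]
print(len(train), len(test))  # 15 5
```

Averaging test-set predictions over folds is what yields per-sample probabilities for every sample that appeared in at least one test partition.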
     |  
     |  Sample_Inference(self, sample_labels, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, freq=None, counts=None, batch_size=10)
     |      Predicting outputs of sample/repertoire model on new data
     |      
     |      This method allows a user to take a pre-trained sample/repertoire classifier
     |      and generate outputs from the model on new data. This will return predicted probabilities
     |      for the given classes for the new data.
     |      
     |      To load data from directories, one can use the Get_Data method from the base class to automatically
     |      format the data into the proper format to be then input into this method.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      sample_labels: ndarray of strings
     |          A 1d array with sample labels for the sequence.
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      counts: ndarray of ints
     |          A 1d array with the counts for each sequence.
     |      
     |      freq: ndarray of float values
     |          A 1d array with the frequencies for each sequence.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be formatted
     |          as an ndarray that is (N,) for each sequence where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to method for multiprocessing tasks.
     |      
     |      batch_size: int
     |          Batch size for inference.
     |      
     |      Returns
     |      
     |      out: dict
     |          A dictionary of predicted probabilities for the respective classes
     |      
     |      self.Inference_Pred: ndarray
     |          An array with the predicted probabilities for all classes
     |      
     |      self.Inference_Sample_List: ndarray
     |          An array with the sample list corresponding to the predicted probabilities.
     |      
     |      ---------------------------------------
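The inputs above are parallel 1d arrays, one entry per sequence. A sketch of preparing them with numpy (hypothetical sequences and alleles; the commented call assumes a trained instance named DTCR, which is not defined here):

```python
import numpy as np

# One entry per sequence; sequences from the same sample share a label.
sample_labels = np.array(['pt1', 'pt1', 'pt2'])
beta_sequences = np.array(['CASSLGQAYEQYF', 'CASSPGTDTQYF', 'CASRGDSNQPQHF'])
counts = np.array([10, 3, 7])

# HLA context is an (N,) object array where each entry is a tuple
# of allele strings for that sequence.
hla = np.empty(len(sample_labels), dtype=object)
for i in range(len(sample_labels)):
    hla[i] = ('A*01:01', 'B*08:01')

# These arrays would then be passed to a trained model, e.g.:
# out = DTCR.Sample_Inference(sample_labels=sample_labels,
#                             beta_sequences=beta_sequences,
#                             counts=counts, hla=hla)
```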
     |  
     |  Train(self, batch_size=25, batch_size_update=None, epochs_min=25, stop_criterion=0.25, stop_criterion_window=10, kernel=5, num_concepts=12, weight_by_class=False, class_weights=None, trainable_embedding=True, accuracy_min=None, train_loss_min=None, num_fc_layers=0, units_fc=12, drop_out_rate=0.0, suppress_output=False, use_only_seq=False, use_only_gene=False, use_only_hla=False, size_of_net='medium', embedding_dim_aa=64, embedding_dim_genes=48, embedding_dim_hla=12, hinge_loss_t=0.0, convergence='validation')
     |      Train Whole-Sample Classifier
     |      
     |      This method trains the network and saves features values at the
     |      end of training for motif analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      batch_size: int
     |          Size of batch to be used for each training iteration of the net.
     |      
     |      batch_size_update: int
     |          In the case that the samples are very large, one may not want to update the weights
     |          of the network as often as batches are put onto the GPU. To update the weights less
     |          often than batches of data are loaded onto the GPU, set this parameter to a value other
     |          than None. For example, if batch_size is set to 5 and batch_size_update is set to 30,
     |          only 5 samples will be put on the GPU at a time, but the weights will only be updated
     |          after 30 samples have been put on the GPU. This parameter is only relevant when using
     |          GPUs for training and there are memory constraints from very large samples.
     |      
     |      epochs_min: int
     |          Minimum number of epochs for training neural network.
     |      
     |      stop_criterion: float
     |          Minimum percent decrease in determined interval (below) to continue
     |          training. Used as early stopping criterion.
     |      
     |      stop_criterion_window: int
     |          The window of data to apply the stopping criterion.
     |      
     |      kernel: int
     |          Size of convolutional kernel for first layer of convolutions.
     |      
     |      num_concepts: int
     |          Number of concepts for multi-head attention mechanism. Depending on the expected heterogeneity of the
     |          repertoires being analyzed, one can adjust this hyperparameter.
     |      
     |      weight_by_class: bool
     |          Option to weight loss by the inverse of the class frequency. Useful for
     |          unbalanced classes.
     |      
     |      class_weights: dict
     |          In order to specify custom weights for each class during training, one
     |          can provide a dictionary with these weights.
     |              i.e. {'A':1.0,'B':2.0}
     |      
     |      trainable_embedding: bool
     |          Toggle to control whether a trainable embedding layer is used or native
     |          one-hot representation for convolutional layers.
     |      
     |      accuracy_min: float
     |          Optional parameter to allow alternative training strategy until minimum
     |          training accuracy is achieved, at which point, training ceases.
     |      
     |      train_loss_min: float
     |          Optional parameter to allow alternative training strategy until minimum
     |          training loss is achieved, at which point, training ceases.
     |      
     |      hinge_loss_t: float
     |          The per sample loss minimum at which the loss of that sample is not used
     |          to penalize the model anymore. In other words, once a per sample loss has hit
     |          this value, it gets set to 0.0.
     |      
     |      convergence: str
     |          This parameter determines which loss to assess the convergence criteria on.
     |          Options are 'validation' or 'training'. This is useful in the case one wants
     |          to change the convergence criteria on the training data when the training and validation
     |          partitions have been combined and used to train the model.
     |      
     |      num_fc_layers: int
     |          Number of fully connected layers following convolutional layer.
     |      
     |      units_fc: int
     |          Number of nodes per fully-connected layers following convolutional layer.
     |      
     |      drop_out_rate: float
     |          drop out rate for fully connected layers
     |      
     |      suppress_output: bool
     |          To suppress command line output with training statistics, set to True.
     |      
     |      use_only_gene: bool
     |          To only use gene-usage features, set to True. This will turn off features from
     |          the sequences.
     |      
     |      use_only_seq: bool
     |          To only use sequence features, set to True. This will turn off features learned
     |          from gene usage.
     |      
     |      use_only_hla: bool
     |          To only use HLA features, set to True.
     |      
     |      size_of_net: list or str
     |          The convolutional portion of this network has 3 layers for which the user can
     |          modify the number of neurons per layer. The user can either specify the size of the network
     |          with the following options:
     |              - small == [12,32,64] neurons for the 3 respective layers
     |              - medium == [32,64,128] neurons for the 3 respective layers
     |              - large == [64,128,256] neurons for the 3 respective layers
     |              - custom, where the user supplies a list with the number of neurons for the respective layers
     |                  i.e. [3,3,3] would have 3 neurons for all 3 layers.
     |      
     |      embedding_dim_aa: int
     |          Learned latent dimensionality of amino-acids.
     |      
     |      embedding_dim_genes: int
     |          Learned latent dimensionality of VDJ genes
     |      
     |      embedding_dim_hla: int
     |          Learned latent dimensionality of HLA
     |      
     |      
     |      Returns
     |      ---------------------------------------
     |  
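The inverse-class-frequency weighting behind weight_by_class can be sketched with numpy (hypothetical labels; a minimal illustration of the documented scheme, not DeepTCR's internal code):

```python
import numpy as np

# Hypothetical labels for a class-imbalanced training set.
labels = np.array(['A'] * 8 + ['B'] * 2)
classes, counts = np.unique(labels, return_counts=True)

# weight_by_class weights the loss by the inverse of the class
# frequency, so the rarer class contributes more per example.
freqs = counts / counts.sum()
weights = dict(zip(classes, 1.0 / freqs))

print(weights)  # class 'B' gets a 4x larger weight than class 'A'
```

Passing a dictionary of this shape directly via class_weights (e.g. {'A': 1.0, 'B': 2.0}) overrides the automatic scheme with custom weights.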
     |  ----------------------------------------------------------------------
     |  Methods inherited from DeepTCR_S_base:
     |  
     |  AUC_Curve(self, by=None, filename='AUC.tif', title=None, plot=True)
     |      AUC Curve for both Sequence and Repertoire/Sample Classifiers
     |      
     |      Inputs
     |      ---------------------------------------
     |      by: str
     |          To show AUC curve for only one class, set this parameter
     |          to the name of the class label one wants to plot.
     |      
     |      filename: str
     |          Filename to save tif file of AUC curve.
     |      
     |      title: str
     |          Optional Title to put on ROC Curve.
     |      
     |      plot: bool
     |          To suppress plotting and just save the data/figure, set to False.
     |      
     |      Returns
     |      
     |      self.AUC_DF: Pandas Dataframe
     |          AUC scores are returned for each class.
     |      
     |      In addition to plotting the ROC Curve, the AUC's are saved
     |      to a csv file in the results directory called 'AUC.csv'
     |      
     |      ---------------------------------------
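The AUC values reported by this method can be reproduced for the binary case with a small rank-based helper (a self-contained sketch using the Mann-Whitney formulation; toy labels and probabilities, not DeepTCR's plotting code):

```python
import numpy as np

def auc_score(y_true, y_prob):
    """Binary AUC via the rank-sum (Mann-Whitney U) formulation."""
    y_true = np.asarray(y_true)
    pos = np.asarray(y_prob)[y_true == 1]
    neg = np.asarray(y_prob)[y_true == 0]
    # Fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half-correct.
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```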
     |  
     |  Representative_Sequences(self, top_seq=10, motif_seq=5, unique=False)
     |      Identify most highly predicted sequences for each class and corresponding motifs.
     |      
     |      This method allows the user to query which sequences were most predicted to belong to a given class along
     |      with the motifs that were learned for these representative sequences.
     |      Of note, this method only reports sequences that were in the test set so as not to return highly predicted
     |      sequences that were over-fit in the training set. To obtain the highest predicted sequences in all the data,
     |      run a K-fold cross-validation or Monte-Carlo cross-validation before running this method. In this way,
     |      the predicted probability will have been assigned to a sequence only when it was in the independent test set.
     |      
     |      In the case of a regression task, the representative sequences for the 'high' and 'low' values for the
     |      regression model are returned in the Rep_Seq Dict.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      top_seq: int
     |          The number of top sequences to show for each class.
     |      
     |      motif_seq: int
     |          The number of sequences to use to generate each motif. The more sequences used, the noisier
     |          the seq_logo may be.
     |      
     |      unique: bool
     |          To only select for uniquely enriched motifs for a given class, set this parameter to True.
     |          Otherwise, this method will return the magnitude of enriched motifs of one class vs all other classes.
     |          To learn more specific/uniquely defining motifs, set this parameter to True at the expense of returning less
     |          motifs.
     |      
     |      Returns
     |      
     |      self.Rep_Seq: dictionary of dataframes
     |          This dictionary of dataframes holds for each class the top sequences and their respective
     |          probabilities for all classes. These dataframes can also be found in the results folder under Rep_Sequences.
     |      
     |      self.Rep_Seq_Features_(alpha/beta): dictionary of dataframes
     |          This dictionary of dataframes holds information for which features were uniquely enriched
     |          for each class.
     |      
     |      Furthermore, the motifs are written in the results directory underneath the Motifs folder. To find the beta
     |      motifs for a given class, look under Motifs/beta/class_name/. These fasta files are labeled by the magnitude
     |      of enrichment of that given feature for that given class, followed by the feature number. These fasta files
     |      can then be visualized via weblogos at the following site: "https://weblogo.berkeley.edu/logo.cgi"
     |      
     |      ---------------------------------------
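The selection of the top_seq highest-predicted sequences per class can be sketched with numpy (hypothetical sequences, probabilities, and class names; a simplified illustration, not the library's internal code):

```python
import numpy as np

# Hypothetical test-set predictions: one row per sequence,
# one column per class, alongside the sequences themselves.
seqs = np.array(['CASSA', 'CASSB', 'CASSC', 'CASSD'])
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3],
                  [0.1, 0.9]])
class_names = ['responder', 'non-responder']

top_seq = 2
rep_seq = {}
for j, name in enumerate(class_names):
    # Highest predicted probability for this class first.
    order = np.argsort(probs[:, j])[::-1][:top_seq]
    rep_seq[name] = list(seqs[order])

print(rep_seq)
```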
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from DeepTCR_base:
     |  
     |  Get_Data(self, directory, Load_Prev_Data=False, classes=None, type_of_data_cut='Fraction_Response', data_cut=1.0, n_jobs=40, aa_column_alpha=None, aa_column_beta=None, count_column=None, sep='\t', aggregate_by_aa=True, v_alpha_column=None, j_alpha_column=None, v_beta_column=None, j_beta_column=None, d_beta_column=None, p=None, hla=None)
     |      Get Data for DeepTCR
     |      
     |      Parse Data into appropriate inputs for neural network from directories where data is stored.
     |      
     |      Inputs
     |      ---------------------------------------
     |      directory: str
     |          Path to directory containing folders in which tsv files are present
     |          for analysis. Folder names become labels for the files within them.
     |      
     |      Load_Prev_Data: bool
     |          Loads Previous Data.
     |      
     |      classes: list
     |          Optional selection of input of which sub-directories to use for analysis.
     |      
     |      type_of_data_cut: str
     |          Method by which one wants to sample from the TCRSeq File.
     |      
     |          Options are:
     |              Fraction_Response: A fraction (0 - 1) that samples the top fraction of the file by reads. For example,
     |              if one wants to sample the top 25% of reads, one would use this threshold with a data_cut = 0.25. The idea
     |              of this sampling is akin to sampling a fraction of cells from the file.
     |      
     |              Frequency_Cut: If one wants to select clones above a given frequency threshold, one would use this threshold.
     |              For example, if one wanted to only use clones above 1%, one would enter a data_cut value of 0.01.
     |      
     |              Num_Seq: If one wants to take the top N number of clones, one would use this threshold. For example,
     |              if one wanted to select the top 10 amino acid clones from each file, they would enter a data_cut value of 10.
     |      
     |              Read_Cut: If one wants to take amino acid clones with at least a certain number of reads, one would use
     |              this threshold. For example, if one wanted to only use clones with at least 10 reads, they would enter a data_cut value of 10.
     |      
     |              Read_Sum: If one wants to take a given number of reads from each file, one would use this threshold. For example,
     |              if one wants to use the sequences comprising the top 100 reads of the file, they would enter a data_cut value of 100.
     |      
     |      data_cut: float or int
     |          Value associated with the type_of_data_cut parameter.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallelized operations.
     |      
     |      aa_column_alpha: int
     |          Column where alpha chain amino acid data is stored. (0-indexed)
     |      
     |      aa_column_beta: int
     |          Column where beta chain amino acid data is stored.(0-indexed)
     |      
     |      count_column: int
     |          Column where counts are stored.
     |      
     |      sep: str
     |          Type of delimiter used in file with TCRSeq data.
     |      
     |      aggregate_by_aa: bool
     |          Choose to aggregate sequences by unique amino-acid. Defaults to True. If set to False, will allow duplicates
     |          of the same amino acid sequence given it comes from different nucleotide clones.
     |      
     |      v_alpha_column: int
     |          Column where v_alpha gene information is stored.
     |      
     |      j_alpha_column: int
     |          Column where j_alpha gene information is stored.
     |      
     |      v_beta_column: int
     |          Column where v_beta gene information is stored.
     |      
     |      d_beta_column: int
     |          Column where d_beta gene information is stored.
     |      
     |      j_beta_column: int
     |          Column where j_beta gene information is stored.
     |      
     |      p: multiprocessing pool object
     |          For parallelized operations, one can pass a multiprocessing pool object
     |          to this method.
     |      
     |      hla: str
     |          In order to use HLA information as part of the TCR-seq representation, one can provide
     |          a csv file where the first column is the file name and the remaining columns hold HLA alleles
     |          for each file. By including HLA information for each repertoire being analyzed, one is able to
     |          find a representation of TCR-Seq that is more meaningful across repertoires with different HLA
     |          backgrounds.
     |      
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.class_id: ndarray
     |          array with sequence class labels
     |      
     |      self.sample_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
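One plausible reading of the Fraction_Response cut can be sketched with numpy: sort clones by reads and keep clones until the cumulative read fraction reaches data_cut (hypothetical read counts; an illustration of the documented idea, not DeepTCR's parsing code):

```python
import numpy as np

# Hypothetical read counts for the clones in one TCRSeq file.
counts = np.array([50, 30, 10, 5, 3, 2])
data_cut = 0.25  # Fraction_Response: sample the top 25% of reads

# Sort clones by reads, then keep clones until the cumulative read
# fraction reaches the cut, akin to sampling that fraction of cells.
order = np.argsort(counts)[::-1]
cum_frac = np.cumsum(counts[order]) / counts.sum()
keep = order[: np.searchsorted(cum_frac, data_cut) + 1]

print(counts[keep])  # the top clone alone already covers 25% of reads
```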
     |  
     |  Load_Data(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, class_labels=None, sample_labels=None, freq=None, counts=None, Y=None, p=None, hla=None, w=None)
     |      Load Data programmatically into DeepTCR.
     |      
     |      DeepTCR allows direct user input of sequence data for DeepTCR analysis. By using this method,
     |      a user can load numpy arrays with relevant TCRSeq data for analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      class_labels: ndarray of strings
     |          A 1d array with class labels for the sequence (i.e. antigen-specificities)
     |      
     |      sample_labels: ndarray of strings
     |          A 1d array with sample labels for the sequence. (i.e. when loading data from different samples)
     |      
     |      counts: ndarray of ints
     |          A 1d array with the counts for each sequence, in the case they come from samples.
     |      
     |      freq: ndarray of float values
     |          A 1d array with the frequencies for each sequence, in the case they come from samples.
     |      
     |      Y: ndarray of float values
     |          In the case one wants to regress TCR sequences against a numerical label, one can provide
     |          these numerical values for this input. As of latest release, regression is only available
     |          for sequence classifier.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be
     |          formatted as an ndarray of shape (N,) where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence, e.g.:
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to method for multiprocessing tasks.
     |      
     |      w: ndarray
     |          optional set of weights for training of autoencoder
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.label_id: ndarray
     |          array with sequence class labels
     |      
     |      self.file_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
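As a minimal sketch of preparing Load_Data inputs (the sequences, labels, counts, and object name below are hypothetical), each argument is a plain 1d numpy array of equal length:

```python
import numpy as np

# Hypothetical CDR3 beta sequences with per-sequence class labels and counts.
beta_sequences = np.array(['CASSLGTDTQYF', 'CASSPGQGAYEQYF', 'CSARDRTGNGYTF'])
class_labels = np.array(['CMV', 'CMV', 'EBV'])
counts = np.array([12, 5, 30])

# Frequencies are derived per sample; with a single sample they sum to 1.
freq = counts / counts.sum()

# With DeepTCR installed, these arrays can be passed directly (sketch):
# from DeepTCR.DeepTCR import DeepTCR_SS
# DTCR = DeepTCR_SS('example_run')
# DTCR.Load_Data(beta_sequences=beta_sequences,
#                class_labels=class_labels, counts=counts)
```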
     |  
     |  Sequence_Inference(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, batch_size=10000)
     |      Predicting outputs of sequence models on new data
     |      
     |      This method allows a user to take a pre-trained autoencoder/sequence classifier
     |      and generate outputs from the model on new data. For the autoencoder, this returns
     |      the features from the latent space. For the sequence classifier, it is the probability
     |      of belonging to each class.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be
     |          formatted as an ndarray of shape (N,) where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence, e.g.:
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to method for multiprocessing tasks.
     |      
     |      batch_size: int
     |          Batch size for inference.
     |      
     |      Returns
     |      
     |      features: array
     |          An n x latent_dim array containing the features for all sequences.
     |      
     |      ---------------------------------------
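Building the (N,) hla input takes a little care, since each entry must be a tuple rather than a row of a 2d array; one way is an object-dtype array (the allele values here are illustrative):

```python
import numpy as np

beta_sequences = np.array(['CASSLGTDTQYF', 'CASSPGQGAYEQYF'])

# hla must be an (N,) array where each entry is a tuple of allele strings.
hla = np.empty(len(beta_sequences), dtype=object)
hla[0] = ('A*01:01', 'B*35:01')
hla[1] = ('A*11:01', 'C*04:01')

# With a trained model loaded (sketch; DTCR is a hypothetical object):
# features = DTCR.Sequence_Inference(beta_sequences=beta_sequences,
#                                    hla=hla, batch_size=10000)
```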
     |  
     |  __init__(self, Name, max_length=40, device='/device:GPU:0')
     |      Initialize Training Object.
     |      
     |      Initializes object and sets initial parameters.
     |      
     |      Inputs
     |      ---------------------------------------
     |      Name: str
     |          Name of the object.
     |      
     |      max_length: int
     |          maximum length of CDR3 sequence
     |      
     |      device: str
     |          In the case user is using tensorflow-gpu, one can
     |          specify the particular device to build the graphs on.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from DeepTCR_base:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from feature_analytics_class:
     |  
     |  Cluster(self, set='all', clustering_method='phenograph', t=None, criterion='distance', linkage_method='ward', write_to_sheets=False, sample=None, n_jobs=1, order_by_linkage=False)
     |      Clustering Sequences by Latent Features
     |      
     |      This method clusters all sequences by their learned latent features, such as
     |      those from the variational autoencoder. Several clustering algorithms are included:
     |      Phenograph, DBSCAN, and hierarchical clustering. DBSCAN is implemented from the
     |      sklearn package; hierarchical clustering is implemented from the scipy package.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      clustering_method: str
     |          Clustering algorithm to use to cluster TCR sequences. Options include
     |          phenograph, dbscan, or hierarchical. When using dbscan or hierarchical clustering
     |          and no t value is provided, a variety of thresholds are scanned to find an optimum
     |          silhouette score before a final clustering threshold is applied.
     |      
     |      t: float
     |          If t is provided, this is used as a distance threshold for hierarchical clustering or the eps
     |          value for dbscan.
     |      
     |      criterion: str
     |          Clustering criterion as allowed by fcluster function
     |          in scipy.cluster.hierarchy module. (Used in hierarchical clustering).
     |      
     |      linkage_method: str
     |          method parameter for linkage as allowed by scipy.cluster.hierarchy.linkage
     |      
     |      write_to_sheets: bool
     |          To write clusters to separate csv files in folder named 'Clusters' under results folder, set to True.
     |          Additionally, if set to True, a csv file will be written in results directory that contains the frequency contribution
     |          of each cluster to each sample.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      order_by_linkage: bool
     |          To list sequences in the cluster dataframes by how they are related via ward's linkage,
     |          set this value to True. Otherwise, each cluster dataframe will list the sequences by the order they
     |          were loaded into DeepTCR.
     |      
     |      Returns
     |      
     |      self.Cluster_DFs: list of Pandas dataframes
     |          Clusters by sequences/label
     |      
     |      self.var: list
     |          Variance of lengths in each cluster
     |      
     |      self.Cluster_Frequencies: Pandas dataframe
     |          A dataframe containing the frequency contribution of each cluster to each sample.
     |      
     |      self.Cluster_Assignments: ndarray
     |          Array with cluster assignments by number.
     |      
     |      ---------------------------------------
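The hierarchical option with a provided t can be sketched with scipy alone; the toy latent features below stand in for a trained model's latent space (hypothetical data, not DeepTCR output):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy groups standing in for autoencoder latent features.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 0.1, (10, 4)),
                      rng.normal(5, 0.1, (10, 4))])

# Hierarchical clustering with ward linkage and a distance threshold,
# mirroring clustering_method='hierarchical' with a provided t.
Z = linkage(features, method='ward')
assignments = fcluster(Z, t=2.0, criterion='distance')
print(len(set(assignments)))  # → 2
```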
     |  
     |  Motif_Identification(self, group, p_val_threshold=0.05, by_samples=False, top_seq=10)
     |      Motif Identification Supervised Classifiers
     |      
     |      This method looks for enriched features in the predetermined group
     |      and returns fasta files in directory to be used with "https://weblogo.berkeley.edu/logo.cgi"
     |      to produce seqlogos.
     |      
     |      Inputs
     |      ---------------------------------------
     |      group: string
     |          Class for analyzing enriched motifs.
     |      
     |      p_val_threshold: float
     |          Significance threshold for enriched features/motifs for
     |          Mann-Whitney UTest.
     |      
     |      by_samples: bool
     |          To run a motif identification that looks for enriched motifs at the sample
     |          level instead of the sequence level, set this parameter to True. Otherwise, the
     |          enrichment analysis will be done at the sequence level.
     |      
     |      top_seq: int
     |          The number of sequences from which to derive the learned motifs. The larger the number,
     |          the more noisy the motif logo may be.
     |      
     |      Returns
     |      ---------------------------------------
     |      
     |      self.(alpha/beta)_group_features: Pandas Dataframe
     |          Sequences used to determine motifs in fasta files
     |          are stored in this dataframe where column names represent
     |          the feature number.
     |  
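The enrichment test underlying this method is a Mann-Whitney U test; a toy version on made-up feature activations (the numbers are illustrative, not DeepTCR output) looks like:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy feature activations for the group of interest vs the remaining classes.
group_feature = np.array([0.9, 0.8, 0.95, 0.85, 0.7, 0.88])
rest_feature = np.array([0.1, 0.2, 0.05, 0.15, 0.3, 0.12])

# A feature is called enriched when the one-sided p-value clears the threshold.
stat, p_val = mannwhitneyu(group_feature, rest_feature, alternative='greater')
print(p_val < 0.05)  # → True
```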
     |  Structural_Diversity(self, sample=None, n_jobs=1)
     |      Structural Diversity Measurements
     |      
     |      This method first clusters sequences via the phenograph algorithm before computing
     |      the number of clusters and entropy of the data over these clusters to obtain a measurement
     |      of the structural diversity within a repertoire.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      Returns
     |      
     |      self.Structural_Diversity_DF: Pandas dataframe
     |          A dataframe containing the number of clusters and entropy in each sample
     |      
     |      ---------------------------------------
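The entropy-over-clusters idea can be illustrated directly with scipy on made-up cluster proportions (hypothetical repertoires, not DeepTCR output):

```python
import numpy as np
from scipy.stats import entropy

# Cluster proportions for two hypothetical repertoires: one spread evenly
# across clusters, one dominated by a single cluster.
diverse = np.array([0.25, 0.25, 0.25, 0.25])
clonal = np.array([0.97, 0.01, 0.01, 0.01])

# Higher entropy over cluster proportions = greater structural diversity.
print(entropy(diverse) > entropy(clonal))  # → True
```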
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from vis_class:
     |  
     |  HeatMap_Samples(self, set='all', filename='Heatmap_Samples.tif', Weight_by_Freq=True, color_dict=None, labels=True, font_scale=1.0)
     |      HeatMap of Samples
     |      
     |      This method creates a heatmap/clustermap for samples by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      Weight_by_Freq: bool
     |          Option to weight each sequence used in aggregate measure
     |          of feature across sample by its frequency.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      labels: bool
     |          Option to show names of samples on y-axis of heatmap.
     |      
     |      font_scale: float
     |          This parameter controls the font size of the row labels. If there are many rows, one can make this value
     |          smaller to get better labeling of the rows.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  HeatMap_Sequences(self, set='all', filename='Heatmap_Sequences.tif', sample_num=None, sample_num_per_class=None, color_dict=None)
     |      HeatMap of Sequences
     |      
     |      This method creates a heatmap/clustermap for sequences by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      sample_num: int
     |          Number of events to randomly sample for heatmap.
     |      
     |      sample_num_per_class: int
     |          Number of events to randomly sample per class for heatmap.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  Repertoire_Dendrogram(self, set='all', distance_metric='KL', sample=None, n_jobs=1, color_dict=None, dendrogram_radius=0.32, repertoire_radius=0.4, linkage_method='ward', gridsize=24, Load_Prev_Data=False, filename=None, sample_labels=False, gaussian_sigma=0.5, vmax=0.01, n_pad=5, lw=None, log_scale=False)
     |      Repertoire Dendrogram
     |      
     |      This method creates a visualization that shows and compares the distribution
     |      of the sample repertoires via UMAP and provided distance metric. The underlying
     |      algorithm first applies phenograph clustering to determine the proportions of the sample
     |      within a given cluster. Then a distance metric is used to compare how far two samples are
     |      based on their cluster proportions. Various metrics can be provided here such as KL-divergence,
     |      Correlation, and Euclidean.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      distance_metric: str
     |          Provided distance metric to determine repertoire-level distance from cluster proportions.
     |          Options include = (KL,correlation,euclidean,wasserstein,JS).
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      dendrogram_radius: float
     |          The radius of the dendrogram in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      repertoire_radius: float
     |          The radius of the repertoire plots in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      linkage_method: str
     |          linkage method used by scipy's linkage function
     |      
     |      gridsize: int
     |          This parameter modifies the granularity of the hexbins for the repertoire density plots.
     |      
     |      Load_Prev_Data: bool
     |          If the method has been run before, one can load the previous data used to construct
     |          the figure for faster figure creation. This is helpful when trying to format the
     |          figure correctly, which may require running the method multiple times.
     |      
     |      filename: str
     |          To save dendrogram plot to results folder, enter a name for the file and the dendrogram
     |          will be saved to the results directory.
     |          i.e. dendrogram.png
     |      
     |      sample_labels: bool
     |          To show the sample labels on the dendrogram, set to True.
     |      
     |      gaussian_sigma: float
     |          The amount of blur to introduce in the plots.
     |      
     |      vmax: float
     |          Highest color density value. Color scales from 0 to vmax (i.e. larger vmax == dimmer plot)
     |      
     |      lw: float
     |          The width of the circle edge around each sample.
     |      
     |      log_scale: bool
     |          To plot the log of the counts for the UMAP density plot, set this value to True. This can be
     |          particularly helpful for visualization if the populations are very clonal.
     |      
     |      Returns
     |      
     |      self.pairwise_distances: Pandas dataframe
     |          Pairwise distances of all samples
     |      ---------------------------------------
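The listed metrics all operate on the per-sample cluster-proportion vectors; a toy comparison of two hypothetical samples (illustrative proportions, not DeepTCR output) using scipy:

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon, euclidean

# Cluster-proportion vectors for two hypothetical samples.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# KL divergence (asymmetric), Jensen-Shannon distance, and Euclidean distance.
kl = entropy(p, q)
js = jensenshannon(p, q)
eu = euclidean(p, q)
print(kl > 0 and js > 0 and eu > 0)  # → True
```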
     |  
     |  UMAP_Plot(self, set='all', by_class=False, by_cluster=False, by_sample=False, freq_weight=False, show_legend=True, scale=100, Load_Prev_Data=False, alpha=1.0, sample=None, filename=None, prob_plot=None)
     |      UMAP visualization of TCR Sequences
     |      
     |      This method displays the sequences in a 2-dimensional UMAP plot where the user can color
     |      code points by class label, sample label, or a prior computed clustering solution. The size
     |      of points can also be made proportional to the frequency of the sequence within its sample.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      by_class: bool
     |          To color the points by their class label, set to True.
     |      
     |      by_sample: bool
     |          To color the points by their sample label, set to True.
     |      
     |      by_cluster: bool
     |          To color the points by the prior computed clustering solution, set to True.
     |      
     |      freq_weight: bool
     |          To scale size of points proportionally to their frequency, set to True.
     |      
     |      show_legend: bool
     |          To display legend, set to True.
     |      
     |      scale: float
     |          To change the size of points, adjust the scale parameter. This is particularly
     |          useful for finding a good display size when points are scaled by frequency.
     |      
     |      Load_Prev_Data: bool
     |          If method was run before, one can rerun this method with this parameter set
     |          to True to bypass recomputing the UMAP projection. Useful for generating
     |          different versions of the plot on the same UMAP representation.
     |      
     |      alpha: float
     |          Value between 0-1 that controls transparency of points.
     |      
     |      sample: int
     |          Number of events to sub-sample for visualization.
     |      
     |      filename: str
     |          To save umap plot to results folder, enter a name for the file and the umap
     |          will be saved to the results directory.
     |          i.e. umap.png
     |      
     |      prob_plot: str
     |          To plot the predicted probabilities for the sequences as an additional heatmap, specify
     |          the class probability one wants to visualize (i.e. if the class of interest is class A, input
     |          'A' as a string). Of note, only probabilities determined from the sequences in the test set are
     |          displayed, as a means of not showing over-fit probabilities. Therefore, it is best to
     |          use this parameter when the set parameter is set to 'test'.
     |      
     |      
     |      Returns
     |      
     |      ---------------------------------------
    
    class DeepTCR_base(builtins.object)
     |  Methods defined here:
     |  
     |  Get_Data(self, directory, Load_Prev_Data=False, classes=None, type_of_data_cut='Fraction_Response', data_cut=1.0, n_jobs=40, aa_column_alpha=None, aa_column_beta=None, count_column=None, sep='\t', aggregate_by_aa=True, v_alpha_column=None, j_alpha_column=None, v_beta_column=None, j_beta_column=None, d_beta_column=None, p=None, hla=None)
     |      Get Data for DeepTCR
     |      
     |      Parse Data into appropriate inputs for neural network from directories where data is stored.
     |      
     |      Inputs
     |      ---------------------------------------
     |      directory: str
     |          Path to a directory containing folders of tsv files for analysis.
     |          Folder names become labels for the files within them.
     |      
     |      Load_Prev_Data: bool
     |          Loads Previous Data.
     |      
     |      classes: list
     |          Optional selection of input of which sub-directories to use for analysis.
     |      
     |      
     |      type_of_data_cut: str
     |          Method by which one wants to sample from the TCRSeq File.
     |      
     |          Options are:
     |              Fraction_Response: A fraction (0 - 1) that samples the top fraction of the file by reads. For example,
     |              if one wants to sample the top 25% of reads, one would use this threshold with a data_cut = 0.25. The idea
     |              of this sampling is akin to sampling a fraction of cells from the file.
     |      
     |              Frequency_Cut: If one wants to select clones above a given frequency threshold, one would use this threshold.
     |              For example, if one wanted to only use clones above 1%, one would enter a data_cut value of 0.01.
     |      
     |              Num_Seq: If one wants to take the top N number of clones, one would use this threshold. For example,
     |              if one wanted to select the top 10 amino acid clones from each file, they would enter a data_cut value of 10.
     |      
     |              Read_Cut: If one wants to take amino acid clones with at least a certain number of reads, one would use
     |              this threshold. For example, if one wanted to only use clones with at least 10 reads, they would enter a data_cut value of 10.
     |      
     |              Read_Sum: If one wants to take a given number of reads from each file, one would use this threshold. For example,
     |              if one wants to use the sequences comprising the top 100 reads of the file, they would enter a data_cut value of 100.
     |      
     |      data_cut: float or int
     |          Value associated with the type_of_data_cut parameter.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallelized operations.
     |      
     |      aa_column_alpha: int
     |          Column where alpha chain amino acid data is stored. (0-indexed)
     |      
     |      aa_column_beta: int
     |          Column where beta chain amino acid data is stored.(0-indexed)
     |      
     |      count_column: int
     |          Column where counts are stored.
     |      
     |      sep: str
     |          Type of delimiter used in file with TCRSeq data.
     |      
     |      aggregate_by_aa: bool
     |          Choose to aggregate sequences by unique amino-acid. Defaults to True. If set to False, will allow duplicates
     |          of the same amino acid sequence given it comes from different nucleotide clones.
     |      
     |      v_alpha_column: int
     |          Column where v_alpha gene information is stored.
     |      
     |      j_alpha_column: int
     |          Column where j_alpha gene information is stored.
     |      
     |      v_beta_column: int
     |          Column where v_beta gene information is stored.
     |      
     |      d_beta_column: int
     |          Column where d_beta gene information is stored.
     |      
     |      j_beta_column: int
     |          Column where j_beta gene information is stored.
     |      
     |      p: multiprocessing pool object
     |          For parallelized operations, one can pass a multiprocessing pool object
     |          to this method.
     |      
     |      hla: str
     |          In order to use HLA information as part of the TCR-seq representation, one can provide
     |          a csv file where the first column is the file name and the remaining columns hold HLA alleles
     |          for each file. By including HLA information for each repertoire being analyzed, one is able to
     |          find a representation of TCR-Seq that is more meaningful across repertoires with different HLA
     |          backgrounds.
     |      
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.class_id: ndarray
     |          array with sequence class labels
     |      
     |      self.sample_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
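The expected directory layout can be sketched with the standard library; the class and file names below are hypothetical, and the two-column tsv matches aa_column_beta=0 with count_column=1:

```python
import os
import tempfile

# Sketch of the layout Get_Data expects: one sub-folder per class label,
# each holding tsv files of TCRSeq data (names here are hypothetical).
root = tempfile.mkdtemp()
for label in ('Responder', 'NonResponder'):
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, 'sample1.tsv'), 'w') as f:
        f.write('CASSLGTDTQYF\t12\n')  # beta CDR3 in column 0, count in column 1

# Get_Data(directory=root, aa_column_beta=0, count_column=1, sep='\t')
# would then use the folder names as class labels for the files within.
print(sorted(os.listdir(root)))  # → ['NonResponder', 'Responder']
```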
     |  
     |  Load_Data(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, class_labels=None, sample_labels=None, freq=None, counts=None, Y=None, p=None, hla=None, w=None)
     |      Load Data programmatically into DeepTCR.
     |      
     |      DeepTCR allows direct user input of sequence data for DeepTCR analysis. By using this method,
     |      a user can load numpy arrays with relevant TCRSeq data for analysis.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      class_labels: ndarray of strings
     |          A 1d array with class labels for the sequence (i.e. antigen-specificities)
     |      
     |      sample_labels: ndarray of strings
     |          A 1d array with sample labels for the sequence. (i.e. when loading data from different samples)
     |      
     |      counts: ndarray of ints
     |          A 1d array with the counts for each sequence, in the case they come from samples.
     |      
     |      freq: ndarray of float values
     |          A 1d array with the frequencies for each sequence, in the case they come from samples.
     |      
     |      Y: ndarray of float values
     |          In the case one wants to regress TCR sequences against a numerical label, one can provide
     |          these numerical values for this input. As of latest release, regression is only available
     |          for sequence classifier.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, this will need to be
     |          formatted as an ndarray of shape (N,) where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence, e.g.:
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to method for multiprocessing tasks.
     |      
     |      w: ndarray
     |          optional set of weights for training of autoencoder
     |      
     |      Returns
     |      
     |      self.alpha_sequences: ndarray
     |          array with alpha sequences (if provided)
     |      
     |      self.beta_sequences: ndarray
     |          array with beta sequences (if provided)
     |      
     |      self.label_id: ndarray
     |          array with sequence class labels
     |      
     |      self.file_id: ndarray
     |          array with sequence file labels
     |      
     |      self.freq: ndarray
     |          array with sequence frequencies from samples
     |      
     |      self.counts: ndarray
     |          array with sequence counts from samples
     |      
     |      self.(v/d/j)_(alpha/beta): ndarray
     |          array with sequence (v/d/j)-(alpha/beta) usage
     |      
     |      ---------------------------------------
     |  
     |  Sequence_Inference(self, alpha_sequences=None, beta_sequences=None, v_beta=None, d_beta=None, j_beta=None, v_alpha=None, j_alpha=None, p=None, hla=None, batch_size=10000)
     |      Predicting outputs of sequence models on new data
     |      
     |      This method allows a user to take a pre-trained autoencoder/sequence classifier
     |      and generate outputs from the model on new data. For the autoencoder, this returns
     |      the features from the latent space. For the sequence classifier, it is the probability
     |      of belonging to each class.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      alpha_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the alpha chain.
     |      
     |      beta_sequences: ndarray of strings
     |          A 1d array with the sequences for inference for the beta chain.
     |      
     |      v_beta: ndarray of strings
     |          A 1d array with the v-beta genes for inference.
     |      
     |      d_beta: ndarray of strings
     |          A 1d array with the d-beta genes for inference.
     |      
     |      j_beta: ndarray of strings
     |          A 1d array with the j-beta genes for inference.
     |      
     |      v_alpha: ndarray of strings
     |          A 1d array with the v-alpha genes for inference.
     |      
     |      j_alpha: ndarray of strings
     |          A 1d array with the j-alpha genes for inference.
     |      
     |      hla: ndarray of tuples
     |          To input the hla context for each sequence fed into DeepTCR, format this input as
     |          an ndarray of shape (N,) where each entry is a tuple of strings referring
     |          to the alleles seen for that sequence, e.g.
     |              ('A*01:01', 'A*11:01', 'B*35:01', 'B*35:02', 'C*04:01')
     |      
     |      p: multiprocessing pool object
     |          a pre-formed pool object can be passed to the method for multiprocessing tasks.
     |      
     |      batch_size: int
     |          Batch size for inference.
     |      
     |      Returns
     |      
     |      features: array
     |          An array of shape (n, latent_dim) containing features for all sequences
     |          (latent features for the autoencoder; class probabilities for the sequence classifier)
     |      
     |      ---------------------------------------
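     A minimal usage sketch for inference (the object name DTCR, model name, and sequences are all hypothetical; the DeepTCR calls are commented out because they require a previously trained model):

```python
import numpy as np

# Hypothetical new CDR3-beta sequences, as a 1d array of strings.
beta_sequences = np.array(['CASSLAPGATNEKLFF', 'CASSIRSSYEQYF', 'CSARDRTGNGYTF'])

# Assuming a sequence classifier was trained previously, e.g.:
#   from DeepTCR.DeepTCR import DeepTCR_SS
#   DTCR = DeepTCR_SS('my_model')   # hypothetical model name
#   ... load data and train ...
# inference on the new sequences would then be:
#   out = DTCR.Sequence_Inference(beta_sequences=beta_sequences, batch_size=10000)
# For the sequence classifier, out holds per-class probabilities;
# for the autoencoder, out holds latent features.

print(beta_sequences.shape)  # (3,)
```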
     |  
     |  __init__(self, Name, max_length=40, device='/device:GPU:0')
     |      Initialize Training Object.
     |      
     |      Initializes object and sets initial parameters.
     |      
     |      Inputs
     |      ---------------------------------------
     |      Name: str
     |          Name of the object.
     |      
     |      max_length: int
     |          maximum length of CDR3 sequence
     |      
     |      device: str
     |          In the case user is using tensorflow-gpu, one can
     |          specify the particular device to build the graphs on.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class feature_analytics_class(builtins.object)
     |  Methods defined here:
     |  
     |  Cluster(self, set='all', clustering_method='phenograph', t=None, criterion='distance', linkage_method='ward', write_to_sheets=False, sample=None, n_jobs=1, order_by_linkage=False)
     |      Clustering Sequences by Latent Features
     |      
     |      This method clusters all sequences by latent features learned by the trained
     |      model (e.g. the variational autoencoder). Several clustering algorithms are included:
     |      Phenograph, DBSCAN, and hierarchical clustering. DBSCAN is implemented from the
     |      sklearn package; hierarchical clustering is implemented from the scipy package.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      clustering_method: str
     |          Clustering algorithm to use to cluster TCR sequences. Options include
     |          phenograph, dbscan, or hierarchical. When using dbscan or hierarchical clustering
     |          and no t value is provided, a variety of thresholds are tried to find an optimum
     |          silhouette score before settling on a final clustering threshold.
     |      
     |      t: float
     |          If t is provided, this is used as a distance threshold for hierarchical clustering or the eps
     |          value for dbscan.
     |      
     |      criterion: str
     |          Clustering criterion as allowed by fcluster function
     |          in scipy.cluster.hierarchy module. (Used in hierarchical clustering).
     |      
     |      linkage_method: str
     |          method parameter for linkage as allowed by scipy.cluster.hierarchy.linkage
     |      
     |          To write clusters to separate csv files in a folder named 'Clusters' under the results folder,
     |          set to True. Additionally, if set to True, a csv file will be written to the results directory
     |          containing the frequency contribution of each cluster to each sample.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      order_by_linkage: bool
     |          To list sequences in the cluster dataframes by how they are related via Ward's linkage,
     |          set this value to True. Otherwise, each cluster dataframe will list the sequences in the
     |          order they were loaded into DeepTCR.
     |      
     |      Returns
     |      
     |      self.Cluster_DFs: list of Pandas dataframes
     |          Clusters by sequences/label
     |      
     |      self.var: list
     |          Variance of lengths in each cluster
     |      
     |      self.Cluster_Frequencies: Pandas dataframe
     |          A dataframe containing the frequency contribution of each cluster to each sample.
     |      
     |      self.Cluster_Assignments: ndarray
     |          Array with cluster assignments by number.
     |      
     |      ---------------------------------------
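     The hierarchical option is built on scipy's linkage/fcluster; a standalone sketch of that underlying step on toy features (the data are illustrative, not DeepTCR output):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy latent features: two well-separated groups of 10 "sequences" in 8 dims.
features = np.vstack([rng.normal(0.0, 0.1, (10, 8)),
                      rng.normal(5.0, 0.1, (10, 8))])

Z = linkage(features, method='ward')                 # linkage_method='ward'
labels = fcluster(Z, t=10.0, criterion='distance')   # t / criterion as in Cluster()

print(np.unique(labels).size)  # 2
```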
     |  
     |  Motif_Identification(self, group, p_val_threshold=0.05, by_samples=False, top_seq=10)
     |      Motif Identification Supervised Classifiers
     |      
     |      This method looks for enriched features in the predetermined group
     |      and returns fasta files in directory to be used with "https://weblogo.berkeley.edu/logo.cgi"
     |      to produce seqlogos.
     |      
     |      Inputs
     |      ---------------------------------------
     |      group: string
     |          Class for analyzing enriched motifs.
     |      
     |      p_val_threshold: float
     |          Significance threshold for enriched features/motifs for
     |          Mann-Whitney UTest.
     |      
     |      by_samples: bool
     |          To run a motif identification that looks for enriched motifs at the sample
     |          level instead of the sequence level, set this parameter to True. Otherwise, the
     |          enrichment analysis will be done at the sequence level.
     |      
     |      top_seq: int
     |          The number of sequences from which to derive the learned motifs. The larger the number,
     |          the noisier the motif logo may be.
     |      
     |      Returns
     |      ---------------------------------------
     |      
     |      self.(alpha/beta)_group_features: Pandas Dataframe
     |          Sequences used to determine motifs in fasta files
     |          are stored in this dataframe where column names represent
     |          the feature number.
     |  
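     The enrichment test used here is the Mann-Whitney U test; a standalone sketch of that step on toy feature values (the data are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Activation of one learned feature: sequences in the group of interest vs. the rest.
group_vals = rng.normal(1.0, 0.2, 50)   # shifted -> enriched feature
rest_vals = rng.normal(0.0, 0.2, 50)

stat, p_val = mannwhitneyu(group_vals, rest_vals, alternative='two-sided')
enriched = p_val < 0.05  # p_val_threshold
print(bool(enriched))  # True
```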
     |  Structural_Diversity(self, sample=None, n_jobs=1)
     |      Structural Diversity Measurements
     |      
     |      This method first clusters sequences via the phenograph algorithm before computing
     |      the number of clusters and entropy of the data over these clusters to obtain a measurement
     |      of the structural diversity within a repertoire.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      Returns
     |      
     |      self.Structural_Diversity_DF: Pandas dataframe
     |          A dataframe containing the number of clusters and entropy in each sample
     |      
     |      ---------------------------------------
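     The diversity measure reduces to a cluster count plus the entropy of the per-cluster proportions; a sketch of that computation (the cluster assignments are illustrative):

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical Phenograph cluster assignments for one sample's sequences.
assignments = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])

_, counts = np.unique(assignments, return_counts=True)
num_clusters = counts.size                       # number of clusters in the sample
sample_entropy = entropy(counts / counts.sum())  # higher = more structurally diverse

print(num_clusters)  # 4
```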
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class vis_class(builtins.object)
     |  Methods defined here:
     |  
     |  HeatMap_Samples(self, set='all', filename='Heatmap_Samples.tif', Weight_by_Freq=True, color_dict=None, labels=True, font_scale=1.0)
     |      HeatMap of Samples
     |      
     |      This method creates a heatmap/clustermap for samples by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      Weight_by_Freq: bool
     |          Option to weight each sequence used in aggregate measure
     |          of feature across sample by its frequency.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      labels: bool
     |          Option to show names of samples on y-axis of heatmap.
     |      
     |      font_scale: float
     |          This parameter controls the font size of the row labels. If there are many rows, one can make this value
     |          smaller to get better labeling of the rows.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  HeatMap_Sequences(self, set='all', filename='Heatmap_Sequences.tif', sample_num=None, sample_num_per_class=None, color_dict=None)
     |      HeatMap of Sequences
     |      
     |      This method creates a heatmap/clustermap for sequences by latent features
     |      for the unsupervised deep learning methods.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      filename: str
     |          Name of file to save heatmap.
     |      
     |      sample_num: int
     |          Number of events to randomly sample for heatmap.
     |      
     |      sample_num_per_class: int
     |          Number of events to randomly sample per class for heatmap.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      Returns
     |      ---------------------------------------
     |  
     |  Repertoire_Dendrogram(self, set='all', distance_metric='KL', sample=None, n_jobs=1, color_dict=None, dendrogram_radius=0.32, repertoire_radius=0.4, linkage_method='ward', gridsize=24, Load_Prev_Data=False, filename=None, sample_labels=False, gaussian_sigma=0.5, vmax=0.01, n_pad=5, lw=None, log_scale=False)
     |      Repertoire Dendrogram
     |      
     |      This method creates a visualization that shows and compares the distributions
     |      of the sample repertoires via UMAP and a provided distance metric. The underlying
     |      algorithm first applies Phenograph clustering to determine the proportion of each sample
     |      within a given cluster. A distance metric is then used to compare how far apart two samples
     |      are based on their cluster proportions. Various metrics can be provided, such as KL-divergence,
     |      correlation, and Euclidean distance.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      distance_metric: str
     |          Distance metric used to determine repertoire-level distance from cluster proportions.
     |          Options include: KL, correlation, euclidean, wasserstein, JS.
     |      
     |      sample: int
     |          For large numbers of sequences, to obtain a faster clustering solution, one can sub-sample
     |          a number of sequences and then use k-nearest neighbors to assign other sequences.
     |      
     |      n_jobs: int
     |          Number of processes to use for parallel operations.
     |      
     |      color_dict: dict
     |          Optional dictionary to provide specified colors for classes.
     |      
     |      dendrogram_radius: float
     |          The radius of the dendrogram in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      repertoire_radius: float
     |          The radius of the repertoire plots in the figure. This will usually require some adjustment
     |          given the number of samples.
     |      
     |      linkage_method: str
     |          linkage method used by scipy's linkage function
     |      
     |      gridsize: int
     |          This parameter modifies the granularity of the hexbins for the repertoire density plots.
     |      
     |      Load_Prev_Data: bool
     |          If the method has been run before, one can load the previous data used to construct the
     |          figure for faster figure creation. This is helpful when trying to format the figure
     |          correctly, which often requires running the method multiple times.
     |      
     |      filename: str
     |          To save dendrogram plot to results folder, enter a name for the file and the dendrogram
     |          will be saved to the results directory.
     |          i.e. dendrogram.png
     |      
     |      sample_labels: bool
     |          To show the sample labels on the dendrogram, set to True.
     |      
     |      gaussian_sigma: float
     |          The amount of blur to introduce in the plots.
     |      
     |      vmax: float
     |          Highest color density value. Color scales from 0 to vmax (i.e. larger vmax == dimmer plot)
     |      
     |      lw: float
     |          The width of the circle edge around each sample.
     |      
     |      log_scale: bool
     |          To plot the log of the counts for the UMAP density plot, set this value to True. This can be
     |          particularly helpful for visualization if the populations are very clonal.
     |      
     |      Returns
     |      
     |      self.pairwise_distances: Pandas dataframe
     |          Pairwise distances of all samples
     |      ---------------------------------------
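     A sketch of the cluster-proportion distance step for the 'JS' option, via scipy (the proportion vectors are illustrative):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical per-cluster proportions for two sample repertoires.
sample_a = np.array([0.50, 0.30, 0.15, 0.05])
sample_b = np.array([0.10, 0.20, 0.30, 0.40])

d = jensenshannon(sample_a, sample_b)  # 0 for identical repertoires
print(0.0 < d < 1.0)  # True
```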
     |  
     |  UMAP_Plot(self, set='all', by_class=False, by_cluster=False, by_sample=False, freq_weight=False, show_legend=True, scale=100, Load_Prev_Data=False, alpha=1.0, sample=None, filename=None, prob_plot=None)
     |      UMAP visualization of TCR Sequences
     |      
     |      This method displays the sequences in a 2-dimensional UMAP where the user can color code points
     |      by class label, sample label, or a previously computed clustering solution. The size of points
     |      can also be made proportional to the frequency of a sequence within its sample.
     |      
     |      Inputs
     |      ---------------------------------------
     |      
     |      set: str
     |          To choose which set of sequences to analyze, enter either
     |          'all', 'train', 'valid', or 'test'. Since the sequences in the train set
     |          may be overfit, it is generally preferable to examine the test set on its own.
     |      
     |      by_class: bool
     |          To color the points by their class label, set to True.
     |      
     |      by_sample: bool
     |          To color the points by their sample label, set to True.
     |      
     |      by_cluster: bool
     |          To color the points by a previously computed clustering solution, set to True.
     |      
     |      freq_weight: bool
     |          To scale size of points proportionally to their frequency, set to True.
     |      
     |      show_legend: bool
     |          To display legend, set to True.
     |      
     |      scale: float
     |          To change the size of points, adjust the scale parameter. This is particularly
     |          useful for finding a good display size when points are scaled by frequency.
     |      
     |      Load_Prev_Data: bool
     |          If the method was run before, one can rerun it with this parameter set
     |          to True to bypass recomputing the UMAP projection. This is useful for generating
     |          different versions of the plot from the same UMAP representation.
     |      
     |      alpha: float
     |          Value between 0-1 that controls transparency of points.
     |      
     |      sample: int
     |          Number of events to sub-sample for visualization.
     |      
     |      filename: str
     |          To save umap plot to results folder, enter a name for the file and the umap
     |          will be saved to the results directory.
     |          i.e. umap.png
     |      
     |      prob_plot: str
     |          To plot the predicted probabilities for the sequences as an additional heatmap, specify
     |          the class probability one wants to visualize (i.e. if the class of interest is class A, input
     |          'A' as a string). Of note, only probabilities determined from the sequences in the test set are
     |          displayed, so as not to show over-fit probabilities. Therefore, it is best to use this
     |          parameter with the set parameter set to 'test'.
     |      
     |      
     |      Returns
     |      
     |      ---------------------------------------
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)

FILE
    /home/jsidhom1/DeepTCR/DeepTCR/DeepTCR.py


