Metadata-Version: 2.1
Name: CoMBCR
Version: 0.1.3
Summary: A python lib for CoMBCR
Home-page: https://github.com/deepomicslab/CoMBCR.git/
Author: Yiping Zou
Author-email: yipingzou2-c@my.cityu.edu.hk
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE.md

# CoMBCR
## Introduction
CoMBCR is a B-cell embedding method that integrates multi-modal B-cell data, specifically B-cell receptor (BCR) sequences and gene expression, within a co-learning framework. Given paired BCR sequences and gene expression profiles as input, CoMBCR combines the two modalities to produce a joint representation for each B cell, focusing on the heavy chain of the BCR.
## Prerequisites
CoMBCR is implemented in Python and requires a GPU for acceleration.

We recommend the following package versions:  
- PyTorch (2.4.1)  
- Transformers (4.41.2)  
- NumPy (1.26.4)  
- Pandas (2.2.3)  
- Scikit-learn (1.5.1)  
- huggingface_hub, installed with `python3 -m pip install huggingface_hub`

## Installation
Install CoMBCR using pip:

```
pip3 install CoMBCR
```
Then install the default pre-trained encoder (this code only needs to be run once, after installing CoMBCR):
```
from CoMBCR.utils import download_BCRencoder
download_BCRencoder()
```
## Tutorial
We provide a tutorial for the usage of CoMBCR.
## Usage
> ### Prepare input data
> CoMBCR integrates BCRs and gene expressions but requires three files: a BCR sequences file, a gene expression file, and a file containing BCR embeddings generated by a BCR encoder (e.g., AntiBERTa, ESM2).  
> - Ensure each file includes an index column labeled "barcode," serving as a unique identifier for each cell.   
> - Verify that the cells are aligned in the same order across all three files.
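
The alignment requirement above can be checked before training. The following is a minimal sketch using pandas; the helper name `check_barcode_alignment` and the toy tables are illustrative and not part of CoMBCR. With real data, each table would be loaded with something like `pd.read_csv(path, index_col="barcode")`.

```python
import pandas as pd


def check_barcode_alignment(*frames):
    """Return True if all tables share one unique barcode index in the same order.

    Hypothetical helper, not part of the CoMBCR package.
    """
    reference = frames[0].index
    if not reference.is_unique:
        return False
    # Index.equals compares both the elements and their order.
    return all(frame.index.equals(reference) for frame in frames[1:])


# Toy stand-ins for the three input files (values are made up).
bcr = pd.DataFrame({"cdr3": ["ARDY", "ARGG"]},
                   index=pd.Index(["cell1", "cell2"], name="barcode"))
rna = pd.DataFrame({"geneA": [0.1, 1.2]},
                   index=pd.Index(["cell1", "cell2"], name="barcode"))
emb_shuffled = pd.DataFrame({"dim0": [0.5, -0.3]},
                            index=pd.Index(["cell2", "cell1"], name="barcode"))

print(check_barcode_alignment(bcr, rna))                # True
print(check_barcode_alignment(bcr, rna, emb_shuffled))  # False
```

If the check fails only because of ordering, reindexing the other tables with `frame.loc[bcr.index]` is one way to restore a common order.
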
>> #### BCR sequences file
>> This CSV file should include an index column named "barcode" and columns labeled "cdr1", "cdr2", "fwr2", "cdr3", "fwr3", and "fwr4". The file should resemble the example shown below: ![](images/BCRs.png)
>> #### Gene expression file
>> Normalization and log-transformation are recommended. Batch effect removal is advisable if applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your criteria.
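
As a rough illustration of the preprocessing suggested above, the sketch below normalizes each cell to a fixed total, applies a log transform, and keeps the most variable genes. It assumes a cells-by-genes table of raw counts and ranks genes by plain variance. The function name, the target sum of 10,000, and the variance ranking are assumptions for illustration, not the authors' pipeline; dedicated single-cell toolkits offer more refined highly-variable-gene selection.

```python
import numpy as np
import pandas as pd


def preprocess_expression(counts, n_top_genes=5000, target_sum=1e4):
    """Normalize, log-transform, and keep the most variable genes.

    Hypothetical helper; `counts` is a cells x genes DataFrame of raw counts.
    """
    totals = counts.sum(axis=1).replace(0, 1)   # avoid division by zero for empty cells
    normalized = counts.div(totals, axis=0) * target_sum
    logged = np.log1p(normalized)
    variances = logged.var(axis=0)
    n_keep = min(n_top_genes, logged.shape[1])
    top_genes = variances.nlargest(n_keep).index
    return logged.loc[:, top_genes]


# Toy counts: g2 is zero everywhere, so it has no variance.
counts = pd.DataFrame(
    [[10, 0, 5], [0, 0, 5], [5, 0, 5]],
    index=pd.Index(["cell1", "cell2", "cell3"], name="barcode"),
    columns=["g1", "g2", "g3"],
)
processed = preprocess_expression(counts, n_top_genes=2)
print(processed.shape)          # (3, 2)
print(sorted(processed.columns))  # ['g1', 'g3']
```
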
>> #### Original BCR embeddings file
>> This file is used to measure the original distances between BCRs. We recommend using our default pre-trained encoder, though any encoder can be used to encode BCRs. 
>> ```
>> from CoMBCR.runberta import generate_original_bcremb
>> berta_emb = generate_original_bcremb("exampledata/example_bcr.csv",
>>                                      outdir = "example_outdir")
>> ```
>> The code generates an original BCR embedding file named "antiberta_embedding.csv" under the outdir.
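
To give a sense of what "original distances between BCRs" can look like, here is a minimal NumPy sketch that computes pairwise distances between embedding vectors. The Euclidean metric and the helper name are assumptions made for illustration; this README does not state which distance CoMBCR uses internally.

```python
import numpy as np


def pairwise_euclidean(embeddings):
    """Return the (n, n) matrix of Euclidean distances between rows.

    Hypothetical helper, not part of the CoMBCR package.
    """
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


# Toy 2-dimensional embeddings for three BCRs (values are made up).
toy_emb = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = pairwise_euclidean(toy_emb)
print(dist)
```
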
> ### Quick run
>> To quickly run CoMBCR, use the following code:  
>> ```
>> from CoMBCR.CoMBCR import CoMBCR_main
>> bcremb, gexemb = CoMBCR_main(bcrpath="exampledata/example_bcr.csv", 
>>            rnapath="exampledata/example_rna.csv", 
>>            bcroriginal="exampledata/example_bcrori.csv", 
>>            outdir="example_outdir",
>>            epochs=1,
>>            batch_size=32,
>>            encoderprofile_in_dim=5000)
>> ```
>> This code returns NumPy arrays for the BCR embeddings and the gene expression embeddings, and writes "bcrembedding.csv" and "gexembedding.csv" to the specified output directory.
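
Because `encoderprofile_in_dim` has to match the number of input genes, it can be convenient to read that number from the gene expression file rather than hard-coding it. The sketch below assumes every column other than "barcode" is a gene; the helper name `count_input_genes` and the toy CSV are illustrative.

```python
import io

import pandas as pd


def count_input_genes(csv_source):
    """Count gene columns in a cells x genes CSV indexed by "barcode".

    Hypothetical helper; assumes all non-index columns are genes.
    """
    rna = pd.read_csv(csv_source, index_col="barcode")
    return rna.shape[1]


# Toy in-memory CSV standing in for a gene expression file.
toy_csv = io.StringIO(
    "barcode,geneA,geneB,geneC\n"
    "cell1,0.0,1.2,0.3\n"
    "cell2,0.5,0.0,2.1\n"
)
n_genes = count_input_genes(toy_csv)
print(n_genes)  # 3
```

With a real file, the result would be passed as `encoderprofile_in_dim=n_genes` in the `CoMBCR_main` call shown above.
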
> ### Parameters of CoMBCR
>> | Parameter | Description |
>> | ------------- | ------------- |
>> | **bcrpath** | (Required) The path to the BCR sequences file.|
>> | **rnapath** | (Required) The path to the gene expression file.|
>> | **bcroriginal**| (Required) The path to the BCR original embedding file.|
>> |**outdir**|(Required) The directory where the best checkpoint file and the output embeddings will be stored.|
>> |**checkpoint**|Default is "best_network.pth". This parameter specifies the name of the file where the best model checkpoint will be saved.|
>> |**lr**|Default is 1e-5.|
>> |**lam**| Default is 1e-1, the inner parameter (Parameter alpha in the paper).|
>> |**batch_size** | Default is 256.|
>> |**epochs** | Default is 200.|
>> |**patience**| Default is 15, the patience for early stopping.|
>> |**lr_step** | Default is [30,100]. These are the milestones for the MultiStepLR setting, which adjusts the learning rate at specified epochs.|
>> |**encoderprofile_in_dim**| Default is 5000. Adjust this parameter if the number of input genes differs from 5000.|
>> |**separatebatch**| Default is False. If set to True, BCRs from different samples are treated as distinct BCRs. Enabling this option requires a "sample" column in the BCR input file.|
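
To illustrate how the `lr` and `lr_step` defaults interact, the sketch below reproduces the MultiStepLR rule in plain Python: the learning rate is multiplied by a decay factor each time a milestone epoch is reached. The decay factor `gamma=0.1` is PyTorch's default for `MultiStepLR` and is an assumption here, since this README does not state the value CoMBCR uses.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=1e-5, milestones=(30, 100), gamma=0.1):
    """Learning rate at a given epoch under a MultiStepLR-style schedule.

    Illustrative only; gamma=0.1 is an assumed decay factor.
    """
    # Number of milestones already reached at this epoch.
    n_decays = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** n_decays


for epoch in (0, 29, 30, 99, 100, 199):
    print(epoch, multistep_lr(epoch))
```

Under these assumptions the rate stays at the base value until epoch 29, drops by a factor of ten at epoch 30, and drops by another factor of ten at epoch 100.
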

## Acknowledgements
The code was based in part on the source code of [UniTCR](https://github.com/bm2-lab/UniTCR/tree/main).
## Questions
If you encounter issues installing or using CoMBCR, please feel free to open an issue or contact me via [email](mailto:yipingzou2-c@my.cityu.edu.hk).

