Metadata-Version: 2.1
Name: blackbirdCoOp
Version: 0.1.6
Summary: A Stealth-based pipeline that optimizes inserts for cyanobacterial transformations in non-model organisms.
Home-page: https://gitlab.igem.org/2024/software-tools/ucsc
License: MIT
Author: Vibhitha Nandakumar
Author-email: vinandak@ucsc.edu
Requires-Python: >=3.10,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Project-URL: Repository, https://gitlab.igem.org/2024/software-tools/ucsc
Description-Content-Type: text/markdown

# BLACKBIRDCoOp

BLACKBIRD, or BlackbirdCoOp, is the software element of the 2024 UCSC iGEM team, LIFT. It is a software package that optimizes insert sequences for non-model cyanobacteria.

## Description

BLACKBIRD is built on Stealth, a bioinformatics tool developed by our PI, David L. Bernick, at UCSC. Stealth identifies and reports statistically underrepresented k-mer motifs within a genome in order to identify potential restriction enzyme cut sites within an insert's coding region. For software-related information about Stealth, please refer to this [repository](https://git.ucsc.edu/dbernick/stealth).

BLACKBIRD is a versatile program that uses information from a host organism, the gene insert itself, and the genome of a target organism to produce an optimized insert that avoids restriction-modification (RM) cut sites.

*This is an alpha (pre-release) version. The planned beta aims to produce an optimized insert sequence that eliminates even more Stealth hits. Any BLACKBIRD results that have been produced and integrated into the project are a result of version 0 of BLACKBIRD.*

For more information about the project and its goals, please refer to our [team wiki](https://2024.igem.wiki/ucsc/software).

## Installation

#### Requirements
**For Unix/macOS:**

First, check that your system can run Python and the pip installer. Python packages that are not already on your system need to be retrieved by an installer such as pip. Use the following commands to check:
```bash
usr:~$ python --version
Python 3.x.x
usr:~$ python -m pip --version
pip X.Y.Z from /<path>/<to>/<your>/pip (python 3.x.x)  
```

If you receive an error...
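
The package metadata for this release declares support for Python 3.10 through 3.12 (`>=3.10,<3.13`). As a minimal sketch, the check below compares the running interpreter against that range. The function name `python_is_supported` is our own and is not part of BLACKBIRD.

```python
import sys

# Bounds taken from the package metadata: Requires-Python >=3.10,<3.13
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 13)


def python_is_supported(version=None):
    """Return True if (major, minor) falls inside the declared range."""
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_is_supported() else "not supported"
    print(f"Python {found} is {status} by this release")
```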



## Usage

#### BLACKBIRD CLI
Once installed, the main function can be run with the command `blackbird`:
```bash
# usage
blackbird --insert (-n) <insert infile> --stealth (-s) <stealth infile> --hostT (-ht) <host codon usage infile> --target (-t) <target genome infile> --outfile (-o) [outfile | default: stdout]
```

The `blackbird` command takes four required arguments:

* `--insert (-n)`: the insert sequence of interest in Fasta format (.fa/.fasta)
* `--stealth (-s)`: the list of k-mers output by Stealth, in a text file (.txt/.stealth)
* `--hostT (-ht)`: the host organism's codon usage table in TSV format (.tsv)
* `--target (-t)`: the target organism's complete genome in Fasta format (.fa/.fasta)

The optional `--outfile (-o)` argument names the output file. By default, the result is written to stdout.
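
As an illustration, the sketch below assembles a complete `blackbird` command line from the long-form flags listed above. All file names are hypothetical placeholders, so substitute your own paths.

```python
import shlex

# Hypothetical input and output file names; substitute your own paths.
command = [
    "blackbird",
    "--insert", "insert.fa",
    "--stealth", "kmers.stealth",
    "--hostT", "host_codon_usage.tsv",
    "--target", "target_genome.fa",
    "--outfile", "optimized_insert.fa",
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))

# To run it from Python once blackbird is installed, uncomment:
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the arguments in a list avoids quoting problems if a path contains spaces.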

An example of an insert sequence in Fasta format is as follows:
```bash
>pET28:EGFP CDS
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA
```
The same format applies to all Fasta input files.
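
For readers who want to inspect a Fasta file before passing it to `blackbird`, here is a minimal reader sketch. It is not BLACKBIRD's own parser: `read_fasta` is a hypothetical helper, and the example record is the EGFP header from above with only the first 42 bases of the sequence, wrapped over two lines for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of Fasta lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Truncated example record (first 42 bases only, split across two lines).
example = """>pET28:EGFP CDS
ATGGTGAGCAAGGGCGAGGAG
CTGTTCACCGGGGTGGTGCCC
"""
records = list(read_fasta(example.splitlines()))
for name, seq in records:
    print(name, len(seq))
```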

An example of the stealth input file is as follows:
```bash
N = 3081514
CGCG	[100]	RC Palindrome
GCGC	[98]	RC Palindrome
GGCC	[100]	RC Palindrome
AATAG	[92]	
AATCG	[100] ...

GAAGAC
GTCTTC
GGTCTC
GAGACC
```
The sequences starting on the second line are the under-represented k-mers generated by Stealth. By default, the k-mers are 4-8 nucleotides long. 'RC Palindrome' marks under-represented sequences that are reverse-complement palindromes. The number in brackets is usually higher than the threshold/cut-off value set by the user. (Note: the version of Stealth that handles bootstrapping will not be added to this repository at this time.)

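The domestication protocol is also very simple: add the known internal Type IIS restriction enzyme sites for Golden Gate Assembly at the end of the stealth file. In the example above, these are the four bare sequences at the end (GAAGAC and GGTCTC, each followed by its reverse complement).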
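
The stealth file layout described above can be read with a few lines of Python. This is a sketch under our own assumptions rather than BLACKBIRD's parser: it assumes each entry line starts with a k-mer made of A, C, G and T, optionally followed by a bracketed number and a note, and that any other line (such as the `N = ...` header) can be skipped. The names `parse_stealth` and `reverse_complement` are hypothetical.

```python
import re

# A k-mer, then an optional bracketed number, then an optional note.
LINE_PATTERN = re.compile(r"^([ACGT]+)(?:\s+\[(\d+)\])?(?:\s+(.*))?$")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def parse_stealth(lines):
    """Return (kmer, score, note) tuples; score is None when absent."""
    entries = []
    for raw in lines:
        match = LINE_PATTERN.match(raw.strip())
        if not match:
            continue  # skips blank lines and the "N = ..." header
        kmer, score, note = match.groups()
        entries.append((kmer, int(score) if score else None, note or ""))
    return entries


# Shortened example: two Stealth entries plus one appended Type IIS pair.
example = """N = 3081514
CGCG\t[100]\tRC Palindrome
AATAG\t[92]\t

GAAGAC
GTCTTC
"""
entries = parse_stealth(example.splitlines())
for kmer, score, note in entries:
    is_palindrome = kmer == reverse_complement(kmer)
    print(kmer, score, is_palindrome)
```

The appended domestication sites carry no bracketed number, so their score is reported as `None`.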

An example of an organism's codon usage table in a TSV file is as follows:
```bash
TTT	22.31	-2414783
TTC	16.54	-1789835
TTA	13.76	-1489606
TTG	13.65	-1477363 ...
```
In this version, BLACKBIRD treats the second value on each line as the 'thousandths' value, i.e. the relative codon bias of the indicated codon.
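
To make the column layout concrete, the sketch below loads such a table into a dictionary keyed by codon, keeping only the second column. It is an illustration with a hypothetical helper name (`parse_codon_table`), not BLACKBIRD's own loader, and it splits on any whitespace rather than strictly on tabs.

```python
def parse_codon_table(lines):
    """Map each codon to the second column of a codon usage table."""
    usage = {}
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        usage[fields[0].upper()] = float(fields[1])
    return usage


# The four rows shown in the example table above.
example = """TTT\t22.31\t-2414783
TTC\t16.54\t-1789835
TTA\t13.76\t-1489606
TTG\t13.65\t-1477363
"""
usage = parse_codon_table(example.splitlines())

# TTT and TTC both encode phenylalanine; pick the more frequent one.
preferred = max(("TTT", "TTC"), key=usage.get)
print(preferred, usage[preferred])
```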

The output is written in Fasta format. For example:
```bash
>pET28:EGFP CDS output [8]
ATGTCAATATATCAA...
```
The number in brackets is the number of stealth hits remaining in the output insert sequence.
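
One way to picture a stealth hit count is to count every occurrence of each listed k-mer in the sequence. The sketch below does that, including overlapping occurrences. This is an assumption on our part: the exact counting rule BLACKBIRD uses (for example, how overlaps or the opposite strand are treated) is not specified here, and `count_hits` is a hypothetical name.

```python
def count_hits(sequence, kmers):
    """Count every (possibly overlapping) occurrence of each k-mer."""
    sequence = sequence.upper()
    total = 0
    for kmer in kmers:
        if not kmer:
            continue  # an empty pattern would match everywhere
        start = sequence.find(kmer)
        while start != -1:
            total += 1
            # Advance by one base so overlapping matches are also found.
            start = sequence.find(kmer, start + 1)
    return total


# Toy sequence: CGCG occurs once, GCGC twice (overlapping), GGCC never.
hits = count_hits("ATGGCGCGCC", ["CGCG", "GCGC", "GGCC"])
print(hits)  # prints 3
```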

### Alpha version - Tentative results:

The team is continuing to improve the algorithm in order to bring the number of stealth hits down to a minimum (ideally 0). The gene blocks that the team has already been working with in the lab, in an attempt to demonstrate improved transformation efficiency, correspond to the most optimized results of the previous version.

To demonstrate the previous version, we have included documentation of the BLACKBIRD results for PCC 11901, one of our many target organisms at the time (these results were already used to order gene blocks). This includes a complete genome file (as provided by [NCBI](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2579791)), a GFP insert file (provided by [Addgene](https://www.addgene.org/browse/sequence/392768/)), an example of an open source [codon usage table of E. coli](https://dnahive.fda.gov/dna.cgi?cmd=codon_usage&id=537&mode=cocoputs) (a recombinant host organism that is widely used to produce GFP), and a Stealth file containing a list of "stealth sites", or under-represented sites, generated by our PI David L. Bernick's software.

The documentation also includes an example of a simple Fasta output file containing the most optimized GFP sequence for the strain PCC 11901.

## Contributing

LIFT, the 2024 UCSC iGEM team, welcomes any and all contributions.

This software is published under the MIT license. Feel free to use any and all code provided by the project in any way and for any purpose.


## Authors and acknowledgment
BLACKBIRD was written and contributed to by 
* Vibhitha Nandakumar (email: vinandak@ucsc.edu)
* Aurko Mahesh (email: amahesh@ucsc.edu)

Special thanks to
* David L. Bernick (email: dbernick@soe.ucsc.edu), our PI, for allowing the further application of Stealth and for all the support and contributions throughout.
* Robin Rounthwaite (email: rrounthw@ucsc.edu) for consultation on software architecture and Git repository management
* TABI 2023 UCSC iGEM team ([gitlab](https://gitlab.igem.org/2023/software-tools/ucsc)) for support regarding Git repository management and project packaging
* Reto Stamm (email: rstamm@ucsc.edu | [github](https://github.com/retospect)) for guidance in developing and publishing a package to the Python Package Index

