Metadata-Version: 2.1
Name: PredDNAContam
Version: 0.0.1
Summary: A Machine Learning Model to Estimate Within-Species DNA Contamination.
Author: Raziyeh Mohseni
Author-email: raziyeh.mohseni.y@gmail.com
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas==1.4.4
Requires-Dist: numpy>=1.19.5
Requires-Dist: scikit-learn==1.1.2
Requires-Dist: matplotlib==3.6.2
Requires-Dist: seaborn==0.12.2
Requires-Dist: joblib==1.4.2

# PredDNAContam

PredDNAContam is a machine learning model that estimates within-species DNA contamination from biosample data.

## Input File Format (CSV)

When using **PredDNAContam**, your input data should be in CSV format with the following columns:

| Column  | Description |
|---------|------------|
| GQ      | Genotype quality |
| DP      | Total read depth |
| AF      | Allele frequency |
| VAF     | Variant allele frequency |

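Checking a file against this column layout before running the tool can catch malformed input early. The following is a minimal sketch using only the standard library; `validate_rows` and `REQUIRED_COLUMNS` are hypothetical names for illustration and are not part of PredDNAContam. It assumes every value in the four columns is numeric.

```python
import csv
import io

# Column names taken from the table above.
REQUIRED_COLUMNS = ("GQ", "DP", "AF", "VAF")


def validate_rows(handle):
    """Return parsed rows; raise ValueError on a missing column or non-numeric value."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = []
    for line_no, record in enumerate(reader, start=2):  # header is line 1
        try:
            rows.append({c: float(record[c]) for c in REQUIRED_COLUMNS})
        except (TypeError, ValueError) as exc:
            raise ValueError(f"line {line_no}: non-numeric value") from exc
    return rows


# In-memory sample so the sketch runs without a file on disk.
sample = io.StringIO("GQ,DP,AF,VAF\n20,47,0.5,0.23\n60,25,0.5,0.24\n")
rows = validate_rows(sample)
print(len(rows), rows[0]["VAF"])
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `validate_rows(open("sample.csv", newline=""))`.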
### Example CSV File
The CSV file is generated by extracting the features listed above from a VCF (Variant Call Format) file, with one row per variant.

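This README does not specify how each feature is read from the VCF, so the following is only a rough sketch under stated assumptions: a single-sample, biallelic VCF line whose FORMAT column carries `GQ`, `DP`, and `AD`; `AF` taken from the INFO column; and `VAF` computed as alternate reads divided by reference plus alternate reads. The function name `extract_features` and the VCF line are made up for illustration, and the actual PredDNAContam pipeline may derive these values differently.

```python
def extract_features(vcf_line):
    """Return (GQ, DP, AF, VAF) for the first sample of one VCF data line.

    Assumptions (not confirmed by the PredDNAContam documentation):
    AF comes from INFO, and VAF = alt reads / (ref reads + alt reads) from AD.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    # INFO is a ';'-separated list of KEY=VALUE entries (flags without '=' are skipped).
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    # FORMAT keys (column 9) line up with the sample values (column 10).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    ref_reads, alt_reads = (int(x) for x in sample["AD"].split(",")[:2])
    total = ref_reads + alt_reads
    vaf = alt_reads / total if total else 0.0
    af = float(info["AF"].split(",")[0])
    return int(sample["GQ"]), int(sample["DP"]), af, round(vaf, 2)


# Made-up VCF line for illustration only.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tAC=1;AF=0.5;AN=2\tGT:AD:DP:GQ\t0/1:36,11:47:20"
print(extract_features(line))
```

For production use, a dedicated VCF parser is likely a safer choice than splitting lines by hand, since real files contain multi-allelic sites and missing values that this sketch does not handle.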
Here’s an example of how the CSV should look after extracting these features:

```csv
GQ,DP,AF,VAF
20,47,0.5,0.23
60,25,0.5,0.24
23,55,0.5,0.78
...
```

### Example config.txt File

Before running PredDNAContam, you need to configure the paths in the `config.txt` file. This file contains the input and output directories and the model and scaler filenames, which should be set as follows:

```
input_dir=/path/to/csv_files
output_dir=/path/to/output_directory/output_PredDNAcontam
model_filename=/path/to/PredDNAContam_model/Random_Forest_Contamination_Model.joblib
scaler_filename=/path/to/PredDNAContam_model_scaler/scaler.joblib
```

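The `key=value` layout of `config.txt` is simple enough to check by hand before a run. The sketch below shows one way to parse it and confirm that all four keys are present. `parse_config` and `EXPECTED_KEYS` are hypothetical names, not part of PredDNAContam, and skipping `#` comment lines is an assumption rather than documented behavior.

```python
from pathlib import Path

# Keys shown in the example config.txt above.
EXPECTED_KEYS = ("input_dir", "output_dir", "model_filename", "scaler_filename")


def parse_config(text):
    """Parse key=value lines into a dict, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # comment handling is an assumption
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        settings[key.strip()] = value.strip()
    missing = [k for k in EXPECTED_KEYS if k not in settings]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    return settings


example = """\
input_dir=/path/to/csv_files
output_dir=/path/to/output_directory/output_PredDNAcontam
model_filename=/path/to/PredDNAContam_model/Random_Forest_Contamination_Model.joblib
scaler_filename=/path/to/PredDNAContam_model_scaler/scaler.joblib
"""
config = parse_config(example)
print(Path(config["model_filename"]).name)
```

To check a real file, pass its contents instead of the inline string, for example `parse_config(Path("config.txt").read_text())`.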

## Installation

You can install this package using:

```bash
pip install PredDNAContam
```
