Metadata-Version: 2.1
Name: MMSA-FET
Version: 0.3.1
Summary: A Tool for extracting multimodal features from videos.
Home-page: https://github.com/FlameSky-S/MMSA_FET
Author: THUIAR
Maintainer: FlameSky
License: UNKNOWN
Project-URL: Documentation, https://github.com/FlameSky-S/MMSA-FET/wiki
Project-URL: Tracker, https://github.com/FlameSky-S/MMSA_FET/issues
Keywords: multimodal,sentiment analysis,feature extraction,mosi,mosei
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE

# MMSA-Feature Extraction Toolkit

[![](https://badgen.net/badge/license/GPL-3.0/green)](#License) 
[![](https://badgen.net/pypi/v/MMSA-FET)](https://pypi.org/project/MMSA-FET/) 
![](https://badgen.net/pypi/python/MMSA-FET/)
[![](https://badgen.net/badge/contact/THUIAR/purple)](https://thuiar.github.io/)

MMSA-Feature Extraction Toolkit extracts multimodal features for Multimodal Sentiment Analysis Datasets. It integrates several commonly used tools for visual, acoustic and text modality. The extracted features are compatible with the [MMSA](https://github.com/thuiar/MMSA) Framework and thus can be used directly. The tool can also extract features for single videos.

This work is included in the ACL-2022 DEMO paper: [M-SENA: An Integrated Platform for Multimodal Sentiment Analysis](). If you find our work useful, don't hesitate to cite our paper. Thank you!

```text

```

### Features

- Extract fully customized features for single videos or datasets. 
- Integrate some most commonly used tools, including Librosa, OpenFace, Transformers, etc. 
- Support Active Speaker Detection in case multiple faces exists in a video. 
- Easy to use, provides Python APIs and commandline tools. 
- Extracted features are compatible with [MMSA](https://github.com/thuiar/MMSA), a unified training & testing framework for Multimodal Sentiment Analysis.

## 1. Installation

MMSA-Feature Extraction Toolkit is available from PyPI. Due to package size limitation on PyPi, large model files cannot be shipped with the package. Users need to run a post install command to download these files manually. If you can't access Google Drive, please refer to [this page](https://github.com/FlameSky-S/MMSA-FET/wiki/Dependency-Installation#2-post-installation-script) for manual download. 

```bash
# Install package from PyPI
$ pip install MMSA-FET
# Download models & libraries from Google Drive. Use --proxy if needed.
$ python -m MSA_FET install
```

> **Note:** For the OpenFaceExtractor to work on Linux Platforms, a few system-wide dependancies are needed. See [Dependency Installation](https://github.com/FlameSky-S/MMSA-FET/wiki/Dependency-Installation) for more information.

## 2. Quick Start

MMSA-FET is fairly easy to use. You can either call API in python or use commandline interface. Below is a basic example using python APIs.

> **Note:** To extract features for datasets, the datasets need to be organized in a specific file structure, and a `label.csv` file is needed. See [Dataset and Structure](https://github.com/FlameSky-S/MMSA-FET/wiki/Dataset-and-Structure) for details. Raw video files and label files for MOSI, MOSEI and CH-SIMS can be downloaded from [BaiduYunDisk](https://pan.baidu.com/s/14XkD2kCGnJGMqnCLNzi6MQ?pwd=g9n5).

```python
from MSA_FET import FeatureExtractionTool

# initialize with default librosa config which only extracts audio features
fet = FeatureExtractionTool("librosa")

# alternatively initialize with a custom config file
fet = FeatureExtractionTool("custom_config.json")

# extract features for single video
feature = fet.run_single("input.mp4")
print(feature)

# extract for dataset & save features to file
feature = fet.run_dataset(dataset_dir="~/MOSI", out_file="output/feature.pkl")
```

The `custom_config.json` is the path to a custom config file, the format of which is introduced below.

For detailed usage, please read [APIs](https://github.com/FlameSky-S/MMSA-FET/wiki/APIs) and [Command Line Arguments](https://github.com/FlameSky-S/MMSA-FET/wiki/Command-Line-Arguments).

## 3. Config File

MMSA-FET comes with a few example configs which can be used like below.

```python
# Each supported tool has an example config
fet = FeatureExtractionTool(config="librosa")
fet = FeatureExtractionTool(config="opensmile")
fet = FeatureExtractionTool(config="wav2vec")
fet = FeatureExtractionTool(config="openface")
fet = FeatureExtractionTool(config="mediapipe")
fet = FeatureExtractionTool(config="bert")
fet = FeatureExtractionTool(config="roberta")
```

For customized features, you can: 

1. Edit the default configs and pass a dictionary to the config parameter like the example below:

```python
from MSA_FET import FeatureExtractionTool, get_default_config

# here we only extract audio and video features
config_a = get_default_config('opensmile')
config_v = get_default_config('openface')

# modify default config
config_a['audio']['args']['feature_level'] = 'LowLevelDescriptors'

# combine audio and video configs
config = {**config_a, **config_v}

# initialize
fet = FeatureExtractionTool(config=config)
```

2. Provide a config json file. The below example extracts features of all three modalities. To extract unimodal features, just remove unnecessary sections from the file.

```json
{
  "audio": {
    "tool": "librosa",
    "sample_rate": null,
    "args": {
      "mfcc": {
        "n_mfcc": 20,
        "htk": true
      },
      "rms": {},
      "zero_crossing_rate": {},
      "spectral_rolloff": {},
      "spectral_centroid": {}
    }
  },
  "video": {
    "tool": "openface",
    "fps": 25,
    "average_over": 3,
    "args": {
      "hogalign": false,
      "simalign": false,
      "nobadaligned": false,
      "landmark_2D": true,
      "landmark_3D": false,
      "pdmparams": false,
      "head_pose": true,
      "action_units": true,
      "gaze": true,
      "tracked": false
    }
  },
  "text": {
    "model": "bert",
    "device": "cpu",
    "pretrained": "models/bert_base_uncased",
    "args": {}
  }
}
```

## 4. Supported Tools & Features

### 4.1 Audio Tools

- **Librosa** ([link](https://librosa.org/doc/latest/index.html))

  Supports all librosa features listed [here](https://librosa.org/doc/latest/feature.html), including: [mfcc](https://librosa.org/doc/latest/generated/librosa.feature.mfcc.html#librosa.feature.mfcc), [rms](https://librosa.org/doc/latest/generated/librosa.feature.rms.html#librosa.feature.rms), [zero_crossing_rate](https://librosa.org/doc/latest/generated/librosa.feature.zero_crossing_rate.html#librosa.feature.zero_crossing_rate), [spectral_rolloff](https://librosa.org/doc/latest/generated/librosa.feature.spectral_rolloff.html#librosa.feature.spectral_rolloff), [spectral_centroid](https://librosa.org/doc/latest/generated/librosa.feature.spectral_centroid.html#librosa.feature.spectral_centroid), etc. Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#11-librosa).

- **openSMILE** ([link](https://audeering.github.io/opensmile-python/))

  Supports all feature sets listed [here](https://audeering.github.io/opensmile-python/api-smile.html#featureset), including: [ComParE_2016](http://www.tangsoo.de/documents/Publications/Schuller16-TI2.pdf), [GeMAPS](https://sail.usc.edu/publications/files/eyben-preprinttaffc-2015.pdf), [eGeMAPS](https://sail.usc.edu/publications/files/eyben-preprinttaffc-2015.pdf), emobase, etc. Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#12-opensmile).

- **Wav2vec2** ([link](https://huggingface.co/docs/transformers/model_doc/wav2vec2))

  Integrated from huggingface transformers. Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#13-wav2vec2).

### 4.2 Video Tools

- **OpenFace** ([link](https://github.com/TadasBaltrusaitis/OpenFace))

  Supports all features in OpenFace's FeatureExtraction binary, including: facial landmarks in 2D and 3D, head pose, gaze related, facial action units, HOG binary files. Details of these features can be found in the OpenFace Wiki [here](https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format) and [here](https://github.com/TadasBaltrusaitis/OpenFace/wiki/Action-Units). Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#21-openface).

- **MediaPipe** ([link](https://google.github.io/mediapipe/))

  Supports [face](https://google.github.io/mediapipe/solutions/face_mesh.html) mesh and [holistic](https://google.github.io/mediapipe/solutions/holistic)(face, hand, pose) solutions. Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#22-mediapipe).

- **TalkNet**([link](https://github.com/TaoRuijie/TalkNet_ASD))

  TalkNet is used to support Active Speaker Detection in case there are multiple human faces in the video. 

### 4.3 Text Tools

- **BERT** ([link](https://huggingface.co/docs/transformers/model_doc/bert))

  Integrated from huggingface transformers. Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#31-bert).

- **XLNet** ([link](https://huggingface.co/docs/transformers/model_doc/xlnet))

  Integrated from huggingface transformers. Detailed configurations can be found [here](https://github.com/FlameSky-S/MMSA-FET/wiki/Configurations#32-xlnet).



