Metadata-Version: 2.1
Name: TrollHunter
Version: 0.3.3
Summary: TrollHunter
Home-page: https://github.com/StanGirard/TrollHunter
License: GPL
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Requires-Dist: aiodns (==2.0.0)
Requires-Dist: aiofiles (==0.4.0)
Requires-Dist: aiohttp (==3.6.2)
Requires-Dist: aiohttp-socks (==0.3.4)
Requires-Dist: amqp (==2.5.2)
Requires-Dist: aniso8601 (==8.0.0)
Requires-Dist: async-timeout (==3.0.1)
Requires-Dist: attrs (==19.3.0)
Requires-Dist: beautifulsoup4 (==4.8.2)
Requires-Dist: billiard (==3.6.3.0)
Requires-Dist: blinker (==1.4)
Requires-Dist: cchardet (==2.1.5)
Requires-Dist: celery (==4.4.1)
Requires-Dist: certifi (==2019.11.28)
Requires-Dist: cffi (==1.14.0)
Requires-Dist: chardet (==3.0.4)
Requires-Dist: Click (==7.0)
Requires-Dist: elasticsearch (==7.5.1)
Requires-Dist: fake-useragent (==0.1.11)
Requires-Dist: Flask (==1.1.1)
Requires-Dist: Flask-Jsonpify (==1.5.0)
Requires-Dist: Flask-RESTful (==0.3.8)
Requires-Dist: Flask-SQLAlchemy (==2.4.1)
Requires-Dist: geographiclib (==1.50)
Requires-Dist: geopy (==1.21.0)
Requires-Dist: googletransx (==2.4.2)
Requires-Dist: h11 (==0.9.0)
Requires-Dist: h2 (==3.2.0)
Requires-Dist: hpack (==3.0.0)
Requires-Dist: hyperframe (==5.2.0)
Requires-Dist: idna (==2.9)
Requires-Dist: influxdb (==5.2.3)
Requires-Dist: importlib-metadata (==1.5.0)
Requires-Dist: itsdangerous (==1.1.0)
Requires-Dist: Jinja2 (==2.11.1)
Requires-Dist: kombu (==4.6.8)
Requires-Dist: MarkupSafe (==1.1.1)
Requires-Dist: multidict (==4.7.5)
Requires-Dist: numpy (==1.18.1)
Requires-Dist: pandas (==1.0.1)
Requires-Dist: priority (==1.3.0)
Requires-Dist: psycopg2-binary (==2.8.4)
Requires-Dist: pycares (==3.1.1)
Requires-Dist: pycparser (==2.19)
Requires-Dist: PySocks (==1.7.1)
Requires-Dist: python-dateutil (==2.8.1)
Requires-Dist: python-dotenv (==0.12.0)
Requires-Dist: pytz (==2019.3)
Requires-Dist: requests (==2.23.0)
Requires-Dist: schedule (==0.6.0)
Requires-Dist: six (==1.14.0)
Requires-Dist: soupsieve (==2.0)
Requires-Dist: SQLAlchemy (==1.3.13)
Requires-Dist: toml (==0.10.0)
Requires-Dist: typing-extensions (==3.7.4.1)
Requires-Dist: Unidecode (==1.1.1)
Requires-Dist: urllib3 (==1.25.8)
Requires-Dist: vine (==1.3.0)
Requires-Dist: Werkzeug (==1.0.0)
Requires-Dist: wsproto (==0.15.0)
Requires-Dist: yarl (==1.4.2)
Requires-Dist: zipp (==3.1.0)
Requires-Dist: nltk (==3.4.5)
Requires-Dist: rake-nltk (==1.0.4)

# TrollHunter

TrollHunter is a Twitter Crawler & News Website Indexer.
It aims at finding Troll Farmers & Fake News on Twitter.

It is composed of three parts:
- Twint API to extract information about a tweet or a user
- News Indexer, which indexes all the articles of a website and extracts their keywords
- Analysis of the tweets and news

## Installation

### Docker

TrollHunter requires several services to run:
- ELK (Elasticsearch, Logstash, Kibana)
- InfluxDB & Grafana
- RabbitMQ

You can either launch them individually if you already have them set up, or use our `docker-compose.yml`:

- Install Docker
- Run `docker-compose up -d`

Update the `.env` file with the required values.

You can either run
```Bash
pip3 install TrollHunter
```
or clone the project and run 
```Bash
pip3 install -r requirements.txt
```

## Twint API


## News Indexer

The second main part of the project is the news crawler and indexer.

It uses the sitemap XML files of news websites to crawl all their articles. From each sitemap file, we extract the
*sitemap* and *url* tags.

The *sitemap* tag links to a child sitemap XML file for a specific category of articles on the website.

The *url* tag represents an article of the website.
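
As an illustration of the two tags, here is a minimal sketch that parses a sitemap document with the standard library. It is not the project's own crawler code: the function names, the `news.example.com` URLs and the inlined XML are hypothetical, and only the `loc` and `lastmod` fields from the sitemaps.org protocol are read.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap documents, inlined so the example runs offline.
SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://news.example.com/sitemap-politics.xml</loc>
    <lastmod>2020-03-01</lastmod>
  </sitemap>
</sitemapindex>"""

ARTICLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://news.example.com/politics/article-1</loc>
    <lastmod>2020-02-29</lastmod>
  </url>
</urlset>"""


def _entries(root, tag):
    """Collect the <loc> and <lastmod> values of every direct child named `tag`."""
    return [
        {
            "loc": node.findtext("sm:loc", namespaces=NS),
            "lastmod": node.findtext("sm:lastmod", namespaces=NS),
        }
        for node in root.findall(f"sm:{tag}", NS)
    ]


def parse_sitemap(xml_text):
    """Return (child_sitemaps, articles) found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return _entries(root, "sitemap"), _entries(root, "url")


children, _ = parse_sitemap(SITEMAP_INDEX)
_, articles = parse_sitemap(ARTICLE_SITEMAP)
print(children)
print(articles)
```

A crawler built on this would call `parse_sitemap` again on each child `loc` and index each article `loc`.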

The root URL of each sitemap is stored in a PostgreSQL database, together with the trust level of the website (Oriented,
Verified, Fake News, ...) and its headers. The headers are the tags we want to extract from the *url* tag, which contain
details about the article (title, keywords, publication date, ...).

The headers are also the list of fields used in the Elasticsearch index pattern.

While crawling sitemaps, we insert each new child sitemap into the database with its last modification date, or update
that date for the ones already in the database. The last modification date is used to crawl only the sitemaps that have
changed since the last crawl.
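
The incremental logic can be sketched as follows. This is an assumption-laden illustration, not the project's implementation: a plain dictionary stands in for the database table, and the function name and URLs are hypothetical.

```python
from datetime import datetime, timezone


def sitemaps_to_crawl(stored, found):
    """Return the child sitemaps that are new or modified since the last crawl.

    stored: dict mapping sitemap URL -> last modification date already recorded
            (stand-in for the database table)
    found:  list of (url, lastmod) pairs read from the parent sitemap
    """
    to_crawl = []
    for url, lastmod in found:
        previous = stored.get(url)
        if previous is None or lastmod > previous:
            to_crawl.append(url)
            stored[url] = lastmod  # insert the new sitemap or update its date
    return to_crawl


stored = {
    "https://news.example.com/sitemap-politics.xml": datetime(2020, 3, 1, tzinfo=timezone.utc),
    "https://news.example.com/sitemap-sports.xml": datetime(2020, 3, 1, tzinfo=timezone.utc),
}
found = [
    # Unchanged since the last crawl: skipped.
    ("https://news.example.com/sitemap-politics.xml", datetime(2020, 3, 1, tzinfo=timezone.utc)),
    # Modified since the last crawl: crawled again.
    ("https://news.example.com/sitemap-sports.xml", datetime(2020, 3, 2, tzinfo=timezone.utc)),
    # Not in the store yet: inserted and crawled.
    ("https://news.example.com/sitemap-world.xml", datetime(2020, 3, 2, tzinfo=timezone.utc)),
]
print(sitemaps_to_crawl(stored, found))
```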

The data extracted from the *url* tags is assembled into a dataframe, then sent to Elasticsearch for later use with
the requests made through the Twint API.
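
One way to prepare such rows for indexing is to turn them into bulk actions, the dictionaries with `_index`, `_id` and `_source` keys that `elasticsearch.helpers.bulk` accepts. The sketch below only builds the actions and does not contact a cluster. The index name `news`, the field names and the choice of a hashed URL as document id are assumptions made for the example; a list of dictionaries stands in for the dataframe rows (the shape returned by `DataFrame.to_dict("records")`).

```python
import hashlib


def to_bulk_actions(records, index_name):
    """Yield one Elasticsearch bulk action per article record."""
    for record in records:
        # Assumption: hash the article URL to get a stable document id,
        # so re-indexing the same article overwrites it instead of duplicating it.
        doc_id = hashlib.sha1(record["url"].encode("utf-8")).hexdigest()
        yield {"_index": index_name, "_id": doc_id, "_source": record}


# Hypothetical rows, as they might look after extraction from the *url* tags.
records = [
    {"url": "https://news.example.com/politics/article-1", "title": "Article 1", "keywords": "election"},
    {"url": "https://news.example.com/politics/article-2", "title": "Article 2", "keywords": None},
]

actions = list(to_bulk_actions(records, "news"))
print(actions[0]["_id"], actions[0]["_source"]["title"])
```

With a configured client, the actions could then be passed as `helpers.bulk(client, actions)`.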

At the same time, some sitemaps don't provide keywords for their articles. For those, we retrieve the entries without
keywords from Elasticsearch, download the content of each article, and extract its keywords using NLP. Finally, we
update the entries in Elasticsearch.
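
The flow of this keyword update can be sketched as below. Note the swap: the package lists `nltk` and `rake-nltk` among its dependencies, but this example does not use them and substitutes plain word-frequency counting so that it runs with the standard library only. The stop-word list, the function names and the `content` field are hypothetical, and a list of dictionaries stands in for the Elasticsearch entries.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real NLP pipeline would use a fuller one.
STOPWORDS = {"the", "and", "for", "that", "with", "from", "this", "was", "are", "has"}


def extract_keywords(text, max_keywords=5):
    """Return the most frequent non-stop-words of `text` (crude stand-in for NLP)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_keywords)]


def fill_missing_keywords(entries):
    """Add keywords to the entries that lack them; return how many were updated."""
    updated = 0
    for entry in entries:
        if not entry.get("keywords"):
            # In the real flow the article content would be downloaded first.
            entry["keywords"] = extract_keywords(entry["content"])
            updated += 1
    return updated


entries = [
    {"title": "A", "keywords": ["budget"], "content": "Budget vote."},
    {"title": "B", "keywords": None,
     "content": "Election results: the election commission confirmed the results of the election."},
]
print(fill_missing_keywords(entries), entries[1]["keywords"])
```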

### Run
For the crawler/indexer:

```python
from TrollHunter.news_crawler import scheduler_news

scheduler_news(time_interval)
```

For updating keywords:
```python
from TrollHunter.news_crawler import scheduler_keywords

scheduler_keywords(time_interval, max_entry)
```

Or see the [main](https://github.com/StanGirard/TrollHunter/tree/master/docker/news_crawler) used with Docker.





