Metadata-Version: 2.1
Name: WebWeaver
Version: 1.2
Summary: A package used for web crawling
Author: Shubakar Poda & Nirmal Babu
Author-email: redblack09062024@gmail.com
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: requests==2.32.3



# WebWeaver

WebWeaver is a Python package for crawling web pages and extracting the URLs they contain. It provides a simple interface for crawling a single page or an entire site, and it tracks incomplete URLs and request errors separately from successfully crawled URLs. All crawling functionality is encapsulated within the `WebWeaver` class.

## Features
- **`crawl_url(url)`**: Given a URL, returns a list of all URLs found on that page.
- **`crawl_site(urls, limit, timeout)`**: Crawls starting from a list of URLs, with a limit on the number of pages crawled and a timeout for each page load. Returns a `UrlList` object that holds the successfully crawled URLs, the incomplete URLs, and the URLs that caused errors.

## Installation

Install the package using `pip`:

```bash
pip install WebWeaver
```

## Usage

### `WebWeaver` Class

The `WebWeaver` class provides methods for URL extraction and site crawling.

#### `crawl_url(url)`
Extracts all URLs found on a given web page.

**Parameters**:
- `url (str)`: The URL of the page you want to crawl.

**Returns**:
- `list`: A list of URLs found on the page.

**Example**:
```python
from WebWeaver import webWeaver

# Instantiate the WebWeaver class
weaver = webWeaver.WebWeaver()

# Crawl a single URL
urls = weaver.crawl_url("https://example.com")
print(urls)
```
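
The list returned by `crawl_url` can be post-processed with the standard library. The sketch below keeps only links on the same host as the starting page and drops repeats. The helper name `same_host_links` and the sample data are hypothetical and not part of WebWeaver. Whether `crawl_url` already deduplicates its output is not stated here, so the helper does not assume it.

```python
from urllib.parse import urlsplit


def same_host_links(links, base_url):
    """Return the unique links that share base_url's host, in original order.

    Hypothetical helper, not part of WebWeaver.
    """
    base_host = urlsplit(base_url).netloc.lower()
    seen = set()
    kept = []
    for link in links:
        # Links without a host (relative paths, mailto:, ...) never match.
        if urlsplit(link).netloc.lower() != base_host:
            continue
        if link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


# Sample data standing in for the output of weaver.crawl_url("https://example.com")
sample_links = [
    "https://example.com/about",
    "https://other.org/page",
    "https://example.com/about",
    "/contact",
]
print(same_host_links(sample_links, "https://example.com"))
```

The host comparison is case-insensitive because host names are, while the rest of each URL is left untouched.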

#### `crawl_site(urls, limit, timeout)`
Crawls multiple web pages and returns a `UrlList` object that sorts URLs into three sets: successfully crawled URLs, incomplete URLs, and URLs that caused errors.

**Parameters**:
- `urls (list)`: A list of URLs to start crawling from.
- `limit (int)`: The maximum number of pages to crawl.
- `timeout (int)`: The time limit (in seconds) for each page to load.

**Returns**:
- `UrlList`: An object containing three sets:
  - `urls`: A set of all successfully crawled and retrieved URLs.
  - `abnormal_urls`: A set of incomplete or malformed URLs extracted from the web pages.
  - `error_urls`: A set of URLs that caused errors when trying to make a request.

**Example**:
```python
from WebWeaver import webWeaver

# Instantiate the WebWeaver class
weaver = webWeaver.WebWeaver()

# Crawl multiple URLs
urls_to_crawl = ["https://example.com", "https://anotherexample.com"]
result = weaver.crawl_site(urls_to_crawl, limit=10, timeout=5)

# Accessing the sets from the result
print("Crawled URLs:", result.urls)
print("Abnormal URLs:", result.abnormal_urls)
print("Error URLs:", result.error_urls)
```
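
Because the result exposes three plain sets, a quick summary of a crawl takes only a few lines. The sketch below is a minimal illustration. `summarize` is a hypothetical helper, and `fake_result` is a stand-in object with the same three attribute names as `UrlList`, used so the snippet runs without network access.

```python
from types import SimpleNamespace


def summarize(result):
    """Count the entries in each of the three sets on a UrlList-like object.

    Hypothetical helper, not part of WebWeaver.
    """
    return {
        "crawled": len(result.urls),
        "abnormal": len(result.abnormal_urls),
        "errors": len(result.error_urls),
    }


# Stand-in for the object returned by crawl_site (not a real crawl result)
fake_result = SimpleNamespace(
    urls={"https://example.com", "https://example.com/about"},
    abnormal_urls={"/contact"},
    error_urls=set(),
)
print(summarize(fake_result))
```

With a real crawl, pass the object returned by `weaver.crawl_site(...)` in place of `fake_result`.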

### `UrlList` Class
The `crawl_site` method returns an object of the `UrlList` class, which contains the following sets:

- `urls (set)`: A set of all successfully crawled URLs.
- `abnormal_urls (set)`: A set of incomplete or malformed URLs found during the crawl.
- `error_urls (set)`: A set of URLs that caused errors when attempting to access them.
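
If the entries in `abnormal_urls` are relative links such as `/about` (an assumption, since the exact contents depend on the pages crawled), they can often be turned into absolute URLs with `urllib.parse.urljoin`. The sketch below shows one way to do that. `resolve_abnormal` is a hypothetical helper and not part of WebWeaver, and the base URL must be the page the link was found on, which the caller has to supply.

```python
from urllib.parse import urljoin, urlsplit


def resolve_abnormal(abnormal_urls, base_url):
    """Resolve incomplete URLs against base_url, keeping only http(s) results.

    Hypothetical helper, not part of WebWeaver.
    """
    resolved = set()
    for raw in abnormal_urls:
        candidate = urljoin(base_url, raw)
        parts = urlsplit(candidate)
        # Drop anything that is still not a fetchable web URL (mailto:, javascript:, ...).
        if parts.scheme in ("http", "https") and parts.netloc:
            resolved.add(candidate)
    return resolved


# Sample data standing in for result.abnormal_urls
sample_abnormal = {"/about", "contact.html", "mailto:someone@example.com"}
print(resolve_abnormal(sample_abnormal, "https://example.com/docs/"))
```

The resolved URLs could then be passed back to `crawl_site` as a new starting list.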

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE.txt) file for details.

## Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request.

---

Happy crawling!
