ProxyBroker

[Finder | Checker | Server]


ProxyBroker is an open source tool that asynchronously finds public proxies from multiple sources and concurrently checks them.


Features

  • Finds more than 7000 working proxies from ~50 sources.
  • Supported protocols: HTTP(S), SOCKS4/5. Also the CONNECT method to ports 80 and 25 (SMTP).
  • Proxies may be filtered by type, anonymity level, response time, country, and DNSBL status.
  • Works as a proxy server that distributes incoming requests to external proxies, with automatic proxy rotation.
  • All proxies are checked to support Cookies and Referer (and POST requests if required).
  • Automatically removes duplicate proxies.
  • Is asynchronous.

Requirements

  • Python 3.5 or higher
  • aiohttp
  • aiodns
  • maxminddb

Installation

To install the latest stable release from PyPI:

$ pip install proxybroker

The latest development version can be installed directly from GitHub:

$ pip install -U git+https://github.com/constverum/ProxyBroker.git

Usage

CLI Examples

Find

Find and show 10 HTTP(S) proxies from the United States with a high level of anonymity:

$ proxybroker find --types HTTP HTTPS --lvl High --countries US --strict -l 10

Grab

Find 10 US proxies and save them to a file (without checking):

$ proxybroker grab --countries US --limit 10 --outfile ./proxies.txt

Serve

Run a local proxy server that distributes incoming requests to a pool of found HTTP(S) proxies with a high level of anonymity:

$ proxybroker serve --host 127.0.0.1 --port 8888 --types HTTP HTTPS --lvl High

Note

Run proxybroker --help for more information on the options available.

Run proxybroker <command> --help for more information on a command.

Basic code example

Find and show 10 working HTTP(S) proxies:

import asyncio
from proxybroker import Broker

async def show(proxies):
    while True:
        proxy = await proxies.get()
        if proxy is None: break
        print('Found proxy: %s' % proxy)

proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(
    broker.find(types=['HTTP', 'HTTPS'], limit=10),
    show(proxies))

loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)

More examples.

TODO

  • Check the ping, response time and speed of data transfer
  • Check site access (Google, Twitter, etc) and even your own custom URLs
  • Information about uptime
  • Checksum of data returned
  • Support for proxy authentication
  • Finding the outgoing IP for cascading proxies
  • The ability to specify a proxy address without a port (try to connect on default ports)

Contributing

  • Fork it: https://github.com/constverum/ProxyBroker/fork
  • Create your feature branch: git checkout -b my-new-feature
  • Commit your changes: git commit -am 'Add some feature'
  • Push to the branch: git push origin my-new-feature
  • Submit a pull request!

License

Licensed under the Apache License, Version 2.0

This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.


API Reference

Broker

class proxybroker.api.Broker(queue=None, timeout=8, max_conn=200, max_tries=3, judges=None, providers=None, verify_ssl=False, loop=None, **kwargs)[source]

The Broker.

One broker to rule them all, one broker to find them,
One broker to bring them all and in the darkness bind them.
Parameters:
  • queue (asyncio.Queue) – (optional) Queue of found/checked proxies
  • timeout (int) – (optional) Timeout of a request in seconds
  • max_conn (int) – (optional) The maximum number of concurrent checks of proxies
  • max_tries (int) – (optional) The maximum number of attempts to check a proxy
  • judges (list) – (optional) URLs of pages that show HTTP headers and the IP address, or Judge objects
  • providers (list) – (optional) URLs of pages where proxies can be found, or Provider objects
  • verify_ssl (bool) – (optional) Flag indicating whether to verify SSL certificates. Set to True to enable verification
  • loop – (optional) asyncio compatible event loop

Deprecated since version 0.2.0: Use max_conn and max_tries instead of max_concurrent_conn and attempts_conn.
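
A minimal sketch of constructing a Broker with custom judges and providers; the judge and provider URLs below are the ones used in the Examples section and are only illustrative:

import asyncio
from proxybroker import Broker

proxies = asyncio.Queue()
broker = Broker(
    proxies,
    timeout=8,        # request timeout in seconds
    max_conn=200,     # maximum number of concurrent proxy checks
    max_tries=3,      # attempts per proxy
    judges=['http://httpbin.org/get?show_env',
            'https://httpbin.org/get?show_env'],
    providers=['http://www.proxylists.net/'],
    verify_ssl=False)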

find(*, types=None, data=None, countries=None, post=False, strict=False, dnsbl=None, limit=0, **kwargs)[source]

Gather and check proxies from providers or from passed data.

Example of usage.

Parameters:
  • types (list) – Types (protocols) that the proxies must be checked to support. Supported types: HTTP, HTTPS, SOCKS4, SOCKS5, CONNECT:80, CONNECT:25. Anonymity levels (HTTP only): Transparent, Anonymous, High
  • data – (optional) String or list of proxies. Can also be a file-like object that supports the read() method. Used instead of providers
  • countries (list) – (optional) List of ISO country codes in which the proxies should be located
  • post (bool) – (optional) Flag indicating whether to use POST instead of GET requests when checking proxies
  • strict (bool) – (optional) Flag indicating that the types (protocols) and anonymity levels supported by a proxy must exactly match the requested types and levels. By default, strict mode is off and satisfying any one of the requested types is enough for a successful check
  • dnsbl (list) – (optional) Spam databases (DNSBL) against which proxies are checked. Wiki
  • limit (int) – (optional) The maximum number of proxies
Raises:

ValueError – If types is not given.

Changed in version 0.2.0: Added: post, strict, dnsbl. Changed: types is required.
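
A minimal sketch of a stricter search that combines these parameters; it reuses the queue/consumer pattern from the basic example above, and the DNSBL list is only illustrative:

import asyncio
from proxybroker import Broker

async def show(proxies):
    while True:
        proxy = await proxies.get()
        if proxy is None:
            break
        print('Found proxy: %s' % proxy)

proxies = asyncio.Queue()
broker = Broker(proxies)

# HTTP proxies with Anonymous or High anonymity plus HTTPS proxies, located
# in the US or Germany. With strict=True a proxy must match the requested
# types and levels; proxies listed in the given DNSBLs are discarded.
tasks = asyncio.gather(
    broker.find(types=[('HTTP', ('Anonymous', 'High')), 'HTTPS'],
                countries=['US', 'DE'], strict=True,
                dnsbl=['bl.spamcop.net', 'zen.spamhaus.org'], limit=10),
    show(proxies))

loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)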

grab(*, countries=None, limit=0)[source]

Gather proxies from the providers without checking.

Parameters:
  • countries (list) – (optional) List of ISO country codes in which the proxies should be located
  • limit (int) – (optional) The maximum number of proxies

Example of usage.

serve(host='127.0.0.1', port=8888, limit=100, **kwargs)[source]

Start a local proxy server.

The server distributes incoming requests to a pool of found proxies.

When the server receives an incoming request, it chooses the optimal proxy (based on the percentage of errors and the average response time) and passes the incoming request to it.

In addition to the parameters listed below, it also accepts all the parameters of the find() method and passes them on to gather proxies for the pool.

Example of usage.

Parameters:
  • host (str) – (optional) Host of local proxy server
  • port (int) – (optional) Port of local proxy server
  • limit (int) – (optional) When the requested number of working proxies has been found, checking of new proxies is lazily paused. Checking resumes if all the found proxies are discarded while working with them (see max_error_rate, max_resp_time) and continues until at least one working proxy is found, then pauses again. The default value is 100
  • max_tries (int) – (optional) The maximum number of attempts to handle an incoming request. If not specified, the value given when the Broker object was created is used. Attempts can be made with different proxies. The default value is 3
  • min_req_proxy (int) – (optional) The minimum number of processed requests needed to estimate the quality of a proxy (in accordance with max_error_rate and max_resp_time). The default value is 5
  • max_error_rate (float) – (optional) The maximum fraction of requests that may end with an error. For example: 0.5 = 50%. If proxy.error_rate exceeds this value, the proxy is removed from the pool. The default value is 0.5
  • max_resp_time (int) – (optional) The maximum response time in seconds. If proxy.avg_resp_time exceeds this value, the proxy is removed from the pool. The default value is 8
  • prefer_connect (bool) – (optional) Flag indicating whether to use the CONNECT method if possible. For example, if set to True and a proxy supports the HTTP protocol (GET or POST requests) as well as the CONNECT method, the server will first try the CONNECT method and only then send the original request. The default value is False
  • http_allowed_codes (list) – (optional) Acceptable HTTP status codes returned by a proxy. If a proxy returns a code that is not in this list, it is treated as a proxy error rather than a wrong/unavailable address. For example, a 404 Not Found response from a proxy is counted as an error of that proxy. Only the HTTP protocol is checked; HTTPS is not supported at the moment. By default the list is empty and the response code is not verified
  • backlog (int) – (optional) The maximum number of queued connections passed to listen. The default value is 100
Raises:

ValueError – If limit is less than or equal to zero, because in that case the parsing of providers would never end

New in version 0.2.0.
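
A minimal sketch of starting the server and keeping it running. It assumes that, as in the full example in the Examples section, serve() only schedules its tasks on the event loop, so the loop itself must be kept running (run_forever() here, run_until_complete() in the full example):

import asyncio
from proxybroker import Broker

loop = asyncio.get_event_loop()
broker = Broker(loop=loop)

# Serve high-anonymity HTTP and HTTPS proxies on 127.0.0.1:8888. A proxy is
# dropped from the pool when half of its requests fail (max_error_rate=0.5)
# or its average response time exceeds 4 seconds (max_resp_time=4).
broker.serve(host='127.0.0.1', port=8888,
             types=[('HTTP', 'High'), 'HTTPS'], limit=10,
             min_req_proxy=5, max_error_rate=0.5, max_resp_time=4)

try:
    # Point any HTTP client at http://127.0.0.1:8888 while the loop is running.
    loop.run_forever()
finally:
    broker.stop()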

show_stats(verbose=False, **kwargs)[source]

Show statistics on the found proxies.

Useful for debugging, but you can also use it if you're interested.

Parameters:verbose – Flag indicating whether to print verbose stats

Deprecated since version 0.2.0: Use verbose instead of full.

stop()[source]

Stop all tasks, and the local proxy server if it’s running.

Proxy

class proxybroker.proxy.Proxy(host=None, port=None, types=(), timeout=8, verify_ssl=False)[source]

Proxy.

Parameters:
  • host (str) – IP address of the proxy
  • port (int) – Port of the proxy
  • types (tuple) – (optional) List of types (protocols) that the proxy may support and that can be checked against it
  • timeout (int) – (optional) Timeout in seconds for connecting and receiving a response
  • verify_ssl (bool) – (optional) Flag indicating whether to verify SSL certificates. Set to True to enable verification
Raises:

ValueError – If the host is not an IP address, or if the port > 65535

classmethod create(host, *args, **kwargs)[source]

Asynchronously create a Proxy object.

Parameters:
  • host (str) – The host may be a domain name or an IP address. If the host is a domain, it will be resolved to an IP address
  • *args (str) – (optional) Positional arguments that Proxy takes
  • **kwargs (str) – (optional) Keyword arguments that Proxy takes
Returns:

Proxy object

Return type:

proxybroker.Proxy

Raises:
  • ResolveError – If the host could not be resolved
  • ValueError – If the port > 65535
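
A minimal sketch of creating a Proxy object from a domain name; the host and port are only illustrative, and it assumes Proxy is importable from the package top level (as the proxybroker.Proxy return type above suggests):

import asyncio
from proxybroker import Proxy

async def main():
    # The domain name is resolved to an IP address before the object is built.
    proxy = await Proxy.create('example.com', 8080)
    print(proxy)

asyncio.get_event_loop().run_until_complete(main())
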
get_log()[source]

Proxy log.

Returns:The proxy log in the format: (negotiator, msg, runtime)
Return type:tuple

New in version 0.2.0.

avg_resp_time

The average connection/response time.

Return type:float
error_rate

Error rate: from 0 to 1.

For example: 0.7 = 70% of requests ended with an error.

Return type:float

New in version 0.2.0.

geo

Geo information about the IP address of the proxy.

Returns:
Named tuple with fields:
  • code - ISO country code
  • name - Full name of country
Return type:collections.namedtuple

Changed in version 0.2.0: In previous versions a dictionary was returned; now a named tuple is returned.

is_working

True if the proxy is working, False otherwise.

Return type:bool
types

Types (protocols) supported by the proxy.

Keys are types, values are anonymity levels (HTTP only; for other types the level is always None).
Available types: HTTP, HTTPS, SOCKS4, SOCKS5, CONNECT:80, CONNECT:25
Available levels: Transparent, Anonymous, High.
Return type:dict
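
A minimal sketch showing how these attributes can be read from proxies taken off the queue; it can replace the show() coroutine from the basic example:

async def show(proxies):
    while True:
        proxy = await proxies.get()
        if proxy is None:
            break
        # types is a dict, e.g. {'HTTP': 'High', 'CONNECT:80': None}
        print('%s:%d  country=%s  types=%s  avg_resp_time=%.2fs  error_rate=%.2f'
              % (proxy.host, proxy.port, proxy.geo.code, proxy.types,
                 proxy.avg_resp_time, proxy.error_rate))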

Provider

class proxybroker.providers.Provider(url=None, proto=(), max_conn=4, max_tries=3, timeout=20, loop=None)[source]

Proxy provider.

A provider is a website that publishes free public proxy lists.

Parameters:
  • url (str) – URL of the page where proxies can be found
  • proto (tuple) – (optional) List of types (protocols) that proxies returned by the provider may support. Later used as Proxy.types
  • max_conn (int) – (optional) The maximum number of concurrent connections to the provider
  • max_tries (int) – (optional) The maximum number of attempts to receive a response
  • timeout (int) – (optional) Timeout of a request in seconds
get_proxies()[source]

Receive proxies from the provider and return them.

Returns:proxies
proxies

Return all found proxies.

Returns:Set of tuples with the proxy host, port and types (protocols) that may be supported (from proto).
For example:
{('192.168.0.1', '80', ('HTTP', 'HTTPS')), ...}
Return type:set
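
A minimal sketch of using a Provider directly and of passing Provider objects (rather than plain URL strings) to a Broker. It assumes get_proxies() is a coroutine, in line with the rest of the asynchronous API, and the provider URL is taken from the Examples section:

import asyncio
from proxybroker import Broker, Provider

provider = Provider(url='http://www.proxylists.net/',
                    proto=('HTTP', 'HTTPS'), max_conn=4)

async def main():
    # Raw, unchecked (host, port, types) tuples published by this provider.
    found = await provider.get_proxies()
    print('%d proxies found' % len(found))

asyncio.get_event_loop().run_until_complete(main())

# Provider objects can also be passed to a Broker instead of URL strings:
broker = Broker(asyncio.Queue(), providers=[provider])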

Examples

All examples are also available on GitHub.

"""Find and show 10 working HTTP(S) proxies."""

import asyncio
from proxybroker import Broker

async def show(proxies):
    while True:
        proxy = await proxies.get()
        if proxy is None: break
        print('Found proxy: %s' % proxy)

proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(
    broker.find(types=['HTTP', 'HTTPS'], limit=10),
    show(proxies))

loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)

Download this example.


"""Find 10 working HTTP(S) proxies and save them to a file."""

import asyncio
from proxybroker import Broker


async def save(proxies, filename):
    """Save proxies to a file."""
    with open(filename, 'w') as f:
        while True:
            proxy = await proxies.get()
            if proxy is None:
                break
            proto = 'https' if 'HTTPS' in proxy.types else 'http'
            row = '%s://%s:%d\n' % (proto, proxy.host, proxy.port)
            f.write(row)


def main():
    proxies = asyncio.Queue()
    broker = Broker(proxies)
    tasks = asyncio.gather(broker.find(types=['HTTP', 'HTTPS'], limit=10),
                           save(proxies, filename='proxies.txt'))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(tasks)


if __name__ == '__main__':
    main()

Download this example.


"""Find working proxies and use them concurrently.

Note: Consider using Broker.serve() instead of the code listed below.
      It may well be more useful and friendlier.
"""

import asyncio
from urllib.parse import urlparse

import aiohttp

from proxybroker import Broker, ProxyPool
from proxybroker.errors import NoProxyError


async def get_pages(urls, proxy_pool, timeout=10, loop=None):
    tasks = [fetch(url, proxy_pool, timeout, loop) for url in urls]
    for task in asyncio.as_completed(tasks):
        url, content = await task
        print('Done! url: %s; content: %.30s' % (url, content))


async def fetch(url, proxy_pool, timeout, loop):
    resp, proxy = None, None
    try:
        proxy = await proxy_pool.get(scheme=urlparse(url).scheme)
        proxy_url = 'http://%s:%d' % (proxy.host, proxy.port)
        with aiohttp.Timeout(timeout, loop=loop):
            async with aiohttp.ClientSession(loop=loop) as session:
                async with session.get(url, proxy=proxy_url) as response:
                    resp = await response.read()
    except (aiohttp.errors.ClientOSError, aiohttp.errors.ClientResponseError,
            aiohttp.errors.ServerDisconnectedError, asyncio.TimeoutError,
            NoProxyError) as e:
        print('Error. url: %s; error: %r' % (url, e))
    finally:
        if proxy:
            proxy_pool.put(proxy)
        return (url, resp)


def main():
    loop = asyncio.get_event_loop()

    proxies = asyncio.Queue(loop=loop)
    proxy_pool = ProxyPool(proxies)

    judges = ['http://httpbin.org/get?show_env',
              'https://httpbin.org/get?show_env']
    providers = ['http://www.proxylists.net/', 'http://fineproxy.org/eng/fresh-proxies/']

    broker = Broker(
        proxies, timeout=8, max_conn=200, max_tries=3, verify_ssl=False,
        judges=judges, providers=providers, loop=loop)

    types = [('HTTP', ('Anonymous', 'High')), ]
    countries = ['US', 'DE', 'FR']

    urls = ['http://httpbin.org/get', 'http://httpbin.org/redirect/1',
            'http://httpbin.org/anything', 'http://httpbin.org/status/404']

    tasks = asyncio.gather(
        broker.find(types=types, countries=countries, strict=True, limit=10),
        get_pages(urls, proxy_pool, loop=loop))
    loop.run_until_complete(tasks)

    # broker.show_stats(verbose=True)


if __name__ == '__main__':
    main()

Download this example.


"""Find 10 working proxies supporting CONNECT method
   to 25 port (SMTP) and save them to a file."""

import asyncio
from proxybroker import Broker


async def save(proxies, filename):
    """Save proxies to a file."""
    with open(filename, 'w') as f:
        while True:
            proxy = await proxies.get()
            if proxy is None:
                break
            f.write('smtp://%s:%d\n' % (proxy.host, proxy.port))


def main():
    proxies = asyncio.Queue()
    broker = Broker(proxies, judges=['smtp://smtp.gmail.com'], max_tries=1)

    # Check proxies against spam databases (DNSBL). Disabled by default.
    # More databases: http://www.dnsbl.info/dnsbl-database-check.php
    dnsbl = ['bl.spamcop.net', 'cbl.abuseat.org', 'dnsbl.sorbs.net',
             'zen.spamhaus.org', 'bl.mcafee.com', 'spam.spamrats.com']

    tasks = asyncio.gather(
        broker.find(types=['CONNECT:25'], dnsbl=dnsbl, limit=10),
        save(proxies, filename='proxies.txt'))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(tasks)

if __name__ == '__main__':
    main()

Download this example.


"""Gather proxies from the providers without
   checking and save them to a file."""

import asyncio
from proxybroker import Broker


async def save(proxies, filename):
    """Save proxies to a file."""
    with open(filename, 'w') as f:
        while True:
            proxy = await proxies.get()
            if proxy is None:
                break
            f.write('%s:%d\n' % (proxy.host, proxy.port))


def main():
    proxies = asyncio.Queue()
    broker = Broker(proxies)
    tasks = asyncio.gather(broker.grab(countries=['US', 'GB'], limit=10),
                           save(proxies, filename='proxies.txt'))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(tasks)


if __name__ == '__main__':
    main()

Download this example.


"""Run a local proxy server that distributes
   incoming requests to external proxies."""

import asyncio
import aiohttp

from proxybroker import Broker


async def get_pages(urls, proxy_url):
    tasks = [fetch(url, proxy_url) for url in urls]
    for task in asyncio.as_completed(tasks):
        url, content = await task
        print('Done! url: %s; content: %.100s' % (url, content))


async def fetch(url, proxy_url):
    resp = None
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, proxy=proxy_url) as response:
                resp = await response.read()
    except (aiohttp.errors.ClientOSError, aiohttp.errors.ClientResponseError,
            aiohttp.errors.ServerDisconnectedError) as e:
        print('Error. url: %s; error: %r' % (url, e))
    finally:
        return (url, resp)


def main():
    host, port = '127.0.0.1', 8888  # by default

    loop = asyncio.get_event_loop()

    types = [('HTTP', 'High'), 'HTTPS', 'CONNECT:80']
    codes = [200, 301, 302]

    broker = Broker(max_tries=1, loop=loop)

    # Broker.serve() also supports all arguments that are accepted
    # by the Broker.find() method: data, countries, post, strict, dnsbl.
    broker.serve(host=host, port=port, types=types, limit=10, max_tries=3,
                 prefer_connect=True, min_req_proxy=5, max_error_rate=0.5,
                 max_resp_time=8, http_allowed_codes=codes, backlog=100)

    urls = ['http://httpbin.org/get', 'https://httpbin.org/get',
            'http://httpbin.org/redirect/1', 'http://httpbin.org/status/404']

    proxy_url = 'http://%s:%d' % (host, port)
    loop.run_until_complete(get_pages(urls, proxy_url))

    broker.stop()


if __name__ == '__main__':
    main()

Download this example.

Change Log

0.2.1 (Unreleased)

  • Added the --format flag, which indicates the format in which the results will be presented.
  • Improved the --outfile flag behavior. Previously it was necessary to wait until the whole process had finished; now results are available in real time #39

0.2.0 (2017-09-17)

  • Added CLI interface
  • Added Broker.serve() function. Now ProxyBroker can work as a proxy server that distributes incoming requests to a pool of found proxies
  • To available types (protocols) added:
    • CONNECT:80 - CONNECT method to port 80
    • CONNECT:25 - CONNECT method to port 25 (SMTP)
  • Added new options for checking and filtering proxies. The Broker.find() method now takes new parameters: post, strict, dnsbl. See the documentation for more information
  • Added checking of proxies for Cookies and Referer support
  • Added gzip and deflate support
  • Broker attributes max_concurrent_conn and attempts_conn are deprecated, use max_conn and max_tries instead.
  • Parameter full in Broker.show_stats() is deprecated, use verbose instead
  • The types parameter in Broker.find() (and Broker.serve()) is now required
  • ProxyChecker renamed to Checker. ProxyChecker class is deprecated, use Checker instead
  • Proxy.avgRespTime renamed to Proxy.avg_resp_time. Proxy.avgRespTime is deprecated, use Proxy.avg_resp_time instead
  • Improved documentation
  • Major refactoring

0.1.4 (2016-04-07)

  • Fixed a bug that occurred when proxy finding was launched a second time #7

0.1.3 (2016-03-26)

  • ProxyProvider renamed to Provider. ProxyProvider class is deprecated, use Provider instead.
  • Broker now accepts lists of providers and judges not only as strings but also as Provider and Judge objects
  • Fixed bug with signal handler on Windows #4

0.1.2 (2016-02-27)

  • Fixed bug with SIGINT on Linux
  • Fixed a bug with clearing the proxy check queue

0.1 (2016-02-23)

  • Updated existing providers and added a few new ones
  • A few minor fixes

0.1b4 (2016-01-21)

  • Added a few tests
  • Updated documentation

0.1b3 (2016-01-16)

  • A few minor fixes

0.1b2 (2016-01-10)

  • A few minor fixes

0.1b1 (2015-12-29)

  • Renamed PyProxyChecker to ProxyBroker in connection with the expansion of the concept.
  • Added support for multiple proxy providers.
  • Initial public release on PyPI
  • Many improvements and bug fixes

0.1a2 (2015-11-24)

  • Added support for multiple proxy judges.

0.1a1 (2015-11-11)

  • Initial commit with proxy checking functionality
