ProxyBroker¶
[Finder | Checker | Server]
ProxyBroker is an open source tool that asynchronously finds public proxies from multiple sources and concurrently checks them.
Features¶
- Finds more than 7000 working proxies from ~50 sources.
- Supported protocols: HTTP(S), SOCKS4/5. Also the CONNECT method to ports 80 and 25 (SMTP).
- Proxies may be filtered by type, anonymity level, response time, country and status in DNSBL.
- Works as a proxy server that distributes incoming requests to external proxies, with automatic proxy rotation.
- All proxies are checked to support Cookies and Referer (and POST requests if required).
- Automatically removes duplicate proxies.
- Is asynchronous.
Installation¶
To install the latest stable release from PyPI:
$ pip install proxybroker
The latest development version can be installed directly from GitHub:
$ pip install -U git+https://github.com/constverum/ProxyBroker.git
Usage¶
CLI Examples¶
Find¶
Find and show 10 HTTP(S) proxies from the United States with a high level of anonymity:
$ proxybroker find --types HTTP HTTPS --lvl High --countries US --strict -l 10
Grab¶
Find 10 US proxies and save them to a file (without checking them):
$ proxybroker grab --countries US --limit 10 --outfile ./proxies.txt
Serve¶
Run a local proxy server that distributes incoming requests to a pool of found HTTP(S) proxies with a high level of anonymity:
$ proxybroker serve --host 127.0.0.1 --port 8888 --types HTTP HTTPS --lvl High
Note
Run proxybroker --help for more information on the available options.
Run proxybroker <command> --help for more information on a command.
Basic code example¶
Find and show 10 working HTTP(S) proxies:
import asyncio
from proxybroker import Broker

async def show(proxies):
    while True:
        proxy = await proxies.get()
        if proxy is None:
            break
        print('Found proxy: %s' % proxy)

proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(
    broker.find(types=['HTTP', 'HTTPS'], limit=10),
    show(proxies))
loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)
TODO¶
- Check the ping, response time and speed of data transfer
- Check site access (Google, Twitter, etc.) and even your own custom URLs
- Information about uptime
- Checksum of data returned
- Support for proxy authentication
- Finding outgoing IP for cascading proxy
- The ability to specify the address of a proxy without a port (try to connect on default ports)
Contributing¶
- Fork it: https://github.com/constverum/ProxyBroker/fork
- Create your feature branch: git checkout -b my-new-feature
- Commit your changes: git commit -am 'Add some feature'
- Push to the branch: git push origin my-new-feature
- Submit a pull request!
License¶
Licensed under the Apache License, Version 2.0
This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.
API Reference¶
Broker¶
class proxybroker.api.Broker(queue=None, timeout=8, max_conn=200, max_tries=3, judges=None, providers=None, verify_ssl=False, loop=None, **kwargs)[source]¶

The Broker.

One broker to rule them all, one broker to find them,
One broker to bring them all and in the darkness bind them.

Parameters:
- queue (asyncio.Queue) – (optional) Queue of found/checked proxies
- timeout (int) – (optional) Timeout of a request in seconds
- max_conn (int) – (optional) The maximum number of concurrent proxy checks
- max_tries (int) – (optional) The maximum number of attempts to check a proxy
- judges (list) – (optional) URLs of pages that show HTTP headers and IP address, or Judge objects
- providers (list) – (optional) URLs of pages where proxies are found, or Provider objects
- verify_ssl (bool) – (optional) Flag indicating whether to verify SSL certificates; set to True to check them
- loop – (optional) asyncio-compatible event loop

Deprecated since version 0.2.0: Use max_conn and max_tries instead of max_concurrent_conn and attempts_conn.
find(*, types=None, data=None, countries=None, post=False, strict=False, dnsbl=None, limit=0, **kwargs)[source]¶

Gather and check proxies from providers or from passed data.

Parameters:
- types (list) – Types (protocols) the proxies must be checked to support. Supported: HTTP, HTTPS, SOCKS4, SOCKS5, CONNECT:80, CONNECT:25. Anonymity levels (HTTP only): Transparent, Anonymous, High
- data – (optional) String or list of proxies. Can also be a file-like object that supports the read() method. Used instead of providers
- countries (list) – (optional) List of ISO country codes where the proxies should be located
- post (bool) – (optional) Flag indicating whether to use POST instead of GET for requests when checking proxies
- strict (bool) – (optional) Flag indicating that the types (protocols) and anonymity levels supported by a proxy must exactly match the requested ones. By default, strict mode is off and satisfying any one of the requested types is enough for a successful check
- dnsbl (list) – (optional) Spam databases for proxy checking. Wiki
- limit (int) – (optional) The maximum number of proxies

Raises: ValueError – If types is not given.

Changed in version 0.2.0: Added: post, strict, dnsbl. Changed: types is now required.
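The difference between strict and non-strict matching can be pictured with a small pure-Python sketch. This is only an illustration of the rule described above, not ProxyBroker's actual implementation:

```python
def matches(requested, supported, strict=False):
    """Return True if a proxy's supported types satisfy the request.

    requested and supported are collections of type names, e.g.
    {'HTTP', 'HTTPS'}. In strict mode every requested type must be
    supported; otherwise a single overlap is enough.
    """
    requested, supported = set(requested), set(supported)
    if strict:
        return requested <= supported
    return bool(requested & supported)

# Non-strict: one match suffices
print(matches({'HTTP', 'HTTPS'}, {'HTTP'}))               # True
# Strict: all requested types must be supported
print(matches({'HTTP', 'HTTPS'}, {'HTTP'}, strict=True))  # False
```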
grab(*, countries=None, limit=0)[source]¶

Gather proxies from the providers without checking them.

Parameters:
- countries (list) – (optional) List of ISO country codes where the proxies should be located
- limit (int) – (optional) The maximum number of proxies
serve(host='127.0.0.1', port=8888, limit=100, **kwargs)[source]¶

Start a local proxy server.

The server distributes incoming requests to a pool of found proxies. When the server receives an incoming request, it chooses the optimal proxy (based on the percentage of errors and the average response time) and passes the incoming request to it.

In addition to the parameters listed below, the method also accepts all the parameters of the find() method and passes them on to gather proxies for the pool.

Parameters:
- host (str) – (optional) Host of the local proxy server
- port (int) – (optional) Port of the local proxy server
- limit (int) – (optional) When the requested number of working proxies has been found, checking of new proxies is lazily paused. Checking resumes if all the found proxies are discarded while working with them (see max_error_rate, max_resp_time), and continues until one working proxy is found, then pauses again. The default value is 100
- max_tries (int) – (optional) The maximum number of attempts to handle an incoming request. If not specified, the value specified when the Broker object was created is used. Attempts can be made with different proxies. The default value is 3
- min_req_proxy (int) – (optional) The minimum number of processed requests needed to estimate the quality of a proxy (in accordance with max_error_rate and max_resp_time). The default value is 5
- max_error_rate (int) – (optional) The maximum percentage of requests that ended with an error. For example: 0.5 = 50%. If proxy.error_rate exceeds this value, the proxy is removed from the pool. The default value is 0.5
- max_resp_time (int) – (optional) The maximum response time in seconds. If proxy.avg_resp_time exceeds this value, the proxy is removed from the pool. The default value is 8
- prefer_connect (bool) – (optional) Flag indicating whether to use the CONNECT method if possible. For example: if set to True and a proxy supports the HTTP protocol (GET or POST requests) as well as the CONNECT method, the server will try the CONNECT method first and only then send the original request. The default value is False
- http_allowed_codes (list) – (optional) Acceptable HTTP codes returned by a proxy on requests. If a proxy returns a code not included in this list, it is considered a proxy error, not a wrong/unavailable address. For example, a 404 Not Found response from a proxy is considered a proxy error. Checked only for the HTTP protocol; HTTPS is not supported at the moment. By default the list is empty and the response code is not verified
- backlog (int) – (optional) The maximum number of queued connections passed to listen. The default value is 100

Raises: ValueError – If limit is less than or equal to zero, because parsing of the providers would then be endless.

New in version 0.2.0.
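One plausible way to picture the "optimal proxy" choice described above is to rank candidates by (error_rate, avg_resp_time), lower being better on both axes. This is a hedged sketch of the idea, not the real internals of ProxyBroker's proxy pool:

```python
def pick_proxy(pool):
    """Pick the proxy with the lowest (error_rate, avg_resp_time) score.

    pool is a list of (name, error_rate, avg_resp_time) tuples.
    Illustrative only; ProxyBroker's actual selection logic may differ.
    """
    return min(pool, key=lambda p: (p[1], p[2]))[0]

pool = [
    ('10.0.0.1:8080', 0.10, 2.5),  # few errors, slower
    ('10.0.0.2:3128', 0.10, 0.9),  # few errors, fast
    ('10.0.0.3:1080', 0.60, 0.3),  # fast but unreliable
]
print(pick_proxy(pool))  # 10.0.0.2:3128
```

Reliability is compared first, so the fast-but-unreliable proxy loses to the dependable ones.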
Proxy¶
class proxybroker.proxy.Proxy(host=None, port=None, types=(), timeout=8, verify_ssl=False)[source]¶

Proxy.

Parameters:
- host (str) – IP address of the proxy
- port (int) – Port of the proxy
- types (tuple) – (optional) List of types (protocols) that may be supported by the proxy and that can be checked against it
- timeout (int) – (optional) Timeout of a connection and of receiving a response, in seconds
- verify_ssl (bool) – (optional) Flag indicating whether to verify SSL certificates; set to True to check them

Raises: ValueError – If the host is not an IP address, or if the port > 65535
classmethod create(host, *args, **kwargs)[source]¶

Asynchronously create a Proxy object.

Parameters:
- host (str) – The passed host can be a domain or an IP address. If the host is a domain, it will be resolved
- *args – (optional) Positional arguments that Proxy takes
- **kwargs – (optional) Keyword arguments that Proxy takes

Returns: Proxy object
Return type: proxybroker.Proxy
Raises:
- ResolveError – If the host could not be resolved
- ValueError – If the port > 65535
get_log()[source]¶

Proxy log.

Returns: The proxy log in the format (negotiator, msg, runtime)
Return type: tuple
New in version 0.2.0.
avg_resp_time¶

The average connection/response time.

Return type: float
error_rate¶

Error rate: from 0 to 1.

For example: 0.7 means 70% of requests ended with an error.

Return type: float
New in version 0.2.0.
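As a quick illustration (not the library's internal code), an error rate of this kind can be computed from a per-request success log:

```python
def error_rate(outcomes):
    """Fraction of requests that ended with an error.

    outcomes is a list of booleans, True for a failed request.
    Illustrative sketch, not ProxyBroker's actual bookkeeping.
    """
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# 7 of 10 requests failed -> 0.7 (70%)
print(error_rate([True] * 7 + [False] * 3))  # 0.7
```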
geo¶

Geo information about the IP address of the proxy.

Returns: Named tuple with fields:
- code – ISO country code
- name – Full name of the country

Return type: collections.namedtuple
Changed in version 0.2.0: In previous versions a dictionary was returned; now a named tuple.
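A stand-in for such a named tuple can be built with collections.namedtuple. The type name GeoData here is hypothetical; only the code and name fields come from the documentation above:

```python
from collections import namedtuple

# Hypothetical stand-in for the named tuple returned by Proxy.geo
GeoData = namedtuple('GeoData', ['code', 'name'])

geo = GeoData(code='US', name='United States')
print(geo.code, geo.name)  # US United States
```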
is_working¶

True if the proxy is working, False otherwise.

Return type: bool
types¶

Types (protocols) supported by the proxy.

A dict where the key is the type and the value is the level of anonymity (only for HTTP; for other types the level is always None).
Available types: HTTP, HTTPS, SOCKS4, SOCKS5, CONNECT:80, CONNECT:25.
Available levels: Transparent, Anonymous, High.

Return type: dict
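For illustration, here is a plausible value of this dict and a simple filter on it. The sample data is made up, not taken from a real check:

```python
# A plausible Proxy.types mapping as described above: keys are protocols,
# values are anonymity levels (None for non-HTTP types).
types = {'HTTP': 'High', 'HTTPS': None, 'CONNECT:80': None}

# Keep only proxies offering highly anonymous HTTP support
is_high_anon_http = types.get('HTTP') == 'High'
print(is_high_anon_http)  # True
```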
Provider¶
class proxybroker.providers.Provider(url=None, proto=(), max_conn=4, max_tries=3, timeout=20, loop=None)[source]¶

Proxy provider.

A provider is a website that publishes free public proxy lists.

Parameters:
- url (str) – URL of the page where proxies are found
- proto (tuple) – (optional) List of the types (protocols) that may be supported by proxies returned by the provider. Used later as Proxy.types
- max_conn (int) – (optional) The maximum number of concurrent connections to the provider
- max_tries (int) – (optional) The maximum number of attempts to receive a response
- timeout (int) – (optional) Timeout of a request in seconds

proxies¶

Return all found proxies.

Returns: Set of tuples with the proxy host, port and types (protocols) that may be supported (from proto). For example: {('192.168.0.1', '80', ('HTTP', 'HTTPS')), ...}
Return type: set
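A quick sketch of working with this set shape. The addresses are the documentation's example values, not real proxies:

```python
# The set of (host, port, types) tuples as returned by Provider.proxies
found = {
    ('192.168.0.1', '80', ('HTTP', 'HTTPS')),
    ('192.168.0.2', '3128', ('HTTP',)),
}

# Format each entry as host:port with its candidate protocols
rows = sorted('%s:%s %s' % (h, p, '/'.join(t)) for h, p, t in found)
for row in rows:
    print(row)
```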
Examples¶
All examples are also available on GitHub
"""Find and show 10 working HTTP(S) proxies."""
import asyncio
from proxybroker import Broker
async def show(proxies):
while True:
proxy = await proxies.get()
if proxy is None: break
print('Found proxy: %s' % proxy)
proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(
broker.find(types=['HTTP', 'HTTPS'], limit=10),
show(proxies))
loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)
"""Find 10 working HTTP(S) proxies and save them to a file."""
import asyncio
from proxybroker import Broker
async def save(proxies, filename):
"""Save proxies to a file."""
with open(filename, 'w') as f:
while True:
proxy = await proxies.get()
if proxy is None:
break
proto = 'https' if 'HTTPS' in proxy.types else 'http'
row = '%s://%s:%d\n' % (proto, proxy.host, proxy.port)
f.write(row)
def main():
proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(broker.find(types=['HTTP', 'HTTPS'], limit=10),
save(proxies, filename='proxies.txt'))
loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)
if __name__ == '__main__':
main()
"""Find working proxies and use them concurrently.
Note: Pay attention to Broker.serve(), instead of the code listed below.
Perhaps it will be much useful and friendlier.
"""
import asyncio
from urllib.parse import urlparse
import aiohttp
from proxybroker import Broker, ProxyPool
from proxybroker.errors import NoProxyError
async def get_pages(urls, proxy_pool, timeout=10, loop=None):
tasks = [fetch(url, proxy_pool, timeout, loop) for url in urls]
for task in asyncio.as_completed(tasks):
url, content = await task
print('Done! url: %s; content: %.30s' % (url, content))
async def fetch(url, proxy_pool, timeout, loop):
resp, proxy = None, None
try:
proxy = await proxy_pool.get(scheme=urlparse(url).scheme)
proxy_url = 'http://%s:%d' % (proxy.host, proxy.port)
with aiohttp.Timeout(timeout, loop=loop):
async with aiohttp.ClientSession(loop=loop) as session:
async with session.get(url, proxy=proxy_url) as response:
resp = await response.read()
except (aiohttp.errors.ClientOSError, aiohttp.errors.ClientResponseError,
aiohttp.errors.ServerDisconnectedError, asyncio.TimeoutError,
NoProxyError) as e:
print('Error. url: %s; error: %r', url, e)
finally:
if proxy:
proxy_pool.put(proxy)
return (url, resp)
def main():
loop = asyncio.get_event_loop()
proxies = asyncio.Queue(loop=loop)
proxy_pool = ProxyPool(proxies)
judges = ['http://httpbin.org/get?show_env',
'https://httpbin.org/get?show_env']
providers = ['http://www.proxylists.net/', 'http://fineproxy.org/eng/fresh-proxies/']
broker = Broker(
proxies, timeout=8, max_conn=200, max_tries=3, verify_ssl=False,
judges=judges, providers=providers, loop=loop)
types = [('HTTP', ('Anonymous', 'High')), ]
countries = ['US', 'DE', 'FR']
urls = ['http://httpbin.org/get', 'http://httpbin.org/redirect/1',
'http://httpbin.org/anything', 'http://httpbin.org/status/404']
tasks = asyncio.gather(
broker.find(types=types, countries=countries, strict=True, limit=10),
get_pages(urls, proxy_pool, loop=loop))
loop.run_until_complete(tasks)
# broker.show_stats(verbose=True)
if __name__ == '__main__':
main()
"""Find 10 working proxies supporting CONNECT method
to 25 port (SMTP) and save them to a file."""
import asyncio
from proxybroker import Broker
async def save(proxies, filename):
"""Save proxies to a file."""
with open(filename, 'w') as f:
while True:
proxy = await proxies.get()
if proxy is None:
break
f.write('smtp://%s:%d\n' % (proxy.host, proxy.port))
def main():
proxies = asyncio.Queue()
broker = Broker(proxies, judges=['smtp://smtp.gmail.com'], max_tries=1)
# Check proxy in spam databases (DNSBL). By default is disabled.
# more databases: http://www.dnsbl.info/dnsbl-database-check.php
dnsbl = ['bl.spamcop.net', 'cbl.abuseat.org', 'dnsbl.sorbs.net',
'zen.spamhaus.org', 'bl.mcafee.com', 'spam.spamrats.com']
tasks = asyncio.gather(
broker.find(types=['CONNECT:25'], dnsbl=dnsbl, limit=10),
save(proxies, filename='proxies.txt'))
loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)
if __name__ == '__main__':
main()
"""Gather proxies from the providers without
checking and save them to a file."""
import asyncio
from proxybroker import Broker
async def save(proxies, filename):
"""Save proxies to a file."""
with open(filename, 'w') as f:
while True:
proxy = await proxies.get()
if proxy is None:
break
f.write('%s:%d\n' % (proxy.host, proxy.port))
def main():
proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(broker.grab(countries=['US', 'GB'], limit=10),
save(proxies, filename='proxies.txt'))
loop = asyncio.get_event_loop()
loop.run_until_complete(tasks)
if __name__ == '__main__':
main()
"""Run a local proxy server that distributes
incoming requests to external proxies."""
import asyncio
import aiohttp
from proxybroker import Broker
async def get_pages(urls, proxy_url):
tasks = [fetch(url, proxy_url) for url in urls]
for task in asyncio.as_completed(tasks):
url, content = await task
print('Done! url: %s; content: %.100s' % (url, content))
async def fetch(url, proxy_url):
resp = None
try:
async with aiohttp.ClientSession() as session:
async with session.get(url, proxy=proxy_url) as response:
resp = await response.read()
except (aiohttp.errors.ClientOSError, aiohttp.errors.ClientResponseError,
aiohttp.errors.ServerDisconnectedError) as e:
print('Error. url: %s; error: %r' % (url, e))
finally:
return (url, resp)
def main():
host, port = '127.0.0.1', 8888 # by default
loop = asyncio.get_event_loop()
types = [('HTTP', 'High'), 'HTTPS', 'CONNECT:80']
codes = [200, 301, 302]
broker = Broker(max_tries=1, loop=loop)
# Broker.serve() also supports all arguments that are accepted
# Broker.find() method: data, countries, post, strict, dnsbl.
broker.serve(host=host, port=port, types=types, limit=10, max_tries=3,
prefer_connect=True, min_req_proxy=5, max_error_rate=0.5,
max_resp_time=8, http_allowed_codes=codes, backlog=100)
urls = ['http://httpbin.org/get', 'https://httpbin.org/get',
'http://httpbin.org/redirect/1', 'http://httpbin.org/status/404']
proxy_url = 'http://%s:%d' % (host, port)
loop.run_until_complete(get_pages(urls, proxy_url))
broker.stop()
if __name__ == '__main__':
main()
Change Log¶
0.2.1 (Unreleased)¶
- Added the --format flag, which indicates in what format the results will be presented
- Improved the --outfile flag behavior. Previously it was necessary to wait until the whole process had finished; now results are available in real time #39
0.2.0 (2017-09-17)¶
- Added CLI interface
- Added the Broker.serve() function. Now ProxyBroker can work as a proxy server that distributes incoming requests to a pool of found proxies
- Added to the available types (protocols): CONNECT:80 (CONNECT method to port 80) and CONNECT:25 (CONNECT method to port 25, SMTP)
- Added new options for checking and filtering proxies. The Broker.find() method takes new parameters: post, strict, dnsbl. See the documentation for more information
- Added a check that proxies support Cookies and Referer
- Added gzip and deflate support
- Broker attributes max_concurrent_conn and attempts_conn are deprecated, use max_conn and max_tries instead
- Parameter full in Broker.show_stats() is deprecated, use verbose instead
- Parameter types in Broker.find() (and Broker.serve()) is now required
- ProxyChecker renamed to Checker. The ProxyChecker class is deprecated, use Checker instead
- Proxy.avgRespTime renamed to Proxy.avg_resp_time. Proxy.avgRespTime is deprecated, use Proxy.avg_resp_time instead
- Improved documentation
- Major refactoring
0.1.3 (2016-03-26)¶
- ProxyProvider renamed to Provider. The ProxyProvider class is deprecated, use Provider instead
- Broker now accepts a list of providers and judges not only as strings but also as Provider and Judge objects
- Fixed bug with signal handler on Windows #4
0.1.2 (2016-02-27)¶
- Fixed bug with SIGINT on Linux
- Fixed bug with clearing the queue of proxy checks.