Building a Backconnect Proxy (1/5)
Background and Ideation
Published on: 2022-02-08
This is the first in a series of 5 posts that will outline how to go from idea to product through the creation of a backconnect/rotating proxy in Golang. Before a product can be made, it will need to be envisioned. At LDG, we specialize in taking these ideas and bringing them to life.
Here in Part 1, we will dive into a brief introduction to the problem and propose a potential solution. This potential solution of course presents several key questions about technical feasibility. In Part 2 we will validate the potential solution's technical feasibility with code examples. Part 3 will focus on the design and architecture of a potential solution. In Part 4, we'll have our first iteration of the product: an MVP (Minimum Viable Product). Finally, in Part 5, we'll expand upon the MVP with additional key features and discuss potential areas of improvement.
A customer needed to collect data from various web sources. Having prior experience scraping a variety of sites, we found this task relatively easy to accomplish. There were, however, two complicating factors:
- The sources had prohibitively restrictive rate limits that would impede the functional goals of the client.
- The collection had to run constantly (24/7); prolonged outages would mean missing data and reduced data quality.
As any experienced web scraper would know, restrictive rate limits can be circumvented through backconnect/rotating proxies.
As a more practical example, say we wish to retrieve the weather from some weather site. The site allows us to retrieve the weather for 1 city per hour. Our goal, however, is to record the weather of 10 cities each hour over the course of a year to provide some interesting comparisons or visualizations. With the site's rate limit in place, we could not achieve our functional goal. Unless, of course, we make use of backconnect proxies.
VPN vs. Proxy vs. Backconnect Proxy
A basic understanding of VPNs, proxies, and backconnect proxies is a prerequisite for this series, so we'll briefly discuss each here. The remainder of the series presumes this understanding.
VPN (Virtual Private Network)—A virtual network that connects clients across a public network. In certain cases, a VPN may behave like a network proxy; however, the two are distinct in that:
- VPNs can connect multiple clients together, whereas a proxy merely relays a client's traffic to its specified destination; and,
- VPNs support long-lasting connections, whereas a proxy is geared towards short-lived connections.
Proxy—Specifically, a network proxy: it forwards network requests from the client to the destination as if they were its own, then returns the destination's response to the client. For example, a geographical proxy can trick services into serving geographically restricted content.
Backconnect / Rotating Proxy—The proxy will rotate amongst a set of IP addresses to forward the request to the final destination. For example; given a backconnect proxy that rotates between a list of 5 IP addresses, if the client were to make 5 requests through the proxy, the final destination would receive 1 request per IP. Of course, the mechanism of rotation may vary depending on the functional needs of the client.
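The rotation described above can be sketched in a few lines of Go. The `rotator` type below is a hypothetical illustration of the simplest possible mechanism (round-robin); a production proxy would add health checks, weighting, and synchronization for concurrent use.

```go
package main

import "fmt"

// rotator cycles through a fixed set of upstream IP addresses
// in round-robin order.
type rotator struct {
	ips []string
	i   int
}

// next returns the next IP address in the rotation.
func (r *rotator) next() string {
	ip := r.ips[r.i%len(r.ips)]
	r.i++
	return ip
}

func main() {
	r := &rotator{ips: []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}}
	// 5 requests through a 3-IP rotation: each IP is used at most twice.
	for n := 0; n < 5; n++ {
		fmt.Println(r.next())
	}
}
```

With 5 requests and 3 IP addresses, the destination sees at most 2 requests per IP, exactly as in the example above.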
How does a backconnect proxy solve the rate limit issue?
A rate limit restricts access to some resource by only allowing x requests per y interval. An example would be at most 1 request per second (1 r/s). The rate limiter typically applies per user, which in most cases effectively means per IP address. Given a 1 r/s rate limit, if I were to send 3 requests in under a second, the rate limiter would permit the first and block the next 2. No further requests would be permitted until a second has passed since the first request. To make matters worse, some sites may temporarily or permanently ban an IP address for violating the rate limit.
If one could strategically delegate some of their requests to another IP address, they could achieve a higher effective rate limit. IP 1 and IP 2 would each be limited to 1 r/s, but we could send our first request through IP 1 and the second through IP 2. Neither would be rate limited, and we would achieve a request rate of 2 r/s. Adding further IP addresses would allow us to achieve even higher request rates. This is what a backconnect proxy facilitates: the delegation of requests to other IP addresses to circumvent rate limits.
Existing Proxy Services
Naturally, we picked a reasonably priced backconnect proxy service and put that in front of our web scraping script. Proxies are quite a nice solution since they can easily be dropped in between the scraping script and the destination. For example, to make a proxied request with curl you can use:
curl -x http://example-proxy.com http://ip-api.com/json
The web scraper was built to be robust and reliable to ensure data integrity. With both the web scraper and backconnect proxy ready, we began the collection. After a few brief tests confirming the backconnect proxy service was correctly forwarding requests, we deployed the scraping system.
With the system live we began to monitor the performance over the course of a few weeks. Throughout this initial testing period we noticed three major issues coming from the external backconnect proxy service:
- A consistent rate of failed requests: roughly 1 in 5 requests failed with similar errors.
- A weekly outage, during which all requests failed. It occurred on Saturdays around 4 am and lasted about 3-4 hours.
- Variable latency, with some requests taking seconds to complete.
The consistent failures appeared to be due to low-quality proxies that would either time out, erroneously request login details, or return rate limit errors. The backconnect proxy used a whitelisted IP for authentication, so login details should never have been requested.
As for the requests that received rate limit errors from the destination site, the rare chance of IP reuse could explain some of them. However, given the number of IP addresses provided by the proxy service, it was more likely that the addresses were shared amongst the service's clients: other clients were making requests to the same destination as us, which resulted in those IP addresses getting rate limited.
These failures, while inconvenient, were manageable given their relatively low rate of occurrence. Minor updates were made to the web scraper to retry requests a few times upon failure. While not ideal, this solution certainly provided higher confidence in data integrity.
The service provider confirmed the weekly outages were a mandatory scheduled maintenance period. Furthermore, they were unable to accommodate our need for constant availability.
While monitoring the web scraping, we noticed higher request latency than normal. Given the other issues with the proxy, we suspected the proxy was the cause. However, we had to consider two other possible explanations: (a) the destination simply could not keep up with the number of requests, or (b) the destination had detected our attempts to proxy our requests.
We ruled out (a) since the destination was rather popular and, while its performance wasn't perfect, the latency issues occurred only when using the proxy. As for (b), we reviewed how our scraper made its requests to ensure anonymity. We also noticed the latency was not consistent: certain IP addresses offered better performance than others.
Therefore we concluded that the proxies used were generally of poor quality. A slow proxy would defeat the purpose of our client's functional needs and render the use of the proxy redundant.
The sequence diagram below showcases the issue through sequential requests. Given a rate limit of 1 request per second (r/s), we may use a backconnect proxy to achieve 3 r/s. We would also require the backconnect proxy to have at least 3 IP addresses.
Naturally, if we wish to reach that higher request rate we would need to make those 3 requests within a second. If the proxy were to add a delay of 2 seconds per request, then, excluding any other delays, we could only make 3 requests every 6 seconds (3 requests x 2 seconds each).
Thread—An execution unit that performs instructions in sequence. A sequential program executes in a single thread, whereas concurrent programs employ 2 or more threads that may run in parallel.
These proxy delays can be mitigated by introducing concurrency, though concurrency increases the complexity of the client code and may not be possible given the functional needs (e.g. if requests must be made sequentially). In our case, we can make use of concurrent requests. However, even if we make all 3 requests concurrently, the proxy will still delay each for at least 2 seconds. Therefore, if only 3 threads are allocated for making requests concurrently, we will only reach 3 requests per 2 seconds (1.5 r/s). To reach 3 r/s, we would need to dedicate at least 6 threads.
This analysis only considered the delay caused by the proxy; the request to the destination and the response back also take time. Suffice it to say, placing a proxy into the system will undoubtedly introduce some delay. Whether that delay is manageable or too long is informed by the functional needs. In our case, this delay, along with the other two issues, prevented us from accomplishing our functional needs.
We began shopping for alternative backconnect proxy services that could provide us with a more reliable proxy. For our search, we focused on services that could meet the following needs:
- Provide enough proxy IP addresses in rotation to bypass the rate limits.
- Provide a high level of reliability for each proxy IP.
- Offer minimal latency; a slow proxy would defeat the purpose of using one.
- Be affordably priced.
Our needs were unique enough that we decided to investigate creating an in-house backconnect proxy. This is quite a big step, as it can be high risk, low reward given the time investment versus the potential payoff; in many use cases an off-the-shelf service will suffice. The client supported our suggestion to proceed, as no available service met their functional needs.
We began looking into the specifics of what would be required to build our bespoke solution. We identified several key technical questions to consider:
- Which programming language would fit the job?
- Can we create a proxy service?
  - Can we accept requests?
  - Can we forward those requests to a destination?
  - Can we return the response from the destination to the client?
- Can we access multiple IP addresses?
  - Can we programmatically change the IP address used for making an HTTP request?
- Can the proxy service be flexible enough to support switching a proxy request's source IP address to an alternative address programmatically?
  - Since we control both the client and the proxy, this is not as critical. However, we would like to closely match a standard proxy integration so as not to complicate matters.
In our next post we will go through each of these questions and work towards a Proof of Concept (PoC) with code.