Building a Backconnect Proxy (5/5)
Productionalization and Review
Published on: 2022-03-11
This is the fifth and final part in a five-post series outlining how to go from idea to product through the creation of a backconnect/rotating proxy in Golang. Before a product can be made, it needs to be envisioned. At LDG, we specialize in taking these ideas and bringing them to life.
In part 4 we created an MVP; in part 5 we will build out the dynamic reload of the list of available IP addresses. After that, we will discuss potential extensions of the backconnect proxy. Finally, we will review our experience at Legion Development Group building out the backconnect proxy.
Dynamic Reload
To recap, the backconnect proxy we built in part 4 successfully proxies both HTTP and HTTPS requests. To simplify the MVP build, we cut dynamically reloading the list of available IP addresses and instead used a hard-coded list.
rotater := &Rotater{
    availableIPs: []string{
        "x.x.x.1",
        "x.x.x.2",
    },
}
As discussed previously, a static list of available IP addresses increases the downtime of the system: if the list of available IP addresses were to change, the system would need to be stopped, recompiled, and run again. Instead, we can dynamically reload the list of available IP addresses as outlined in one of our PoCs from part 2.
// retrieveIPs returns every IP address bound to the network interface
// whose MAC address matches r.targetInterfaceMAC.
func (r *Rotater) retrieveIPs() (ipList []string, err error) {
    netInterfaces, err := net.Interfaces()
    if err != nil {
        return
    }
    for _, netInterface := range netInterfaces {
        // Skip any interface other than the one our proxy IPs are bound to.
        if netInterface.HardwareAddr.String() != r.targetInterfaceMAC {
            continue
        }
        addresses, err := netInterface.Addrs()
        if err != nil {
            continue
        }
        for _, address := range addresses {
            if val, ok := address.(*net.IPNet); ok {
                ipList = append(ipList, val.IP.String())
            }
        }
    }
    return ipList, nil
}
The code requires us to specify the targetInterfaceMAC address, which is rather trivial to obtain (output condensed for brevity):
$ ip addr
5: eth0: mtu 2800 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether 5e:88:52:b0:f7:5b brd ff:ff:ff:ff:ff:ff
inet z.z.z.z/16 brd 172.22.255.255 scope global ztr2q3aihf
The network interface we are using is eth0 and the server's IP address is z.z.z.z. Therefore, our MAC address is 5e:88:52:b0:f7:5b. We can then add that MAC address to our code as follows:
rotater := &Rotater{
    targetInterfaceMAC: "5e:88:52:b0:f7:5b",
    m:                  &sync.Mutex{},
}
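For readers following along without the repository open, the snippets in this post assume a Rotater struct roughly like the sketch below. The field names match the ones used throughout this series (currentIndex appears in the rotation code later in this post), though the exact definition in the final code may differ slightly.

// Rotater holds the rotating list of proxy IPs and the state needed to
// refresh and iterate over that list safely from multiple goroutines.
type Rotater struct {
    availableIPs       []string    // IPs currently bound to the target interface
    currentIndex       int         // index of the next IP to hand out
    targetInterfaceMAC string      // MAC address of the interface holding our proxy IPs
    m                  *sync.Mutex // guards availableIPs and currentIndex
}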
With the means to retrieve the new list of available IP addresses in place, we now need to periodically fetch that list and update our current list of available IP addresses. This is rather easy in Golang using time.Ticker:
// updateProxyIPs refreshes the list of available IPs immediately and then
// again every 30 minutes.
func (r *Rotater) updateProxyIPs() {
    ticker := time.NewTicker(30 * time.Minute)
    defer ticker.Stop()
    for {
        r.updateIPs()
        <-ticker.C
    }
}

// updateIPs swaps in the freshly retrieved IP list, holding the mutex so
// readers never observe a partially updated state.
func (r *Rotater) updateIPs() {
    ipList, err := r.retrieveIPs()
    if err != nil {
        panic(err)
    }
    if len(ipList) == 0 {
        // Keep the previous list rather than replacing it with nothing.
        return
    }
    r.m.Lock()
    defer r.m.Unlock()
    r.availableIPs = ipList
}

// Start launches the periodic refresh in its own goroutine.
func (r *Rotater) Start() {
    go r.updateProxyIPs()
}
We also update the current list of IPs from the newly retrieved IP list in updateIPs. As shown, we need to use a mutex lock to ensure data consistency. Since we're modifying the list of available IP addresses, there is a possibility that goroutine interleavings could cause an out-of-bounds panic.
To illustrate the potential for concurrency issues, consider the following example. Suppose we had the following code snippets executing concurrently:
Code 1
r.availableIPs = ipList
Code 2
if r.currentIndex >= len(r.availableIPs) {
    r.currentIndex = 0
}
n := r.availableIPs[r.currentIndex]
Without correct locking, the execution order may occur in the following sequence:
- Code 2: lines 1-3
- Code 1: line 1
- Code 2: line 4
This execution sequence would cause an out-of-bounds panic if the initial state met all of the following conditions:
- len(r.availableIPs) > len(ipList)
- r.currentIndex >= len(ipList)
- r.currentIndex < len(r.availableIPs)
To resolve this, a shared mutex must be acquired prior to both Code 1 and Code 2. The mutex will ensure that only the following two valid execution sequences are possible:
- Code 1: line 1
- Code 2: lines 1-4
Or:
- Code 2: lines 1-4
- Code 1: line 1
Since Code 2 performs a bounds check prior to using r.currentIndex, we should not receive an out-of-bounds panic. The one remaining hazard would be r.availableIPs being set to an empty list, which is exactly why updateIPs refuses to swap in an empty list.
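To make the locking concrete, here is a minimal sketch of what a rotation method guarded by the shared mutex could look like. The method name NextIP is illustrative; the real implementation from part 4 may shape this differently.

// NextIP hands out the next available IP in round-robin order. It takes the
// same mutex as updateIPs, so the bounds check and the index access can
// never interleave with a list swap.
func (r *Rotater) NextIP() string {
    r.m.Lock()
    defer r.m.Unlock()
    if r.currentIndex >= len(r.availableIPs) {
        r.currentIndex = 0
    }
    ip := r.availableIPs[r.currentIndex]
    r.currentIndex++
    return ip
}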
Concurrency issues can be rather difficult to manage; however, there may be an alternative, more idiomatic way to handle this in Go (CSP-style concurrency). We'll largely leave that as a task for the reader to explore, though one possible starting point is sketched below.
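As a starting point only, the sketch below shows one possible CSP-style shape: a single goroutine owns the IP list and rotation index, and everything else communicates with it over channels rather than sharing memory behind a mutex. Nothing here comes from our actual implementation; it is simply one way the reader might approach it.

// runRotation is the sole owner of the IP list and rotation index.
// ipRequests carries a reply channel for each "give me the next IP" request;
// ipUpdates delivers a freshly retrieved list.
func runRotation(ipRequests chan chan string, ipUpdates chan []string, initial []string) {
    availableIPs := initial
    currentIndex := 0
    for {
        select {
        case reply := <-ipRequests:
            if len(availableIPs) == 0 {
                reply <- "" // no IP available; the caller must handle this
                continue
            }
            if currentIndex >= len(availableIPs) {
                currentIndex = 0
            }
            reply <- availableIPs[currentIndex]
            currentIndex++
        case newIPs := <-ipUpdates:
            if len(newIPs) > 0 {
                availableIPs = newIPs
                currentIndex = 0
            }
        }
    }
}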
With the reload for IP addresses finished, we only need to call rotater.Start() in the main method. This will create a goroutine for r.updateProxyIPs. The updated code can be found here.
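For completeness, the wiring in main ends up looking roughly like the sketch below. The goproxy setup from part 4 is abbreviated here; the full version is in the linked repository.

func main() {
    rotater := &Rotater{
        targetInterfaceMAC: "5e:88:52:b0:f7:5b",
        m:                  &sync.Mutex{},
    }
    // Begin periodically refreshing the IP list in the background.
    rotater.Start()

    // ... goproxy server setup from part 4 goes here, using rotater to
    // pick the outbound IP for each proxied request ...
}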
Possible Extensions
The backconnect proxy has now been fully created as we envisioned in part 1. However, there are certainly several areas where we could improve the system overall. Besides the previously mentioned flaws, a few areas where the backconnect proxy could be improved are:
- Geo-Location filtering
- Potential alternative rotation algorithms
- Scaling concerns
Geo-Location Filtering
In some cases, a service may attempt to restrict access via geographic filtering. One means of performing geographic filtering is to look up an IP address's location in a Geo-IP database. A backconnect proxy can circumvent this geographic filtering by obtaining IP addresses within the permitted regions. The system would then simply ensure that requests requiring a specific geographical region are internally routed through IP addresses that will be permitted.
For example, given the following IP addresses:
a.a.a.1
a.a.a.2
b.b.b.1
b.b.b.2
The IP addresses are already categorized by location: a.a.a.x for American IP addresses and b.b.b.x for British IP addresses. Imagine the target service only permitted American IP addresses (denoted as a.a.a.x). The backconnect proxy would then need to filter requests for that service to only use the American region IP addresses (a.a.a.1 and a.a.a.2).
Naturally, the backconnect proxy would need to be able to categorize each IP address by its geographical region. In this regard, using a Geo-IP database such as MaxMind's GeoIP database, or some alternative, will suffice.
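As an example of what that categorization could look like, the sketch below tags each available IP with its ISO country code using the oschwald/geoip2-golang reader and a downloaded GeoLite2 country database. This is purely illustrative (the database path and the helper name are assumptions); our production system structures this differently.

import (
    "net"

    geoip2 "github.com/oschwald/geoip2-golang"
)

// countryFor maps each IP in ipList to its ISO country code (e.g. "US", "GB")
// using a local GeoLite2 country database. IPs that fail to parse or look up
// are simply skipped.
func countryFor(ipList []string, dbPath string) (map[string]string, error) {
    db, err := geoip2.Open(dbPath) // e.g. "GeoLite2-Country.mmdb"
    if err != nil {
        return nil, err
    }
    defer db.Close()

    countries := make(map[string]string, len(ipList))
    for _, raw := range ipList {
        ip := net.ParseIP(raw)
        if ip == nil {
            continue
        }
        record, err := db.Country(ip)
        if err != nil {
            continue
        }
        countries[raw] = record.Country.IsoCode
    }
    return countries, nil
}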
Another matter to resolve is obtaining geographically disparate IP addresses. One potential solution is to leverage multiple servers in various regions. This solution naturally increases the complexity of the system. One is left to either manually manage or add an additional layer to facilitate filtering and routing requests to the different servers.
Rotation Algorithms
The current rotation algorithm is rather simple: for each request, the next IP in the list of available IP addresses is provided. This certainly accomplishes our goal, but it is not ideal. For example, suppose we have 5 IP addresses and a client (c1) collecting from a target site (t1). c1 performs 5 requests per second (r/s) and the target site t1 only permits 1 r/s per IP. With the given configuration, requests from c1 should succeed. However, if another client, c2, also makes a request to another site (t2), a request to t1 could get rate limited.
This happens because ip(i+1), which normally would handle request r5 from c1, will instead be used by a request from c2. Therefore request r5 from c1 will instead use ip(i+2) (indices wrap around the 5-entry list), which already handled request r1 within the last second, and will therefore trigger a rate limit error.
One potential solution would be to run a backconnect proxy server per client. This would permit requests to be made to t1 and t2 concurrently.
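A lighter-weight variant of the same idea is to keep a separate round-robin position per client (or per client and target site) inside a single proxy process, so that one client's traffic cannot shift another client's position in the rotation. The sketch below is illustrative only; the struct and the choice of key are assumptions rather than part of the series code.

// keyedRotater keeps an independent rotation index per key, where the key
// could be a client identifier, a target hostname, or both combined.
type keyedRotater struct {
    m            sync.Mutex
    availableIPs []string // assumed non-empty
    indexes      map[string]int
}

func newKeyedRotater(ips []string) *keyedRotater {
    return &keyedRotater{availableIPs: ips, indexes: make(map[string]int)}
}

// nextIPFor returns the next IP for the given key, advancing only that
// key's position.
func (kr *keyedRotater) nextIPFor(key string) string {
    kr.m.Lock()
    defer kr.m.Unlock()
    idx := kr.indexes[key]
    if idx >= len(kr.availableIPs) {
        idx = 0
    }
    ip := kr.availableIPs[idx]
    kr.indexes[key] = idx + 1
    return ip
}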
However, who is to say that c1 and c2 will be such considerate clients? If c2 were to make a request to t1, then one of the clients' requests would still get rate limited.
Scaling Concerns
As mentioned in the previous two sections, scaling could also be an issue. The proxy itself is fairly capable in terms of performance. However, as mentioned in the Geo-Location Filtering section, will we need some higher layer to manage the various backconnect proxies running in different regions? Or, as mentioned in the rotation algorithm section, will we need multiple backconnect proxy servers per client/target site? One possible solution would be to create a management layer that routes proxy requests based on client parameters. Naturally, this would increase the complexity of the backconnect proxy.
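To make that idea a bit more concrete, a management layer could start as little more than a lookup from client parameters to the address of a downstream backconnect proxy. The region keys and proxy addresses below are hypothetical placeholders, not part of our actual system.

import "fmt"

// upstreamFor picks the downstream backconnect proxy for a request based on a
// client-supplied region parameter.
func upstreamFor(region string) (string, error) {
    // Hypothetical region-to-proxy mapping; in practice this would be
    // configuration, not a hard-coded map.
    upstreams := map[string]string{
        "us": "http://us-proxy.internal:8080",
        "gb": "http://gb-proxy.internal:8080",
    }
    addr, ok := upstreams[region]
    if !ok {
        return "", fmt.Errorf("no backconnect proxy configured for region %q", region)
    }
    return addr, nil
}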
Review
How did it really go?
We started this blog series off by discussing a problem we at Legion Development Group encountered. A client wanted us to collect data from some sites. However, these sites had rate limits which interfered with our client's functional goals. Initially, we used an off-the-shelf backconnect proxy solution. After extensive testing, the off-the-shelf backconnect proxy proved unreliable and slow, leading us to build out our own backconnect proxy in Golang.
First, we planned out the build by researching the aspects that remained uncertain. In order to proceed with building, we needed to find the appropriate resources (see part 2). Briefly, we found:
- The goproxy library, which provided most of the networking infrastructure
- A hosting provider that sold IP addresses to bind to servers
- The programmatic means to make a request through a different IP address
Second, we began creating the architecture and design. Our system is more complicated than the backconnect proxy presented in this series; naturally, it addresses some of the issues discussed in the Possible Extensions section above. With the design finished, we began rapid prototyping. We purchased a couple of IP addresses and deployed to an existing server. The system successfully passed our initial tests, so we began to productionize it.
Generally, we build our systems with production in mind, so we primarily added additional monitoring. In total, our development process took several weeks. Finally, we monitored the backconnect proxy for a while afterward to verify its stability. The backconnect proxy performed well, with few dropped requests and consistent data collection.
The launch wasn't all smooth, however: we uncovered a pernicious memory leak that would result in gigabytes (GB) of growth weekly. Using Golang's pprof library, we were able to track down and resolve this issue. However, this also required interacting directly with Golang's GC (garbage collector). These topics are probably worth their own series in the future.
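For readers who have not used pprof before, exposing it in a long-running service can be as simple as the sketch below (a minimal example using the standard net/http/pprof package; our actual monitoring setup is more involved than this):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

// startProfiling exposes heap, goroutine, and CPU profiles at
// http://localhost:6060/debug/pprof/ while the proxy keeps serving traffic.
func startProfiling() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}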
Was It Worth It?
We certainly accomplished our goal: creating a backconnect proxy that could remain reliable and still circumvent rate limits. The up-front cost was rather low, consisting of several IP addresses and development time. In terms of maintenance, the backconnect proxy requires minimal upkeep per year; primarily, we merely need to add additional IP addresses, which can be done at will without service interruption. In terms of cost comparison, running our own proxy is a fair bit cheaper than existing off-the-shelf solutions.
Likewise, our client was also impressed. The backconnect proxy helped maintain consistent collection of their target data regardless of rate limits. The proxy was so successful that the client expanded their data collection scope to include several new websites. Supporting these new websites merely required purchasing additional IP addresses and dynamically loading them into the live service. Furthermore, other clients were able to make use of the backconnect proxy to perform similar data collection for different sites. In fact, the backconnect proxy system is in use right now by several clients.
Conclusion
Over this series we have gone from an idea to a product for a backconnect proxy. In this part, we completed the original design, discussed a few additional features and limitations, and finally reviewed the actual implementation of Legion Development Group's backconnect proxy service.
If you'd like to view code samples from the entire series, you can view our GitHub repo. If you have any questions please feel free to contact us. At Legion Development Group, we are ready to help you take your idea and make it a reality. We have a wide range of experience and are ready to solve your difficult problems.