Building a Backconnect Proxy (5/5)
Productionalization and Review
Published on: 2022-03-11
This is the fifth and final part in a five-post series outlining how to go from idea to product through the creation of a backconnect/rotating proxy in Golang. Before a product can be made, it needs to be envisioned. At LDG, we specialize in taking these ideas and bringing them to life.
In part 4 we created an MVP; in part 5 we will build out the dynamic reload of the list of available IP addresses. After that, we will discuss potential extensions of the backconnect proxy. Finally, we will review our experience at Legion Development Group building out the backconnect proxy.
Dynamic Reload
To recap, the backconnect proxy we built in part 4 successfully proxies both HTTP and HTTPS requests. To simplify the MVP build, we cut dynamically reloading the list of available IP addresses and instead used a hard-coded list.
rotater := &Rotater{
    availableIPs: []string{
        "x.x.x.1",
        "x.x.x.2",
    },
}
As discussed previously, a static list of available IP addresses increases the downtime of the system: if the list of available IP addresses were to change, the system would need to be stopped, recompiled, and run again. Instead, we can dynamically reload the list of available IP addresses as outlined in one of our PoCs from part 2.
// retrieveIPs returns every IP address bound to the network interface
// whose MAC address matches r.targetInterfaceMAC.
func (r *Rotater) retrieveIPs() (ipList []string, err error) {
    netInterfaces, err := net.Interfaces()
    if err != nil {
        return
    }
    for _, netInterface := range netInterfaces {
        // Skip any interface other than the one our proxy IPs are bound to.
        if netInterface.HardwareAddr.String() != r.targetInterfaceMAC {
            continue
        }
        addresses, err := netInterface.Addrs()
        if err != nil {
            continue
        }
        for _, address := range addresses {
            if val, ok := address.(*net.IPNet); ok {
                ipList = append(ipList, val.IP.String())
            }
        }
    }
    return ipList, nil
}
The code requires us to specify the targetInterfaceMAC address, which is rather trivial to obtain (output condensed for brevity):
$ ip addr
5: eth0: mtu 2800 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether 5e:88:52:b0:f7:5b brd ff:ff:ff:ff:ff:ff
inet z.z.z.z/16 brd 172.22.255.255 scope global ztr2q3aihf
The network interface we are using is eth0 and the server's IP address is z.z.z.z. Therefore, our MAC address is 5e:88:52:b0:f7:5b. We can then add that MAC address to our code as follows:
rotater := &Rotater{
    targetInterfaceMAC: "5e:88:52:b0:f7:5b",
    m:                  &sync.Mutex{},
}
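For readers following along without the repository open, the snippets in this post assume a Rotater struct roughly like the sketch below. The field names match the ones used throughout this series (currentIndex appears in the rotation code later in this post), though the exact definition in the final code may differ slightly.

// Rotater holds the rotating list of proxy IPs and the state needed to
// refresh and iterate over that list safely from multiple goroutines.
type Rotater struct {
    availableIPs       []string    // IPs currently bound to the target interface
    currentIndex       int         // index of the next IP to hand out
    targetInterfaceMAC string      // MAC address of the interface holding our proxy IPs
    m                  *sync.Mutex // guards availableIPs and currentIndex
}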
With the means to retrieve the new list of available IP addresses in place, we now need to periodically fetch that list and update our current list of available IP addresses. This is rather easy in Golang using time.Ticker:
// updateProxyIPs refreshes the list of available IPs immediately and then
// again every 30 minutes.
func (r *Rotater) updateProxyIPs() {
    ticker := time.NewTicker(30 * time.Minute)
    defer ticker.Stop()
    for {
        r.updateIPs()
        <-ticker.C
    }
}

// updateIPs swaps in the freshly retrieved IP list, holding the mutex so
// readers never observe a partially updated state.
func (r *Rotater) updateIPs() {
    ipList, err := r.retrieveIPs()
    if err != nil {
        panic(err)
    }
    if len(ipList) == 0 {
        // Keep the previous list rather than replacing it with nothing.
        return
    }
    r.m.Lock()
    defer r.m.Unlock()
    r.availableIPs = ipList
}

// Start launches the periodic refresh in its own goroutine.
func (r *Rotater) Start() {
    go r.updateProxyIPs()
}
We also update the current list of IPs from the newly retrieved IP list in updateIPs. As shown, we need to use a mutex lock to ensure data consistency. Since we're modifying the list of available IP addresses, there is a possibility that goroutine interleavings could cause an out-of-bounds panic.
To illustrate the potential for concurrency issues, consider the following example. Suppose we had the following code snippets executing concurrently:
Code 1
r.availableIPs = ipList
Code 2
if r.currentIndex >= len(r.availableIPs) {
    r.currentIndex = 0
}
n := r.availableIPs[r.currentIndex]
Without correct locking, the execution order may occur in the following sequence:
- Code 2: lines 1-3
- Code 1: line 1
- Code 2: line 4
This execution sequence would cause an out-of-bounds panic if the initial state met all of the following conditions:
- len(r.availableIPs) > len(ipList)
- r.currentIndex >= len(ipList)
- r.currentIndex < len(r.availableIPs)
To resolve this, a shared mutex must be acquired prior to both Code 1 and Code 2. The mutex will ensure that only the following two valid execution sequences are possible:
- Code 1: line 1
- Code 2: lines 1-4
Or:
- Code 2: lines 1-4
- Code 1: line 1
Since Code 2 performs a bounds check prior to using r.currentIndex, we should not receive an out-of-bounds panic. The one remaining hazard would be r.availableIPs being set to an empty list, which is exactly why updateIPs refuses to swap in an empty list.
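To make the locking concrete, here is a minimal sketch of what a rotation method guarded by the shared mutex could look like. The method name NextIP is illustrative; the real implementation from part 4 may shape this differently.

// NextIP hands out the next available IP in round-robin order. It takes the
// same mutex as updateIPs, so the bounds check and the index access can
// never interleave with a list swap.
func (r *Rotater) NextIP() string {
    r.m.Lock()
    defer r.m.Unlock()
    if r.currentIndex >= len(r.availableIPs) {
        r.currentIndex = 0
    }
    ip := r.availableIPs[r.currentIndex]
    r.currentIndex++
    return ip
}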
Concurrency issues can be rather difficult to manage; however, there may be an alternative, more idiomatic way to handle this in Go (CSP-style concurrency). We'll largely leave that as a task for the reader to explore, though one possible starting point is sketched below.
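As a starting point only, the sketch below shows one possible CSP-style shape: a single goroutine owns the IP list and rotation index, and everything else communicates with it over channels rather than sharing memory behind a mutex. Nothing here comes from our actual implementation; it is simply one way the reader might approach it.

// runRotation is the sole owner of the IP list and rotation index.
// ipRequests carries a reply channel for each "give me the next IP" request;
// ipUpdates delivers a freshly retrieved list.
func runRotation(ipRequests chan chan string, ipUpdates chan []string, initial []string) {
    availableIPs := initial
    currentIndex := 0
    for {
        select {
        case reply := <-ipRequests:
            if len(availableIPs) == 0 {
                reply <- "" // no IP available; the caller must handle this
                continue
            }
            if currentIndex >= len(availableIPs) {
                currentIndex = 0
            }
            reply <- availableIPs[currentIndex]
            currentIndex++
        case newIPs := <-ipUpdates:
            if len(newIPs) > 0 {
                availableIPs = newIPs
                currentIndex = 0
            }
        }
    }
}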
With the reload for IP addresses finished, we only need to call rotater.Start() in the main method. This will create a goroutine for r.updateProxyIPs. The updated code can be found here.
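For completeness, the wiring in main ends up looking roughly like the sketch below. The goproxy setup from part 4 is abbreviated here; the full version is in the linked repository.

func main() {
    rotater := &Rotater{
        targetInterfaceMAC: "5e:88:52:b0:f7:5b",
        m:                  &sync.Mutex{},
    }
    // Begin periodically refreshing the IP list in the background.
    rotater.Start()

    // ... goproxy server setup from part 4 goes here, using rotater to
    // pick the outbound IP for each proxied request ...
}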
Possible Extensions
The backconnect proxy has now been fully created as we envisioned in part 1. However, there are certainly several areas where we could improve the system overall. Besides the previously mentioned flaws, a few areas where the backconnect proxy could be improved are:
- Geo-Location filtering
- Potential alternative rotation algorithms
- Scaling concerns
Geo-Location Filtering
In some cases, a service may attempt to restrict access via geographic filtering. One means of performing geographic filtering is to look up an IP address's location in a Geo-IP database. A backconnect proxy can circumvent this geographic filtering by obtaining IP addresses within the permitted regions. The system would then simply ensure that requests requiring a specific geographical region are internally routed through IP addresses that will be permitted.
For example, given the following IP addresses:
a.a.a.1
a.a.a.2
b.b.b.1
b.b.b.2
The IP addresses are already categorized by location: a.a.a.x for American IP addresses and b.b.b.x for British IP addresses. Imagine the target service only permitted American IP addresses (denoted as a.a.a.x). The backconnect proxy would then need to filter requests for that service to only use the American region IP addresses (a.a.a.1 and a.a.a.2).
Naturally, the backconnect proxy would need to be able to categorize each IP address by its geographical region. In this regard, using a Geo-IP database such as MaxMind's GeoIP database, or some alternative, will suffice.
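As an example of what that categorization could look like, the sketch below tags each available IP with its ISO country code using the oschwald/geoip2-golang reader and a downloaded GeoLite2 country database. This is purely illustrative (the database path and the helper name are assumptions); our production system structures this differently.

import (
    "net"

    geoip2 "github.com/oschwald/geoip2-golang"
)

// countryFor maps each IP in ipList to its ISO country code (e.g. "US", "GB")
// using a local GeoLite2 country database. IPs that fail to parse or look up
// are simply skipped.
func countryFor(ipList []string, dbPath string) (map[string]string, error) {
    db, err := geoip2.Open(dbPath) // e.g. "GeoLite2-Country.mmdb"
    if err != nil {
        return nil, err
    }
    defer db.Close()

    countries := make(map[string]string, len(ipList))
    for _, raw := range ipList {
        ip := net.ParseIP(raw)
        if ip == nil {
            continue
        }
        record, err := db.Country(ip)
        if err != nil {
            continue
        }
        countries[raw] = record.Country.IsoCode
    }
    return countries, nil
}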
Another matter to resolve is obtaining geographically disparate IP addresses. One potential solution is to leverage multiple servers in various regions. This solution naturally increases the complexity of the system. One is left to either manually manage or add an additional layer to facilitate filtering and routing requests to the different servers.
Rotation Algorithms
The current rotation algorithm is rather simple: for each request, the next IP in the list of available IP addresses is provided. This certainly accomplishes our goal, but it is not ideal. For example, suppose we have 5 IP addresses and a client (c1) collecting from a target site (t1). c1 performs 5 requests per second (r/s) and the target site t1 only permits 1 r/s per IP. With the given configuration, requests from c1 should succeed. However, if another client, c2, also makes a request to another site (t2), a request to t1 could get rate limited.
This happens because ip(i+1), which normally would handle request r5 from c1, will instead be used by a request from c2. Therefore request r5 from c1 will instead use ip(i+2) (indices wrap around the 5-entry list), which already handled request r1 within the last second, and will therefore trigger a rate limit error.
One potential solution would be to run a backconnect proxy server per client. This would permit requests to be made to t1 and t2 concurrently.
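A lighter-weight variant of the same idea is to keep a separate round-robin position per client (or per client and target site) inside a single proxy process, so that one client's traffic cannot shift another client's position in the rotation. The sketch below is illustrative only; the struct and the choice of key are assumptions rather than part of the series code.

// keyedRotater keeps an independent rotation index per key, where the key
// could be a client identifier, a target hostname, or both combined.
type keyedRotater struct {
    m            sync.Mutex
    availableIPs []string // assumed non-empty
    indexes      map[string]int
}

func newKeyedRotater(ips []string) *keyedRotater {
    return &keyedRotater{availableIPs: ips, indexes: make(map[string]int)}
}

// nextIPFor returns the next IP for the given key, advancing only that
// key's position.
func (kr *keyedRotater) nextIPFor(key string) string {
    kr.m.Lock()
    defer kr.m.Unlock()
    idx := kr.indexes[key]
    if idx >= len(kr.availableIPs) {
        idx = 0
    }
    ip := kr.availableIPs[idx]
    kr.indexes[key] = idx + 1
    return ip
}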
However, who is to say that c1 and c2 will be such considerate clients? If c2 were to make a request to t1, then one of the clients' requests would still get rate limited.
Scaling Concerns
As mentioned in the previous two sections, scaling could also be an issue. The proxy itself is fairly capable in terms of performance. However, as mentioned in the Geo-Location Filtering section, will we need some higher layer to manage the various backconnect proxies running in different regions? Or, as mentioned in the rotation algorithm section, will we need multiple backconnect proxy servers per client/target site? One possible solution would be to create a management layer that routes proxy requests based on client parameters. Naturally, this would increase the complexity of the backconnect proxy.
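To make that idea a bit more concrete, a management layer could start as little more than a lookup from client parameters to the address of a downstream backconnect proxy. The region keys and proxy addresses below are hypothetical placeholders, not part of our actual system.

import "fmt"

// upstreamFor picks the downstream backconnect proxy for a request based on a
// client-supplied region parameter.
func upstreamFor(region string) (string, error) {
    // Hypothetical region-to-proxy mapping; in practice this would be
    // configuration, not a hard-coded map.
    upstreams := map[string]string{
        "us": "http://us-proxy.internal:8080",
        "gb": "http://gb-proxy.internal:8080",
    }
    addr, ok := upstreams[region]
    if !ok {
        return "", fmt.Errorf("no backconnect proxy configured for region %q", region)
    }
    return addr, nil
}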
Review
How did it really go?
We started this blog series off by discussing a problem we at Legion Development Group encountered. A client wanted us to collect data from some sites. However, these sites had rate limits which interfered with our client's functional goals. Initially, we used an off-the-shelf backconnect proxy solution. After extensive testing, the off-the-shelf backconnect proxy proved unreliable and slow, leading us to build out our own backconnect proxy in Golang.
First, we planned out the build by researching the aspects that remained uncertain. In order to proceed with building, we needed to find the appropriate resources (see part 2). Briefly, we found:
- The goproxy library, which provided most of the networking infrastructure
- A hosting provider that sold IP addresses to bind to servers
- The programmatic means to make a request through a different IP address
Second, we began creating the architecture and design. Our system is more complicated than the backconnect proxy presented in this series; naturally, it addresses some of the issues discussed in the Possible Extensions section above. With the design finished, we began rapid prototyping. We purchased a couple of IP addresses and deployed to an existing server. The system successfully passed our initial tests, so we began to productionize it.
Generally, we build our systems with production in mind, so we primarily added additional monitoring. In total, our development process took several weeks. Finally, we monitored the backconnect proxy for a while afterward to verify its stability. The backconnect proxy performed well, with few dropped requests and consistent data collection.
The launch wasn't all smooth, however: we uncovered a pernicious memory leak that would result in gigabytes (GB) of growth weekly. Using Golang's pprof library, we were able to track down and resolve this issue. However, this also required interacting directly with Golang's GC (garbage collector). These topics are probably worth their own series in the future.
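For readers who have not used pprof before, exposing it in a long-running service can be as simple as the sketch below (a minimal example using the standard net/http/pprof package; our actual monitoring setup is more involved than this):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

// startProfiling exposes heap, goroutine, and CPU profiles at
// http://localhost:6060/debug/pprof/ while the proxy keeps serving traffic.
func startProfiling() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}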
Was It Worth It?
We certainly accomplished our goal: creating a backconnect proxy that could remain reliable and still circumvent rate limits. The up-front cost was rather low, consisting of several IP addresses and development time. In terms of maintenance, the backconnect proxy requires minimal upkeep per year; primarily, we merely need to add additional IP addresses, which can be done at will without service interruption. In terms of cost comparison, running our own proxy is a fair bit cheaper than existing off-the-shelf solutions.
Likewise, our client was also impressed. The backconnect proxy helped maintain consistent collection of their target data regardless of rate limits. The proxy was so successful that the client expanded their data collection scope to include several new websites. Supporting these new websites merely required purchasing additional IP addresses and dynamically loading them into the live service. Furthermore, other clients were able to make use of the backconnect proxy to perform similar data collection for different sites. In fact, the backconnect proxy system is in use right now by several clients.
Conclusion
Over this series we have gone from an idea to a product for a backconnect proxy. In this part, we completed the original design, discussed a few additional features and limitations, and finally reviewed the actual implementation of Legion Development Group's backconnect proxy service.
If you'd like to view code samples from the entire series, you can view our GitHub repo. If you have any questions please feel free to contact us. At Legion Development Group, we are ready to help you take your idea and make it a reality. We have a wide range of experience and are ready to solve your difficult problems.