Varnishing all my troubles away

TL;DR

  • Setting up a varnish server on a brand new machine
  • Web page shows: “Error 503 Backend fetch failed”
  • How to investigate the problem (traced back to SELinux)
  • How to actually fix the specific SELinux issue (rather than just turning SELinux off)

Varnish?

For our research work at UCL, we host a bunch of different web sites, web services and applications that run on a bunch of different ports on a bunch of different backend machines (and virtual machines). All external web requests arrive on a single IP, and we use varnish to sit on the frontline (port 80) and marshal all the incoming traffic to and from the appropriate backend server.

Varnish is a web accelerator – it sits in front of whatever is actually generating the content for your web pages and caches whatever content it deems safe to cache. The next time someone requests that same page, the content is served from the cache (fast) rather than going off and generating content from the backend (slow). So it’s often used to speed up web pages and generally reduce load on your backend databases and applications.

This is all great, but varnish also provides a really simple and flexible tool for routing HTTP traffic to different backends (which is actually the point of this post).

What’s the problem?

I eventually got round to moving our frontline varnish server from a decaying machine running CentOS 4(!) to a brand new VM running CentOS 7. This allowed varnish to be upgraded from v2.0 to v4.1.

All good.

I did get stuck with one app that wasn’t working though – the following bit of varnish config was meant to direct traffic through to a backend application, running on a backend server, listening on port 5001.

vcl 4.0;

backend my_app_live {
  .host = "xxx.xxx.xxx.xxx";
  .port = "5001";
}

sub vcl_recv {
  if ( req.http.host == "myapp.domain.com" ) {
    set req.backend_hint = my_app_live;
    return (pass);
  }
}

It was working fine on the old server, but directing my browser to the web address “myapp.domain.com” just gave me the standard Varnish error:

Error 503 Backend fetch failed

Unsurprisingly (given the error message), this error usually appears when varnish has tried to fetch a page from a backend but either couldn’t connect or never got a usable response back. What to do next?

Well let’s start from the backend server and work our way back to varnish…

Can I get the expected web page by sitting on the backend server and contacting the application via the local port?

$ ssh backendserver
$ curl -I http://localhost:5001/ 
HTTP/1.1 200 OK
Date: Fri, 03 Nov 2016 19:42:07 GMT
Server: Apache/2.2.3 (CentOS)
Content-Length: 5504
Connection: close
Content-Type: text/html; charset=utf-8

Yes.

Can I get the expected web page by sitting on the varnish server and contacting the application via a remote port?

$ ssh varnishserver
$ curl -I http://backendserver:5001/ 
HTTP/1.1 200 OK
Date: Fri, 03 Nov 2016 19:45:02 GMT
Server: Apache/2.2.3 (CentOS)
Content-Length: 5504
Connection: close
Content-Type: text/html; charset=utf-8

Yes.
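If you'd rather script these checks, here is a minimal sketch. The helper name http_status is made up for illustration, and the last usage line adds a check that isn't shown above: asking Varnish itself (port 80 on the varnish server) for the page by sending the right Host header, which should reproduce the 503 from the command line.

```shell
#!/usr/bin/env bash
# http_status is a hypothetical helper: it sends a HEAD request and prints
# only the HTTP status code. Any extra arguments are passed straight to curl.
http_status() {
  local url=$1
  shift
  curl -s -o /dev/null -I -w '%{http_code}' "$@" "$url"
}

# Run from the varnish server (hostnames are the placeholders used in this post):
#   http_status http://backendserver:5001/                     # straight to the backend
#   http_status http://localhost/ -H 'Host: myapp.domain.com'  # through varnish on port 80
```

If curl can't connect at all it prints 000 rather than a real status code, which is a useful hint in itself.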

So, what do we know so far…

  • The application seemed to be running fine on the backend server
  • I could retrieve the content directly from the application port (curl)
  • I couldn’t retrieve this content through varnish
  • I could also see that lots of other varnish rerouting was working fine
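One more place to look at this point (not something I did at the time, so treat it as a sketch) is varnish's own log. Varnish 4 lets you filter the shared memory log with a query, so you can watch only the requests that ended in a 503. The wrapper name show_503s is made up for illustration.

```shell
#!/usr/bin/env bash
# show_503s is a hypothetical wrapper: it groups log records per client
# request and keeps only transactions whose response status was 503.
# Extra arguments are passed through to varnishlog.
show_503s() {
  varnishlog -g request -q 'RespStatus == 503' "$@"
}

# On the varnish server (may need root or membership of the varnish group):
#   show_503s
```

With request grouping, the backend side of each failed transaction is shown alongside the client side, and that is where you would expect to find a FetchError record explaining why the fetch failed.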

GIYF

Googling around for issues associated with “varnish” and “503” suggested that the problem might be security settings in SELinux, which took me to a nice blog post about how to get varnish to play nicely with SELinux.

I should be honest here – for a very long time I considered any problems associated with “SELinux” to have an incredibly strong SEP (Somebody Else’s Problem) field. On encountering these problems, my general practice had been to switch SELinux to permissive mode and rely on our main firewall to deal with security issues (ie SEP). As it turns out this practice wasn’t as terrible as it sounds (I checked this with our IT team and they were okay with it). However, turning off security on a brand new externally-facing server left a hacky taste in the mouth.

I figured I should actually do the right thing and learn how to play nicely with SELinux. Turns out it really wasn’t that hard.

Q: Is the problem I’m experiencing related to SELinux?

Good question. Turns out a pretty simple way to find out is to look for a sensible term (eg “varnish”) in the SELinux audit log:

$ ssh varnishserver
$ sudo grep varnish /var/log/audit/audit.log

This turned up a bunch of lines:

type=AVC msg=audit(1478175339.950:37802): avc: denied { name_connect } for pid=9111 comm="varnishd" dest=5001 scontext=system_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:commplex_link_port_t:s0 tclass=tcp_socket

So, yes – the words “varnish”, “denied” and “dest=5001” definitely did suggest my problem was related to SELinux permissions.
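Those AVC lines are dense, so here is a small sketch for picking out the interesting fields. The helper avc_field is made up for illustration; the sample record is the one shown above.

```shell
#!/usr/bin/env bash
# Sample AVC denial copied from the audit log output above.
line='type=AVC msg=audit(1478175339.950:37802): avc: denied { name_connect } for pid=9111 comm="varnishd" dest=5001 scontext=system_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:commplex_link_port_t:s0 tclass=tcp_socket'

# avc_field is a hypothetical helper: given an audit record ($1) and a key
# name ($2), it prints the value of the first key=value pair with that key,
# with any surrounding double quotes removed.
avc_field() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep "^$2=" | head -n 1 | cut -d= -f2- | tr -d '"'
}

echo "$(avc_field "$line" comm) was denied a connection to port $(avc_field "$line" dest)"
# prints: varnishd was denied a connection to port 5001

# The third colon-separated part of an SELinux context is the type.
echo "source type: $(avc_field "$line" scontext | cut -d: -f3)"
echo "target type: $(avc_field "$line" tcontext | cut -d: -f3)"
```

Read that way, the record says a process running as varnishd_t tried to open a TCP connection to a port labelled commplex_link_port_t (port 5001 here), and the policy said no.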

Q: How do I fix my SELinux problem (without just turning the whole thing off)?

Turns out the clever people on the interwebz have written a tool, audit2allow, to help troubleshoot this kind of thing. On CentOS 7 it lives in the policycoreutils-python package, which gets pulled in when you install setroubleshoot (which kind of makes sense).

$ sudo yum install setroubleshoot

This tool can be used to translate the output of the audit log to a more useful message:

$ sudo grep varnishd /var/log/audit/audit.log | audit2allow -w

Which provides messages like:

type=AVC msg=audit(1478177584.127:38275): avc: denied { name_connect } for pid=9118 comm="varnishd" dest=5001 scontext=system_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:commplex_link_port_t:s0 tclass=tcp_socket
 Was caused by:
 The boolean varnishd_connect_any was set incorrectly. 
 Description:
 Allow varnishd to connect any

Allow access by executing:
 # setsebool -P varnishd_connect_any 1

So, this was not only telling me in plain text what caused this error, but also how to “fix” it (tell SELinux that this behaviour was fine).

Now of course I read up on exactly what this command was going to do before executing it (no, really I did).

$ sudo setsebool -P varnishd_connect_any 1
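If this is going to end up in configuration management anyway, it helps to check the current state before changing it. A minimal sketch, assuming getsebool prints its usual "name --> on|off" line; the helper names bool_state and ensure_bool_on are made up for illustration.

```shell
#!/usr/bin/env bash
# bool_state is a hypothetical helper: it reads one line of getsebool output
# ("name --> on" or "name --> off") on stdin and prints just the state.
bool_state() {
  awk '{ print $3 }'
}

# ensure_bool_on is also hypothetical: it only calls setsebool when the
# boolean is not already on, so it is safe to run repeatedly.
ensure_bool_on() {
  local name=$1 state
  state=$(getsebool "$name" | bool_state)
  if [ "$state" = "on" ]; then
    echo "$name already on"
  else
    sudo setsebool -P "$name" 1
  fi
}

# On the varnish server (not run here):
#   ensure_bool_on varnishd_connect_any
```

The -P flag makes the change persistent across reboots, which is why it only needs doing once per machine.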

So we should just need to restart the varnish server…

$ sudo systemctl restart varnish

check the page again…

$ curl -I http://myapp.domain.com/
HTTP/1.1 200 OK

Sorted.

Now I just need to add all this to the puppet configuration…