Writing custom sanitization rules with HtmlSanitizeEx

by Paweł Świątkowski
03 Sep 2016

Sanitizing input in web applications is a common task, usually associated with security. That is, of course, an important part: letting users submit malicious HTML into your page can be deadly. Many frameworks (Ruby on Rails, Hanami, Phoenix) support this kind of sanitization out of the box, and you have to explicitly mark an HTML string as raw to render it without escaping.

However, allowing HTML input is frequently a requirement, especially when using rich text editors (RTEs) on the frontend. In this case, security is one side of the problem, but the other, almost equally important, is to restrict users to a subset of tags. For example, I usually forbid images, because they tend to break layouts. Disabling image input in the RTE is not enough: sooner or later users will discover that they can send arbitrary HTML to the server.

In Elixir, there is a very good package for HTML sanitization: HtmlSanitizeEx. It includes a nice set of built-in sanitization rules. For example, this allows some “basic HTML” tags to be rendered and filters out everything else:

HtmlSanitizeEx.basic_html(input_html)
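
To get a feel for the built-in rules, it can help to run the same input through two of them and compare the results. The snippet below is a minimal sketch: it assumes html_sanitize_ex is already a dependency of your project, and the sample markup is made up for illustration.

```elixir
# Assumption: html_sanitize_ex is listed in the project's deps.
# The input below is an invented sample, not taken from a real application.
input_html = ~s(<h1>Title</h1><p>Hello <strong>world</strong> <img src="cat.png"></p>)

# Apply the built-in "basic HTML" rule set.
input_html |> HtmlSanitizeEx.basic_html() |> IO.puts()

# Apply the strictest built-in rule set, which removes all tags.
input_html |> HtmlSanitizeEx.strip_tags() |> IO.puts()
```

Comparing the two outputs against what your RTE is supposed to produce is a quick way to see whether a built-in rule set is enough or a custom one is needed.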

This is good for general input sanitization, but does not necessarily match the subset of HTML I allow in my RTE. Fortunately, writing your own rule set (or Scrubber, as the package authors call it) is really easy. For example, say I only want to allow p, strong, em and a. The Scrubber looks like this:

defmodule Scrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub
  Meta.strip_comments

  Meta.allow_tag_with_uri_attributes   "a", ["href"], ["http", "https"]
  Meta.allow_tag_with_these_attributes "a", ["name", "title"]

  Meta.allow_tag_with_these_attributes "strong", []
  Meta.allow_tag_with_these_attributes "em", []
  Meta.allow_tag_with_these_attributes "p", []

  Meta.strip_everything_not_covered
end

Note that we can additionally restrict which URI schemes are allowed: here, href only accepts http and https.
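
The scheme list is simply the third argument of allow_tag_with_uri_attributes, so widening it is a one-line change. As a sketch, a variant that also accepts mailto links could look like this (the module name LinkScrubber and the choice of mailto are my assumptions for illustration, not part of the original rule set):

```elixir
# Hypothetical variant of the Scrubber: links only, with mailto allowed.
defmodule LinkScrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub()
  Meta.strip_comments()

  # The third argument lists the URI schemes accepted in href;
  # "mailto" is added here purely as an example.
  Meta.allow_tag_with_uri_attributes("a", ["href"], ["http", "https", "mailto"])

  # Every other tag is stripped, keeping its text content.
  Meta.strip_everything_not_covered()
end
```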

To use it, we define a sanitize function. We can put it in a common module together with the Scrubber:

defmodule MyApp.Sanitizer do
  defmodule Scrubber do
    [...]
  end

  def sanitize(html) do
    html |> HtmlSanitizeEx.Scrubber.scrub(Scrubber)
  end
end

Then we can use it (and test it):

defmodule SanitizerTest do
  use ExUnit.Case, async: true

  test "disallow headings" do
    html = "<h1>1</h1><h2>2</h2><h3>3</h3><h4>4</h4><h5>5</h5><h6>6</h6>"
    sanitized = MyApp.Sanitizer.sanitize(html)
    # Compare with ==; a single = would pattern-match and always pass
    assert sanitized == "123456"
  end

  test "disallow ftp urls" do
    html = "<p>This is <a href=\"ftp://ftp.google.com/test\">FTP test</a></p>"
    sanitized = MyApp.Sanitizer.sanitize(html)
    assert sanitized == "<p>This is <a>FTP test</a></p>"
  end
end
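
Since forbidding images was the original motivation, a test for that case seems worth adding too. This is a sketch that assumes the MyApp.Sanitizer module defined above; the test module name and the sample markup are made up. It checks only that the img tag is gone and the text survives, rather than asserting the exact output string.

```elixir
defmodule SanitizerImageTest do
  use ExUnit.Case, async: true

  test "strips images but keeps the surrounding text" do
    # Invented sample input containing a tag the Scrubber does not allow
    html = ~s(<p>Hello <img src="http://example.com/cat.png"> world</p>)
    sanitized = MyApp.Sanitizer.sanitize(html)

    # The img tag must not survive sanitization...
    refute sanitized =~ "<img"
    # ...while the text around it should still be there.
    assert sanitized =~ "Hello"
    assert sanitized =~ "world"
  end
end
```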

This also reveals one downside: when you filter by URI scheme, you might end up with an a tag stripped of its href. This is how HtmlSanitizeEx currently works, and I can accept that. Still, I am very happy that it was so easy to adjust the sanitizer’s settings to my needs.

If you need inspiration for writing your own sanitization rules, check out the built-in ones on GitHub.

Tags: elixir
