
Sanitize HTML in Ruby

For my first foray into Ruby, I've created an HTML sanitization method. It is partially based on Brad Choate's Perl sanitize_html (used in my standalone comments and trackback package). While this was not a good exercise in learning Ruby objects, it was a good exercise in Ruby regular expressions and String replacement.

Without further ado, here's my annotated sanitize_html in Ruby:

A basic method declaration. The default set of allowed tags and attributes is provided as the default value for the okTags argument. The soloTags array contains tags that don't require a closing tag.


def sanitize_html( html, okTags='a href, b, br, i, p' )
  # no closing tag necessary for these
  soloTags = ["br"]

We begin by building a hash of allowed HTML tags. The hash keys are the allowed tags and the hash values are arrays of allowed attributes for the respective tag. Here's the blow-by-blow breakdown in irb:

irb(main):001:0> okTags = 'a href, b, br, i, p'
=> "a href, b, br, i, p"
irb(main):002:0> tags = okTags.downcase.split(',')
=> ["a href", " b", " br", " i", " p"]
irb(main):003:0> tags.collect!{ |s| s.split(' ') }
=> [["a", "href"], ["b"], ["br"], ["i"], ["p"]]
irb(main):004:0> allowed = Hash.new
=> {}
irb(main):005:0> tags.each do |s|
irb(main):006:1* key = s.shift
irb(main):007:1> allowed[key] = s
irb(main):008:1> end
=> [["href"], [], [], [], []]
irb(main):009:0> allowed
=> {"a"=>["href"], "b"=>[], "p"=>[], "br"=>[], "i"=>[]}

And here's the corresponding code:


  # Build hash of allowed tags with allowed attributes
  tags = okTags.downcase().split(',').collect!{ |s| s.split(' ') }
  allowed = Hash.new
  tags.each do |s|
    key = s.shift
    allowed[key] = s
  end
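
The same lookup table can also be built in a single pass. The sketch below is an alternative formulation, not the code used in the method; build_allowed is a hypothetical helper name introduced only for illustration.

```ruby
# Hypothetical helper (not part of the original method): builds the
# tag => allowed-attributes hash from a spec string in one pass.
def build_allowed(ok_tags)
  ok_tags.downcase.split(',').each_with_object({}) do |spec, allowed|
    # First word is the tag name, any remaining words are its attributes
    tag, *attrs = spec.split(' ')
    allowed[tag] = attrs unless tag.nil?  # skip blank entries such as "a, , b"
  end
end

allowed = build_allowed('a href, b, br, i, p')
# allowed["a"] is ["href"]; allowed["b"] is []
```

Tags that were not listed simply have no key, which is why the method can test membership with allowed.include?(tag).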

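Next, we perform a substitution on all <…> elements. The regular expression is non-greedy (the ? after .* stops each match at the first closing >) and multi-line (the m flag lets . match newlines, so a tag split across lines is still matched as one element).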


  # Analyze all <> elements
  stack = Array.new
  result = html.gsub( /(<.*?>)/m ) do | element |
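
To see what the two modifiers change, here is a small standalone sketch that uses scan instead of gsub. The sample string is made up for illustration.

```ruby
html = "<b>bold</b> and <a\nhref=\"x\">link</a>"

# Non-greedy: each match stops at the first '>' instead of the last one.
# /m: '.' also matches newlines, so a tag broken across lines is one match.
elements = html.scan(/<.*?>/m)
# elements is ["<b>", "</b>", "<a\nhref=\"x\">", "</a>"]

# For comparison, the greedy form runs from the first '<' to the last '>'.
greedy = html.scan(/<.*>/m)

# Without /m, the tag that contains a newline is not matched at all.
single_line = html.scan(/<.*?>/)
```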

It's a closing tag. After verifying that it's allowed and that the opening tag has already been seen, use the stack to keep tags in matched pairs.


    if element =~ /\A<\/(\w+)/ then
      # </tag>
      tag = $1.downcase
      if allowed.include?(tag) && stack.include?(tag) then
        # If allowed and on the stack
        # Then pop down the stack
        top = stack.pop
        out = "</#{top}>"
        until top == tag do
          top = stack.pop
          out << "</#{top}>"
        end
        out
      end
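
The stack handling in the closing-tag branch can be tried on its own. This is a minimal sketch that assumes the stack holds open tag names with the most recent last; close_through is a hypothetical helper name, not something the method defines.

```ruby
# Hypothetical helper: given the stack of open tags and a closing tag name,
# return the closing markup needed to keep the tags properly nested.
def close_through(stack, tag)
  return "" unless stack.include?(tag)  # stray closing tag: emit nothing
  out = String.new
  loop do
    top = stack.pop                     # close the most recently opened tag
    out << "</#{top}>"
    break if top == tag                 # stop once the requested tag is closed
  end
  out
end

stack = ["p", "b", "i"]                 # as if "<p><b><i>" had been seen
closing = close_through(stack, "b")
# closing is "</i></b>" and stack is now ["p"]
```

Closing the inner tags first is what turns mis-nested input such as <b><i>text</b> into well-formed output.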

It's a solo tag. Pass through if allowed.


    elsif element =~ /\A<(\w+)\s*\/>/
      # <tag />
      tag = $1.downcase
      if allowed.include?(tag) then
        "<#{tag} />"
      end

It's an opening tag. Push it onto the stack if it requires a closing tag. If the tag has no allowed attributes, replace the element with a simple opening tag. Otherwise, sweep through the matched element and keep only the allowed attribute-value pairs.


    elsif element =~ /\A<(\w+)/ then
      # <tag ...>
      tag = $1.downcase
      if allowed.include?(tag) then
        if ! soloTags.include?(tag) then
          stack.push(tag)
        end
        if allowed[tag].length == 0 then
          # no allowed attributes
          "<#{tag}>"
        else
          # allowed attributes?
          out = "<#{tag}"
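          # $' holds the text after the most recent match, so each pass
          # resumes scanning just past the attribute found on the previous pass
          while ( $' =~ /(\w+)=("[^"]+")/ )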
            attr = $1.downcase
            valu = $2
            if allowed[tag].include?(attr) then
              out << " #{attr}=#{valu}"
            end
          end
          out << ">"
        end
      end
    end
  end
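
The attribute sweep can also be written with scan, which avoids relying on the $' global. This is a sketch of the same filtering idea rather than the method's own code; filter_attributes is a hypothetical name, and like the original pattern it only recognizes double-quoted values.

```ruby
# Hypothetical helper: keep only the allowed, double-quoted attributes
# of a single opening tag.
def filter_attributes(element, allowed_attrs)
  # scan returns one [name, quoted_value] pair per attribute it finds
  kept = element.scan(/(\w+)=("[^"]+")/).select do |name, _value|
    allowed_attrs.include?(name.downcase)
  end
  kept.map { |name, value| " #{name.downcase}=#{value}" }.join
end

attrs = filter_attributes('<a HREF="/home" onclick="evil()">', ["href"])
# attrs is ' href="/home"'
```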

Our previous substitution was on matched <…> elements. Now, clean up any >'s that come before the first <…> element and any <'s that follow the last <…> element:


  # eat up unmatched leading >
  while result.sub!(/\A([^<]*)>/m) { $1 } do end

  # eat up unmatched trailing <
  while result.sub!(/<([^>]*)\Z/m) { $1 } do end
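
Here is the effect of the two cleanup loops on a made-up string, wrapped in a hypothetical strip_stray_brackets helper so it can be run on its own.

```ruby
# Hypothetical helper wrapping the two cleanup loops.
def strip_stray_brackets(text)
  result = text.dup
  # Drop any '>' that appears before the first '<'
  nil while result.sub!(/\A([^<]*)>/m) { $1 }
  # Drop any '<' that appears after the last '>'
  nil while result.sub!(/<([^>]*)\Z/m) { $1 }
  result
end

cleaned = strip_stray_brackets("a > b <i>text</i> c < d")
# cleaned is "a  b <i>text</i> c  d"
```

Only the stray brackets are removed; the text around them is kept.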

If there are any tags left in the stack, then append the appropriate closing tags to the string.


  # clean up the stack
  if stack.length > 0 then
    result << "</#{stack.reverse.join('></')}>"
  end
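
The join trick is compact but easy to misread, so here it is in isolation next to an equivalent, more explicit form. The stack contents are made up for illustration.

```ruby
stack = ["p", "b"]  # "<p><b>" was opened but never closed

# The form used in the method: reverse, join with "></", then wrap
joined = stack.empty? ? "" : "</#{stack.reverse.join('></')}>"

# An equivalent form that builds each closing tag separately
mapped = stack.reverse.map { |tag| "</#{tag}>" }.join
# both are "</b></p>"
```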

  result
end

22 Mar: sanitize_html is available under the artistic MIT license. A download package will be available shortly.

5 Apr: download sanitize.rb