Mar 17 2005

Sanitize HTML in Ruby

For my first foray into Ruby, I’ve created an HTML sanitization method. It is partially based on Brad Choate’s Perl sanitize_html (used in my standalone comments and trackback package). While this was not a good exercise in learning Ruby objects, it was a good exercise in Ruby regular expressions and String replacement. Without further ado, here’s my annotated sanitize_html in Ruby:

A basic method declaration. The default set of allowed tags and attributes is provided as the default value for the okTags argument. The soloTags array contains tags that don’t require a closing tag.

def sanitize_html( html, okTags='a href, b, br, i, p' )
  # no closing tag necessary for these
  soloTags = ["br"]

We begin by building a hash of allowed HTML tags. The hash keys are the allowed tags and the hash values are arrays of allowed attributes for the respective tag. Here’s the blow-by-blow breakdown in irb:

irb(main):001:0> okTags = 'a href, b, br, i, p'
=> "a href, b, br, i, p"
irb(main):002:0> tags = okTags.downcase.split(',')
=> ["a href", " b", " br", " i", " p"]
irb(main):003:0> tags.collect!{ |s| s.split(' ') }
=> [["a", "href"], ["b"], ["br"], ["i"], ["p"]]
irb(main):004:0> allowed = Hash.new
=> {}
irb(main):005:0> tags.each do |s|
irb(main):006:1* key = s.shift
irb(main):007:1> allowed[key] = s
irb(main):008:1> end
=> [["href"], [], [], [], []]
irb(main):009:0> allowed
=> {"a"=>["href"], "b"=>[], "p"=>[], "br"=>[], "i"=>[]}

And here’s the corresponding code:

# Build hash of allowed tags with allowed attributes
tags = okTags.downcase().split(',').collect!{ |s| s.split(' ') }
allowed = Hash.new
tags.each do |s|
  key = s.shift
  allowed[key] = s
end
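
As a side note, the same parsing can be wrapped in a small helper. This is only a sketch: `parse_allowed` is a hypothetical name that does not appear in the method above, and it assumes a Ruby version that provides `each_with_object`.

```ruby
# Hypothetical helper (not part of the original method): turn a spec string
# such as 'a href, b, br' into { "a" => ["href"], "b" => [], "br" => [] }.
def parse_allowed(ok_tags)
  ok_tags.downcase.split(',').each_with_object({}) do |spec, allowed|
    tag, *attrs = spec.split(' ')   # first word is the tag, the rest are attributes
    next if tag.nil?                # skip empty entries, e.g. from a leading comma
    allowed[tag] = attrs
  end
end

allowed = parse_allowed('a href, b, br, i, p')
```

Unlike the `shift`-based loop, this version leaves the intermediate arrays untouched, which may be easier to follow when experimenting in irb.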

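Next, we perform a substitution on all <…> elements. We specify a non-greedy, multi-line regular expression: the ? after .* makes the match stop at the first >, and the m flag lets . match newlines, so a tag broken across lines is still caught.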

# Analyze all <> elements
stack = Array.new
result = html.gsub( /(<.*?>)/m ) do | element |
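
To see which pieces of the input the block will receive, the same pattern can be tried with `scan`. The sample string below is made up for illustration; the second call drops the `m` flag to show what it contributes.

```ruby
# Made-up input with a tag that is split across two lines.
html = "<p>Hello <b\nclass=\"x\">world</b></p>"

# With /m, "." also matches the newline inside the <b ...> tag.
elements = html.scan(/(<.*?>)/m).flatten

# Without /m, the tag containing the newline is never matched.
without_m = html.scan(/(<.*?>)/).flatten
```

Here `elements` holds all four tags, while `without_m` misses the opening `<b ...>` tag, which would then slip past the sanitizer untouched.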

It’s a closing tag. After verifying that it’s allowed and that the opening tag has already been seen, use the stack to keep tags in matched pairs.

if element =~ /\A<\/(\w+)/ then
  # </tag>
  tag = $1.downcase
  if allowed.include?(tag) && stack.include?(tag) then
    # If allowed and on the stack
    # Then pop down the stack
    top = stack.pop
    out = "</#{top}>"
    until top == tag do
      top = stack.pop
      out << "</#{top}>"
    end
    out
  end
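
The popping logic can be tried on its own. The following is a sketch under the assumption that the stack is a plain array of tag names; `close_tag` is a hypothetical helper name, not something defined in the method.

```ruby
# Hypothetical standalone version of the closing-tag branch.
# Pops the stack until the requested tag has been closed, emitting a
# closing tag for everything that was still open above it.
def close_tag(stack, tag)
  return "" unless stack.include?(tag)   # never opened: drop the closing tag
  out = +""                              # unfrozen empty string
  loop do
    top = stack.pop
    out << "</#{top}>"
    break if top == tag
  end
  out
end

stack  = ["p", "b", "i"]        # <p><b><i> are currently open
closed = close_tag(stack, "b")  # input contained </b> while <i> was still open
```

After the call, `closed` is `"</i></b>"` and only `"p"` remains on the stack, so improperly nested input comes out properly nested.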

It’s a solo tag. Pass through if allowed.

elsif element =~ /\A<(\w+)\s*\/>/
  # <tag />
  tag = $1.downcase
  if allowed.include?(tag) then
    "<#{tag} />"
  end

It’s an opening tag. Push it onto the stack if it requires a closing tag. If the tag has no allowed attributes, replace it with a simple opening tag; otherwise, sweep through the rest of the matched element and keep only the allowed attribute-value pairs.

  elsif element =~ /\A<(\w+)/ then
    # <tag ...>
    tag = $1.downcase
    if allowed.include?(tag) then
      if ! soloTags.include?(tag) then
        stack.push(tag)
      end
      if allowed[tag].length == 0 then
        # no allowed attributes
        "<#{tag}>"
      else
        # allowed attributes?
        out = "<#{tag}"
        while ( $' =~ /(\w+)=("[^"]+")/ )
          attr = $1.downcase
          valu = $2
          if allowed[tag].include?(attr) then
            out << " #{attr}=#{valu}"
          end
        end
        out << ">"
      end
    end
  end
end
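
The `while` loop above depends on `$'` (the text following the most recent successful match) advancing each time an attribute is matched. For experimenting, the same filtering can be sketched with `scan`, which avoids the global variable. `filter_attrs` is a hypothetical name, and the sample attribute string is invented for the example.

```ruby
# Hypothetical helper: keep only the allowed name="value" pairs found in the
# text that follows the tag name. As in the original pattern, only
# double-quoted, non-empty values are recognized.
def filter_attrs(rest, allowed_attrs)
  rest.scan(/(\w+)=("[^"]+")/).each_with_object(+"") do |(name, value), out|
    attr = name.downcase
    out << " #{attr}=#{value}" if allowed_attrs.include?(attr)
  end
end

rest = ' HREF="http://example.com/" onclick="evil()" title="t">'
kept = filter_attrs(rest, ["href"])
```

With only `href` allowed, `kept` is ` href="http://example.com/"`; the `onclick` and `title` attributes are dropped.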

Our previous substitution was on matched <…> elements. Now, clean up any >’s that precede the first <…> element and any <’s that follow the last <…> element:

# eat up unmatched leading >
while result.sub!(/\A([^<]*)>/m) { $1 } do end

# eat up unmatched trailing <
while result.sub!(/<([^>]*)\Z/m) { $1 } do end
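
A quick way to watch the two loops work is to run them on a short string. The input below is made up; in the real method `result` is the output of the `gsub` call.

```ruby
# Made-up sample: one stray ">" before the first tag, one stray "<" after the last.
result = "a>b <i>text</i> c<d".dup

# eat up unmatched leading >
while result.sub!(/\A([^<]*)>/m) { $1 } do end

# eat up unmatched trailing <
while result.sub!(/<([^>]*)\Z/m) { $1 } do end
```

The string ends up as `ab <i>text</i> cd`. Each `sub!` removes one stray character and returns `nil` once nothing is left to remove, which is what ends the loop.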

If there are any tags left in the stack, then append the appropriate closing tags to the string.

  # clean up the stack
  if stack.length > 0 then
    result << "</#{stack.reverse.join('></')}>"
  end

  result
end
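
The `join` trick in the stack cleanup is compact enough that a small example may help. The stack contents here are an assumed sample.

```ruby
# Assumed sample: <p>, <b> and <i> were opened, in that order, and never closed.
stack = ["p", "b", "i"]

# Reverse so the innermost tag closes first, then glue the names together
# with "></" and wrap the whole thing in "</" ... ">".
closing = "</#{stack.reverse.join('></')}>"
```

For this stack, `closing` is `</i></b></p>`.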

Dwight Shih's Soap Box on the Internet Commons
