Sanitize HTML in Ruby

For my first foray into Ruby, I’ve created an HTML sanitization method. It is partially based on Brad Choate’s Perl sanitize_html (used in my standalone comments and trackback package). While this was not a good exercise in learning Ruby objects, it was a good exercise in Ruby regular expressions and String replacement. Without further ado, here’s my annotated sanitize_html in Ruby:

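We start with a basic method declaration. The default set of allowed tags and attributes is supplied as the default value of the okTags argument: a comma-separated list in which each entry is a tag name, optionally followed by the attribute names allowed on that tag. The soloTags array lists the tags that don’t require a closing tag.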

def sanitize_html( html, okTags='a href, b, br, i, p' )
  # no closing tag necessary for these
  soloTags = ["br"]

We begin by building a hash of allowed HTML tags. The hash keys are the allowed tags and the hash values are arrays of allowed attributes for the respective tag. Here’s the blow-by-blow breakdown in irb:

irb(main):001:0> okTags = 'a href, b, br, i, p'
=> "a href, b, br, i, p"
irb(main):002:0> tags = okTags.downcase.split(',')
=> ["a href", " b", " br", " i", " p"]
irb(main):003:0> tags.collect!{ |s| s.split(' ') }
=> [["a", "href"], ["b"], ["br"], ["i"], ["p"]]
irb(main):004:0> allowed = Hash.new
=> {}
irb(main):005:0> tags.each do |s|
irb(main):006:1* key = s.shift
irb(main):007:1> allowed[key] = s
irb(main):008:1> end
=> [["href"], [], [], [], []]
irb(main):009:0> allowed
=> {"a"=>["href"], "b"=>[], "p"=>[], "br"=>[], "i"=>[]}

And here’s the corresponding code:

  # Build hash of allowed tags with allowed attributes
  tags = okTags.downcase().split(',').collect!{ |s| s.split(' ') }
  allowed = Hash.new
  tags.each do |s|
    key = s.shift
    allowed[key] = s
  end
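
The same hash can also be built in a single expression. This is only a sketch of an alternative, not the code the method uses; it assumes nothing beyond core Ruby (each_with_object and a splat assignment) and the same okTags format as above.

```ruby
# Alternative sketch: build the allowed-tag hash in one expression.
okTags = 'a href, b, br, i, p'

allowed = okTags.downcase.split(',').each_with_object({}) do |spec, hash|
  # "a href" gives tag "a" with attrs ["href"]; " b" gives tag "b" with attrs []
  tag, *attrs = spec.split(' ')
  hash[tag] = attrs
end

p allowed
```

The result has the same keys and values as the hash shown in the irb session.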

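Next, we run a substitution over every <…> element. The regular expression is non-greedy (the ? after .*), so each match stops at the first >, and multi-line (the m flag), so . also matches newlines and a tag may span several lines.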

  # Analyze all <> elements
  stack = Array.new
  result = html.gsub( /(<.*?>)/m ) do | element |
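
To see what the two regular expression options buy us, here is a small standalone sketch. The sample string is made up for illustration; scan is used only to list what the pattern would match.

```ruby
# Made-up sample: note the newline inside the <a ...> tag.
html = "<b>bold</b> and <a\nhref=\"x\">link</a>"

# Non-greedy + multi-line: one match per <...> element.
elements = html.scan(/<.*?>/m)

# Greedy + multi-line: a single match from the first < to the last >.
greedy = html.scan(/<.*>/m)

# Non-greedy without m: . cannot cross the newline, so the <a ...> tag is missed.
single_line = html.scan(/<.*?>/)

p elements, greedy, single_line
```

With the non-greedy, multi-line pattern every element is handed to the gsub block separately, including the one that spans two lines.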

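Case one: the element is a closing tag. After verifying that the tag is allowed and that its opening tag has already been seen, we pop the stack down to the matching opening tag, emitting a closing tag for every entry popped. This keeps tags in matched pairs.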

    if element =~ /\A<\/(\w+)/ then
      # </tag>
      tag = $1.downcase
      if allowed.include?(tag) && stack.include?(tag) then
        # If allowed and on the stack
        # Then pop down the stack
        top = stack.pop
        out = "</#{top}>"
        until top == tag do
          top = stack.pop
          out << "</#{top}>"
        end
        out
      end
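
The stack handling is easier to follow in isolation. The sketch below pulls it into a helper; the name close_through is hypothetical and does not appear in the method itself.

```ruby
# Hypothetical helper (not part of sanitize_html): pop the stack down to
# and including `tag`, returning the closing tags for everything popped.
def close_through(stack, tag)
  out = String.new
  return out unless stack.include?(tag)
  loop do
    top = stack.pop
    out << "</#{top}>"
    break if top == tag
  end
  out
end

stack = ["p", "b", "i"]
puts close_through(stack, "b")   # prints "</i></b>"
p stack                          # prints ["p"]
```

Closing b while i is still open closes i first, so the output never contains overlapping tags.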

Case two: the element is a solo tag written in self-closing form, such as <br />. Pass it through if allowed.

    elsif element =~ /\A<(\w+)\s*\/>/
      # <tag />
      tag = $1.downcase
      if allowed.include?(tag) then
        "<#{tag} />"
      end

Case three: the element is an opening tag. Push it onto the stack if it requires a closing tag. If the tag has no allowed attributes, replace the element with a bare opening tag. Otherwise, sweep through the matched element and keep only the allowed attribute-value pairs.

    elsif element =~ /\A<(\w+)/ then
      # <tag ...>
      tag = $1.downcase
      if allowed.include?(tag) then
        if ! soloTags.include?(tag) then
          stack.push(tag)
        end
        if allowed[tag].length == 0 then
          # no allowed attributes
          "<#{tag}>"
        else
          # allowed attributes?
          out = "<#{tag}"
          # $' is the text after the last match, so each pass resumes
          # where the previous attribute ended
          while ( $' =~ /(\w+)=("[^"]+")/ )
            attr = $1.downcase
            valu = $2
            if allowed[tag].include?(attr) then
              out << " #{attr}=#{valu}"
            end
          end
          out << ">"
        end
      end
    end
  end
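
The attribute sweep can be tried on its own. The sketch below uses String#scan instead of the $' loop; it is an alternative way to express the same filter, and the helper name filter_attributes is hypothetical.

```ruby
# Hypothetical helper (not part of sanitize_html): keep only the
# name="value" pairs whose name is in allowed_attrs.
def filter_attributes(element, allowed_attrs)
  element.scan(/(\w+)=("[^"]+")/).each_with_object(String.new) do |(name, value), out|
    attr = name.downcase
    out << " #{attr}=#{value}" if allowed_attrs.include?(attr)
  end
end

element = '<a href="http://example.com" onclick="x()">'
puts "<a#{filter_attributes(element, ['href'])}>"
# prints: <a href="http://example.com">
```

Because the pattern only recognizes double-quoted values, single-quoted or unquoted attributes are dropped. Values of allowed attributes are copied through verbatim, so the content of an href is not inspected.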

Our previous substitution only touched matched <…> elements. Now we clean up any > that appears before the first <…> element and any < that follows the last <…> element:

  # eat up unmatched leading >
  while result.sub!(/\A([^<]*)>/m) { $1 } do end

  # eat up unmatched trailing <
  while result.sub!(/<([^>]*)\Z/m) { $1 } do end
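
Here is a quick standalone check of the two loops. The wrapper name strip_stray_brackets and the sample strings are made up for illustration; the two while lines are the ones from the method.

```ruby
# Hypothetical wrapper around the two cleanup loops.
def strip_stray_brackets(text)
  result = text.dup
  # eat up unmatched leading >
  while result.sub!(/\A([^<]*)>/m) { $1 } do end
  # eat up unmatched trailing <
  while result.sub!(/<([^>]*)\Z/m) { $1 } do end
  result
end

puts strip_stray_brackets("1 > 0 and <b>bold</b> then a < b")
# prints: 1  0 and <b>bold</b> then a  b

puts strip_stray_brackets("<b>a > b</b>")
# prints: <b>a > b</b>
```

As the second call shows, a stray > between the first and last elements is left alone; only the text before the first element and after the last one is cleaned.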

If there are any tags left in the stack, then append the appropriate closing tags to the string.

  # clean up the stack
  if stack.length > 0 then
    result << "</#{stack.reverse.join('></')}>"
  end

  result
end
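
For convenience, here are the fragments assembled into one listing, unchanged apart from indentation and trimmed comments, followed by a few calls. The sample inputs are made up for illustration.

```ruby
def sanitize_html( html, okTags='a href, b, br, i, p' )
  # no closing tag necessary for these
  soloTags = ["br"]

  # Build hash of allowed tags with allowed attributes
  tags = okTags.downcase().split(',').collect!{ |s| s.split(' ') }
  allowed = Hash.new
  tags.each do |s|
    key = s.shift
    allowed[key] = s
  end

  # Analyze all <> elements
  stack = Array.new
  result = html.gsub( /(<.*?>)/m ) do | element |
    if element =~ /\A<\/(\w+)/ then
      # </tag>: if allowed and on the stack, pop down the stack
      tag = $1.downcase
      if allowed.include?(tag) && stack.include?(tag) then
        top = stack.pop
        out = "</#{top}>"
        until top == tag do
          top = stack.pop
          out << "</#{top}>"
        end
        out
      end
    elsif element =~ /\A<(\w+)\s*\/>/
      # <tag />
      tag = $1.downcase
      if allowed.include?(tag) then
        "<#{tag} />"
      end
    elsif element =~ /\A<(\w+)/ then
      # <tag ...>
      tag = $1.downcase
      if allowed.include?(tag) then
        if ! soloTags.include?(tag) then
          stack.push(tag)
        end
        if allowed[tag].length == 0 then
          "<#{tag}>"
        else
          out = "<#{tag}"
          while ( $' =~ /(\w+)=("[^"]+")/ )
            attr = $1.downcase
            valu = $2
            if allowed[tag].include?(attr) then
              out << " #{attr}=#{valu}"
            end
          end
          out << ">"
        end
      end
    end
  end

  # eat up unmatched leading > and unmatched trailing <
  while result.sub!(/\A([^<]*)>/m) { $1 } do end
  while result.sub!(/<([^>]*)\Z/m) { $1 } do end

  # clean up the stack
  if stack.length > 0 then
    result << "</#{stack.reverse.join('></')}>"
  end

  result
end

puts sanitize_html('<p>Hello <b>world</p>')
# prints: <p>Hello <b>world</b></p>
puts sanitize_html('<script>alert(1)</script><a href="http://example.com" onclick="x()">link</a>')
# prints: alert(1)<a href="http://example.com">link</a>
puts sanitize_html('<i>unclosed')
# prints: <i>unclosed</i>
```

Disallowed tags are removed but the text between them is kept, which is why alert(1) survives as plain text in the second call.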

April 5: Sanitize HTML in Ruby (cont)