Major fix for HTML emails + breaklines!

Discussion in 'Developing, APIs and extending' started by Marvin Herbold, Feb 11, 2012.

  1. Marvin Herbold

    Marvin Herbold Established Member

    Recently we decided we'd prefer to process the HTML version of incoming emails rather than the text version. So, I went into Admin CP > Mail Parser > Settings and switched "Email Content Priority" to HTML.

    At first glance this seemed to work - at least for the first email ticket post. After some replies to the ticket it was quickly evident that things are going horribly wrong... because we are using breaklines. The problem is that when one of the breakline rules go into effect, it literally strips the rest of the HTML off the email... which results in seriously invalid HTML with many tags left open. When you view the ticket in the Staff CP, the broken HTML would make all the ticket posts below it render wrong.

    There were other problems that I observed, such as <br /> tags being put in wherever there is a newline in the email... this works for text emails but is wrong for HTML.

    So I pretty much spent the whole day analyzing and fixing these problems. Here is what I came up with.

    I am using Kayako Resolve version 4.40.833.

    At line 712 in __modules/tickets/library/Ticket/class.SWIFT_TicketPost.php (right before the SWIFT-1377 bug fix comment) do this:

    Code:
            //
            // MARVIN: breaklines code leave behind invalid html with open tags
            //
            if ($_isContentHTML) {
                $_contents = preg_replace('/\r?\n/', '', $_contents);
                preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $_contents, $_matchResult);
                $_openedTags = array_reverse($_matchResult[1]);
                preg_match_all('#</([a-z]+)>#iU', $_contents, $_matchResult);
                $_closedTags = array_reverse($_matchResult[1]);
     
                foreach ($_closedTags as $_closedTagKey => $_closedTag) {
                    foreach ($_openedTags as $_openedTagKey => $_openedTag) {
                        if (strcasecmp($_closedTag, $_openedTag) == 0) {
                            unset($_closedTags[$_closedTagKey]);
                            unset($_openedTags[$_openedTagKey]);
                            break;
                        }
                    }
                }
             
                $_tagsWeDontCareAbout = array("br", "meta");
                foreach ($_openedTags as $openedTag) {
                    if (!in_array(strtolower($openedTag), $_tagsWeDontCareAbout)) {
                        $_contents .= '</' . $openedTag . '>';
                    }
                }
            }
            //
            // MARVIN: end of custom code
            //
    
    This fixes both the <br> tags getting put in (because it removes all the newlines from HTML so the nl2br function which gets called later on doesn't have anything to work with) and closes all the open tags in the ticket post.

    I tested this for a bit and it seems to work well. If you try this and see problems let me know about them! Ideally Kayako would be using HTMLPurifier to fix the open tags and remove disallowed html tags, and in fact I do see some HTMLPurifier code in there but it is all disabled. Maybe one day they will get that working and this code fix would no longer be necessary.

    As a bonus if you add this to the top of your __swift/themes/admin_default/css.tpl file:

    Code:
    /* MARVIN: ms office classes */
    .MsoNormal {
        MARGIN: 0;
    }
    /* MARVIN: end custom code */
    
    Now Office emails render nicely without extra spaces between the lines. This works even if you don't have "class" as one of your allowable html tag attributes due to another bug in kayako which I posted about somewhere around these forums recently. Don't forget to go into Admin CP > Diagnostics > Rebuild Cache in order to have the changes to the css.tpl file take effect.

    Let me know if you find this useful!
     
    Jamie Edwards and garyGBM like this.
  2. Marvin Herbold

    Marvin Herbold Established Member

    Oh and it might help to share what I use for my "Allowable HTML Tags" and "Allowable HTML Tag Attributes":

    Allowed Tags:
    hr,address,div,h1,h2,h3,h4,h5,h6,p,blockquote,pre,a,area,br,bdo,font,img,map,span,table,th,tr,td,caption,thead,tbody,tfoot,dl,dd,dt,li,ol,ul,strong,em,b,big,i,ins,u,del,s,abbr,acronym,cite,dfn,em,kbd,code,samp,strong,var

    Allowed Tag Attributes:
    href,rel,src,style,width,align,cellpadding,cellspacing,border,colspan,height,type

    Basically I allow pretty much everything for tags except form, input, button and html, head, body, title, etc...
     
    Jamie Edwards likes this.
  3. Marvin Herbold

    Marvin Herbold Established Member

    Here is a slight update to my original custom code:

    Code:
            //
            // MARVIN: breaklines code leave behind invalid html with open tags
            //
            if ($_isContentHTML) {
                $_contents = preg_replace('/\r?\n/', '', $_contents);
                preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $_contents, $_matchResult);
                $_openedTags = array_reverse($_matchResult[1]);
                preg_match_all('#</([a-z]+)>#iU', $_contents, $_matchResult);
                $_closedTags = array_reverse($_matchResult[1]);
     
                foreach ($_closedTags as $_closedTagKey => $_closedTag) {
                    foreach ($_openedTags as $_openedTagKey => $_openedTag) {
                        if (strcasecmp($_closedTag, $_openedTag) == 0) {
                            unset($_closedTags[$_closedTagKey]);
                            unset($_openedTags[$_openedTagKey]);
                            break;
                        }
                    }
                }
             
                $_tagsWeDontCareAbout = array("br", "meta");
                foreach ($_openedTags as $_openedTag) {
                    if (!in_array(strtolower($_openedTag), $_tagsWeDontCareAbout)) {
                        $_closeTheTag = true;
                        if (substr($_contents, -1) == '>') {
                            $_lastTagPosition = strrpos($_contents, '<');
                            if ($_lastTagPosition !== false) {
                                $_lastTag = substr($_contents, $_lastTagPosition);
                                if (strncasecmp($_lastTag, '<' . $_openedTag, strlen($_openedTag) + 1) == 0) {
                                    $_contents = substr($_contents, 0, $_lastTagPosition);
                                    $_closeTheTag = false;
                                }
                            }
                        }
                        if ($_closeTheTag) {
                            $_contents .= '</' . $_openedTag . '>';
                        }
                    }
                }
            }
            //
            // MARVIN: end of custom code
            //
    
    It does mainly the same thing (closes open tags left dangling by breaklines chomping off half the email body) except it's a little smarter - instead of just closing tags, it will get rid of any open tags completely if closing it would result in something that does nothing. For example:

    <div><p>Hello!</p><div><p>(breakline here... rest of html is gone)

    Becomes:

    <div><p>Hello!</p></div> <-- new and improved code does this

    Instead of:

    <div><p>Hello!</p><div><p></p></div></div> <-- what my original code would have done

    Enjoy.
     
    Jamie Edwards likes this.
  4. Marvin Herbold

    Marvin Herbold Established Member

    Another update - I noticed that the complete history that goes out in emails (if you have this feature enabled) does NOT go through the same html sanitizing process. This is a problem. I see all kinds of bad tags like <body>, <html>, etc... and the history entries have html tags left open due to the breaklines - causing the history to look funky in emails. Well... despair no more, here is the fix:

    __modules/tickets/library/Ticket/class.SWIFT_TicketPost.php around line 1037:

    Change:
    Code:
            if ($_isHTML == true)
            {
                $_textContents = strip_tags(br2nl($_contents));
                $_htmlContents = $_contents;
            } else {
                $_textContents = $_contents;
                $_htmlContents = nl2br(htmlspecialchars($_contents));
            }
    To:
    Code:
            if ($_isHTML == true)
            {
                $_textContents = strip_tags(br2nl($_contents));
                $_htmlContents = $_contents;
                //
                // MARVIN: sanitize the html
                //
                $_htmlContents = self::GetParsedContents($_htmlContents, 'strip', true);
                //
                // MARVIN: end of custom code
                //
            } else {
                $_textContents = $_contents;
                $_htmlContents = nl2br(htmlspecialchars($_contents));
            }
    This makes each post in the history to go through the same HTML sanitizing process as when the post is viewed in the staff cp. I force it to use the 'strip' mode, regardless of your actual setting in the admin cp - mainly because that is what we use and I'm too lazy to figure out which variable to pull from to get the correct setting. ;)
     
    Jamie Edwards likes this.
  5. Marvin Herbold

    Marvin Herbold Established Member

    Hopefully someone from Kayako is looking at this - it would probably make sense for this to be included in Kayako's code base.
     
    Jamie Edwards likes this.
  6. Jamie Edwards

    Jamie Edwards Chief Limey Staff Member

    Hi Marvin,

    That's very kind of you to share, thank you.

    We've been working on integrating HTML Purifier on the slow cooker, which will achieve this kind of thing but will be much more robust to variations in HTML and email format. The problem we have with it right now is memory usage - it is a monster, but progress is being made on that.

    We'll certainly take a look at this code, though!
     
  7. Marvin Herbold

    Marvin Herbold Established Member

    Yep ultimately HTMLPurifier is the way to go. Until then, I hope this is of use to you. :)

    We just went live with these code changes (until now it was only on our dev server). Seems to be holding up so far. The ticket posts from users look so much better now.
     
  8. Marvin Herbold

    Marvin Herbold Established Member

    Here is a big update! I threw away my original code - my original code was running under the assumption that perfectly valid HTML emails were coming in and the breaklines were corrupting the HTML by stripping the bottom half of the HTML clean off without closing tags. My mistake was assuming that perfectly valid HTML emails were coming in. Yesterday I noticed a few ticket posts were rendering weird in the staff cp. I looked at the original HTML emails... and... oh... my... god... seriously? I couldn't believe it... the original email had completely invalid HTML in it - closing tags where an opening tag did not even exist, etc... wow.

    OK... so round 2... the gloves come off...

    I wrote a complete html parser from scratch - but this is not your typical html parser... it's broken-HTML aware and fixes tags where it can. I just made my new code live on our support site and browsed through hundreds of support ticket. Every one of them render perfectly in the staff cp now. That's not to say my code is perfect - I do plan to make a few updates to it. For example <table><tr><td><td></table> is perfectly valid HTML - my parser tries to "fix" this and outputs <table><tr><td><td></td></td></tr></table>. The true "fix" is actually <table><tr><td></td><td></td></tr></table> (the placement of the </td>'s are different). However, this does not cause any problems rendering in the staff cp.

    Jamie - you have my permission to include this in the official Kayako code and improve on it if you wish.

    Add the following functions to __modules/tickets/library/Ticket/class.SWIFT_TicketPost.php:

    Code:
       
        private static function ParseHtml(& $_contents, $_length, & $_offset, & $_currentTag)
        {
            $_singletonTags = array('area', 'base', 'br', 'col', 'command', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source');
           
            while ($_offset < $_length)
            {
                if ($_contents[$_offset] == '<')
                {
                    if (substr_compare($_contents, '<!--', $_offset, 4) == 0)
                    {
                        if (!preg_match('#<!--.*-->#U', $_contents, $_matches, 0, $_offset))
                        {
                            $_offset = $_length;
                            return;
                        }
                       
                        $_offset += strlen($_matches[0]);
                    }
                    else if (substr_compare($_contents, '</', $_offset, 2) == 0)
                    {
                        if (!preg_match('#</([a-z:]+)( .*)?>#iU', $_contents, $_matches, 0, $_offset))
                        {
                            $_offset = $_length;
                            return;
                        }
                       
                        $_tagName = strtolower($_matches[1]);
                       
                        if ($_tagName == $_currentTag->_name)
                        {
                            $_offset += strlen($_matches[0]);
                            $_currentTag->_hasCloseTag = true;
                            return;
                        }
                        else
                        {
                            $_parentTag = $_currentTag->_parentTag;
                           
                            while ($_parentTag !== null)
                            {
                                if ($_parentTag->_name == $_tagName)
                                {
                                    return;
                                }
                               
                                $_parentTag = $_parentTag->_parentTag;
                            }
                           
                            $_offset += strlen($_matches[0]);
                        }
                    }
                    else
                    {
                        if (!preg_match('#<([a-z:]+)( .*)?(/)?>#iU', $_contents, $_matches, 0, $_offset))
                        {
                            $_offset = $_length;
                            return;
                        }
                       
                        $_offset += strlen($_matches[0]);
                       
                        $_newTag = new HtmlTag($_currentTag, strtolower($_matches[1]));
                       
                        $_currentTag->_data[] = $_newTag;
                       
                        $_newTag->_hasOpenTag = true;
                       
                        if (isset($_matches[2]))
                        {
                            $_newTag->_parameters = trim($_matches[2]);
                        }
                       
                        if (isset($_matches[3]) && ($_matches[3] == '/'))
                        {
                            $_newTag->_isSelfClosing = true;
                        }
                        else
                        {
                            if (in_array($_newTag->_name, $_singletonTags))
                            {
                                $_newTag->_isSingletonTag = true;
                            }
                            else
                            {
                                self::ParseHtml($_contents, $_length, $_offset, $_newTag);
                            }
                        }
                    }
                }
                else
                {
                    if (!preg_match('#(.*)<#U', $_contents, $_matches, 0, $_offset))
                    {
                        $_offset = $_length;
                        return;
                    }
                   
                    $_offset += strlen($_matches[1]);
                   
                    $_currentTag->_data[] = $_matches[1];
                }
            }
        }
       
        private static function FindTag($_currentTag, $_tagName)
        {
            if ($_currentTag->_name == $_tagName)
            {
                return $_currentTag;
            }
           
            foreach ($_currentTag->_data as $_data)
            {
                if ($_data instanceof HtmlTag)
                {
                    $_matchingTag = self::FindTag($_data, $_tagName);
                   
                    if ($_matchingTag !== null)
                    {
                        return $_matchingTag;
                    }
                }
            }
           
            return null;
        }
       
        private static function ReconstructHtml(HtmlTag $_currentTag)
        {
            $_contents = '<' . $_currentTag->_name;
           
            if (!empty($_currentTag->_parameters))
            {
                $_contents .= ' ' . $_currentTag->_parameters;
            }
           
            if ($_currentTag->_isSelfClosing)
            {
                $_contents .= ' />';
            }
            else
            {
                $_contents .= '>';
               
                if (!$_currentTag->_isSingletonTag)
                {
                    foreach ($_currentTag->_data as $_data)
                    {
                        if ($_data instanceof HtmlTag)
                        {
                            $_contents .= self::ReconstructHtml($_data);
                        }
                        else
                        {
                            $_contents .= $_data;
                        }
                    }
     
                    $_contents .= '</' . $_currentTag->_name . '>';
                }
            }
                           
            return $_contents;
        }
    Then right before the SWIFT-1377 bug fix in the same file, in the function GetParsedContents, insert the following code:

    Code:
            if ($_isContentHTML)
            {
                $_contents = preg_replace('/\r?\n/', '', $_contents); // remove newlines
                $_length = strlen($_contents);
                $_offset = 0;
                $_rootTag = new HtmlTag(null, 'root');
               
                self::ParseHtml($_contents, $_length, $_offset, $_rootTag);
               
                $_firstTag = self::FindTag($_rootTag, 'body');
               
                if ($_firstTag === null)
                {
                    $_firstTag = $_rootTag;
                }
     
                $_contents = self::ReconstructHtml($_firstTag);
            }
    Boom - HTML emails render nicely in the staff cp now.

    A few notes:

    All HTML comments are thrown away. The <head> part of the HTML email is thrown away. Essentially it cares only about what is in the <body> part.

    I will post whatever improvements I make to this code.
     
    garyGBM likes this.
  9. Gary McGrath

    Gary McGrath Staff Member

    Looking good Marvin :)

    Gary
     
  10. Marvin Herbold

    Marvin Herbold Established Member

    Hey all - I made some improvements and fixes to this. But instead of posting the code changes here, you can download the file here:

    http://forge.kayako.com/projects/beautiful-html-ticket-posts

    Keep in mind that my code changes are for version 4.40.833 - if you are running a different version or have your own custom code in that file already, then you can run a code merge tool (my favorite is Beyond Compare)
     
  11. Marvin Herbold

    Marvin Herbold Established Member

  12. tallen-bt

    tallen-bt Established Member

    Wanted to post we have one ticket showing the following:
    Ticket # 3004 - accessibility issue..png
     
  13. Marvin Herbold

    Marvin Herbold Established Member

    Did you use version 1.01 of the source code from http://forge.kayako.com/projects/beautiful-html-ticket-posts and are you running version 4.40.833?

    Seems odd you are getting this error - it seems like it's trying to remove the tag that is at the root (meaning you'll have no tags left after the remove).

    Well, the fix is real easy - I'm going to post version 1.02 with this fix. But you'll end up with an empty body ticket post (not sure why - maybe your ticket post actually has nothing in it?)
     
  14. tallen-bt

    tallen-bt Established Member

    Yeah I just verified I am running that version, and used the forge source code.
     
  15. Marvin Herbold

    Marvin Herbold Established Member

    Ok - it seems like you're coming across a ticket post with nothing in the body - I'm writing a fix for that right now.
     
  16. tallen-bt

    tallen-bt Established Member

    Yeah I am looking at the parser log and I see no MIME-Data for the ticket.
     
  17. Marvin Herbold

    Marvin Herbold Established Member

    Ok, I just put up version 1.02 which should fix the case where the entire post is nothing but whitespace.

    Not sure if this will fix your case - you are saying you see no MIME data at all? Hmm. Give 1.02 a try and let me know if it works.
     
  18. tallen-bt

    tallen-bt Established Member

    Looks like that worked, one of the ticket replies is blank only with an attachment. Must of been the culprit.

    Thanks for the code and quick fix!!! :D
     
  19. Marvin Herbold

    Marvin Herbold Established Member

    Whoo hoo. I'm glad I was able to get that fixed quickly for you. :D Any feedback on this project in general? How are your HTML ticket posts looking?
     
  20. tallen-bt

    tallen-bt Established Member

    So far they are looking good. It is nice to see color and such again without the giant walls of text from the image code when we were interpreting emails as text. Signatures at our company contain 5 images of various sizes so we can end up with a lot of wasted space.
     

Share This Page