EBNF Grammar for Parsing Chrome Bookmarks

The bookmarks HTML file exported by Chrome is not valid HTML. It follows its own rules under a different DTD, NETSCAPE-Bookmark-file-1. Here is an ANTLR 4 grammar for parsing the bookmarks, with support for Unicode characters in bookmark names.

grammar Bookmarks;
 
document : prolog? misc* meta* misc* dl misc*;

prolog : DTD;

misc 
    : COMMENT 
    | S
    ;

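meta 
    : '<' tag attribute* '>' TEXT '</' tag '>'
    | '<' tag attribute* '>'
    ;

dl : '<DL><p>' misc* dt* misc* '</DL><p>';

dt 
    : '<DT><' tag attribute* '>' content '</' tag '>' 
    | '<DT><' tag attribute* '></' tag '>'
    | dl
    ;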

attribute 
    : attributeName '=' attributeValue 
    | S
    ;

tag 
    : H3 
    | TEXT
    ;

attributeName : TEXT;

attributeValue : VAL;

content : TEXT+;

DTD : '<!DOCTYPE NETSCAPE-Bookmark-file-1>';

COMMENT : '<!--' .*? '-->' S;

H3 : 'H3';

VAL : '"'.*?'"';

TEXT : [A-Za-z0-9:\/\.@\-_;\s*]+ | NameChar+;

fragment
NameChar
    : NameStartChar
    | '0'..'9'
    | '_'
    | '\u00B7'
    | '\u0300'..'\u036F'
    | '\u203F'..'\u2040'
    ;

fragment
NameStartChar
    : 'A'..'Z' | 'a'..'z'
    | '\u00C0'..'\u00D6'
    | '\u00D8'..'\u00F6'
    | '\u00F8'..'\u02FF'
    | '\u0370'..'\u037D'
    | '\u037F'..'\u1FFF'
    | '\u200C'..'\u200D'
    | '\u2070'..'\u218F'
    | '\u2C00'..'\u2FEF'
    | '\u3001'..'\uD7FF'
    | '\uF900'..'\uFDCF'
    | '\uFDF0'..'\uFFFD'
    ;

S : [ \t\r\n]+ -> skip;

A sample of the exported bookmarks:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="1481473849" LAST_MODIFIED="1481473992" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
    <DL><p>
        <DT><H3 ADD_DATE="1481473866" LAST_MODIFIED="1481473967">Test 1</H3>
        <DL><p>
            <DT><A HREF="https://foo.example.com/" ADD_DATE="1481473884" ICON="">Foo Example</A>
            <DT><A HREF="https://example.com/" ADD_DATE="1481473892" ICON="">Example</A>
            <DT><A HREF="http://bar.example.com/" ADD_DATE="1481473954">Example Domain</A>
        </DL><p>
        <DT><H3 ADD_DATE="1481473872" LAST_MODIFIED="1481473980">Test 2</H3>
        <DL><p>
            <DT><A HREF="https://foo1.example.com/" ADD_DATE="1481473902" ICON="">Example 1</A>
            <DT><A HREF="https://foo2.example.com/" ADD_DATE="1481473936" ICON="">Example 2</A>
            <DT><A HREF="http://foo3.example.com/" ADD_DATE="1481473955">Example 3</A>
        </DL><p>
        <DT><A HREF="https://foo4.example.com/" ADD_DATE="1481473893" ICON="">Example 4</A>
        <DT><A HREF="http://foo5.example.com/" ADD_DATE="1481473986" ICON=""></A>
        <DT><A HREF="https://foo6.example.com/" ADD_DATE="1481473992" ICON=""></A>
        <DT><H3 ADD_DATE="1481474004" LAST_MODIFIED="1481477692">Test 3</H3>
        <DL><p>
            <DT><A HREF="https://foo7.example.com/" ADD_DATE="1481474004" ICON="">Example 7</A>
        </DL><p>
    </DL><p>
</DL><p>

The clj-antlr library can be used to get the parse tree out of the grammar. The snippet below builds a parser from the grammar file and prints the parse tree for the exported bookmarks. For better performance, use a compiled version of the grammar.

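(require '[clj-antlr.core :as antlr] '[clojure.pprint :refer [pprint]])
;; Build the parser from the grammar file once, then apply it to the exported bookmarks.
(def bm (antlr/parser "/home/jsloop/dev/clojure/bookmarks-parser/grammar/Bookmarks.g4"))
(pprint (bm (slurp "/home/jsloop/dev/clojure/bookmarks-parser/resources/bookmarks.html")))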
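
Once the parse tree is available, the bookmark entries can be pulled out with ordinary sequence functions. The following is a minimal sketch, assuming the tree is a nested list in which every node starts with its rule name as a keyword (for example `(:dt ...)`) and every leaf is a token string. The `sample-tree` literal is hand-written and much smaller than real parser output, and the helpers `nodes-of`, `children-of`, `leaf-text`, and `bookmarks` are hypothetical names introduced only for illustration.

```clojure
(require '[clojure.string :as str])

;; Hand-written stand-in for parser output (hypothetical and abbreviated).
(def sample-tree
  '(:document
    (:dl "<DL><p>"
         (:dt "<DT><" (:tag "A")
              (:attribute (:attributeName "HREF") "="
                          (:attributeValue "\"https://example.com/\""))
              ">" (:content "Example") "</" (:tag "A") ">")
         (:dt "<DT><" (:tag "A")
              (:attribute (:attributeName "HREF") "="
                          (:attributeValue "\"https://foo.example.com/\""))
              ">" (:content "Foo" "Example") "</" (:tag "A") ">")
         "</DL><p>")))

(defn nodes-of
  "Returns every subtree of `tree`, at any depth, whose head is `rule`."
  [rule tree]
  (filter #(and (seq? %) (= rule (first %)))
          (tree-seq seq? rest tree)))

(defn children-of
  "Returns only the direct children of `node` whose head is `rule`."
  [rule node]
  (filter #(and (seq? %) (= rule (first %))) (rest node)))

(defn leaf-text
  "Joins all token strings under `node` with single spaces."
  [node]
  (str/join " " (filter string? (tree-seq seq? rest node))))

(defn bookmarks
  "Returns a map of :title and :href for each :dt node that directly
   carries an HREF attribute. Folder entries (H3) have no HREF and are skipped."
  [tree]
  (for [dt (nodes-of :dt tree)
        :let [attrs (into {}
                          (for [a (children-of :attribute dt)
                                :let [n (first (children-of :attributeName a))
                                      v (first (children-of :attributeValue a))]
                                :when (and n v)]
                            ;; Strip the surrounding double quotes kept by VAL.
                            [(leaf-text n) (str/replace (leaf-text v) "\"" "")]))]
        :when (contains? attrs "HREF")]
    {:title (str/join " " (map leaf-text (children-of :content dt)))
     :href  (attrs "HREF")}))

(doseq [b (bookmarks sample-tree)]
  (println (:title b) "->" (:href b)))
```

Attributes and content are read from the direct children of each `:dt` node so that a `:dt` wrapping a nested `:dl` does not pick up the attributes of the bookmarks inside it. With real output, `sample-tree` would be replaced by the result of applying the parser to the bookmarks file.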