Skip to content

Export posts out of NodeBB into HTML and Markdown flat files -> Halo ITSM

Guides
1 1 653 1
  • At work, we are transitioning from NodeBB for our Knowledge Base to Halo ITSM, which we require for SOC2 compliance amongst other things. Because I had 165 articles in NodeBB I didn’t want to have to re-type, or even copy and paste, I decided to write a Python script to walk the target category and create a file for each.

    Here’s the script to complete that. There are a number of prerequisities here, which I’ve identified below

    import os
    import re
    import time
    import requests
    import html2text
    from datetime import datetime
    
    # --- CONFIGURATION ---
    # Your Forum URL goes here
    BASE_URL = "https:/yourforum.com"
    #The category ID you want to target goes here
    CATEGORY_ID = 3
    # In my case, I needed to define a new "home" for the exported files under `/public/uploads` as this contained all the images I needed to embed into the new flat files. Therefore, ASSET_DOMAIN is nothing more than a basic website where I can grab the images from afterwards.
    ASSET_DOMAIN = "https://assetlocation.com"
    # The below directories are created at the same level as the script. If they do not exist, you need to create them. They will contain both `HTML`  and `markdown` copies of the posts.
    HTML_DIR = "nodebb_export_html"
    MD_DIR = "nodebb_export_markdown"
    os.makedirs(HTML_DIR, exist_ok=True)
    os.makedirs(MD_DIR, exist_ok=True)
    
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.body_width = 0
    
    page = 1
    total_exported = 0
    
    print(f"🔄 Starting export for category {CATEGORY_ID} from {BASE_URL}")
    
    while True:
        print(f"📄 Fetching page {page}...")
        url = f"{BASE_URL}/api/category/{CATEGORY_ID}?page={page}"
        res = requests.get(url, timeout=10)
        if res.status_code != 200:
            print(f"❌ Failed to fetch page {page}: {res.status_code}")
            break
    
        data = res.json()
        topics = data.get("topics", [])
        if not topics:
            print("✅ No more topics found. Export complete.")
            break
    
        for topic in topics:
            tid = topic['tid']
            title = topic['title']
            print(f"→ Exporting topic {tid}: {title}")
    
            topic_url = f"{BASE_URL}/api/topic/{tid}"
            topic_res = requests.get(topic_url, timeout=10)
            if topic_res.status_code != 200:
                print(f"⚠️ Failed to fetch topic {tid}")
                continue
    
            topic_data = topic_res.json()
            posts = topic_data.get("posts", [])
            tags = topic_data.get("topic", {}).get("tags", [])
            tag_list = ", ".join(tags) if tags else ""
    
            safe_title = title.replace(' ', '_').replace('/', '-')
            html_file = f"{HTML_DIR}/{tid}-{safe_title}.html"
            md_file = f"{MD_DIR}/{tid}-{safe_title}.md"
    
            # --- HTML EXPORT ---
            with open(html_file, "w", encoding="utf-8") as f_html:
                f_html.write(f"<html><head><title>{title}</title></head><body>\n")
                f_html.write(f"<h1>{title}</h1>\n")
                if tag_list:
                    f_html.write(f"<p><strong>Tags:</strong> {tag_list}</p>\n")
    
                for post in posts:
                    username = post['user']['username']
                    content_html = post['content']
                    timestamp = datetime.utcfromtimestamp(post['timestamp'] / 1000).strftime('%Y-%m-%d %H:%M:%S UTC')
                    pid = post['pid']
    
                    # Rewrite asset paths in HTML
                    content_html = re.sub(
                        r'src=["\'](/assets/uploads/files/.*?)["\']',
                        rf'src="{ASSET_DOMAIN}\1"',
                        content_html
                    )
                    content_html = re.sub(
                        r'href=["\'](/assets/uploads/files/.*?)["\']',
                        rf'href="{ASSET_DOMAIN}\1"',
                        content_html
                    )
    
                    f_html.write(f"<div class='post'>\n")
                    f_html.write(f"<h3><strong>Original Author: {username}</strong></h3>\n")
                    f_html.write(f"<p><em>Posted on: {timestamp} &nbsp;|&nbsp; Post ID: {pid}</em></p>\n")
                    f_html.write(f"{content_html}\n")
                    f_html.write("<hr/>\n</div>\n")
    
                f_html.write("</body></html>\n")
    
            # --- MARKDOWN EXPORT ---
            with open(md_file, "w", encoding="utf-8") as f_md:
                # Metadata block
                f_md.write(f"<!-- FAQLists: Knowledge Base -->\n")
                if tag_list:
                    f_md.write(f"<!-- Tags: {tag_list} -->\n")
                f_md.write("\n")
    
                f_md.write(f"# {title}\n\n")
    
                for post in posts:
                    username = post['user']['username']
                    content_html = post['content']
                    timestamp = datetime.utcfromtimestamp(post['timestamp'] / 1000).strftime('%Y-%m-%d %H:%M:%S UTC')
                    pid = post['pid']
    
                    # Convert HTML to Markdown
                    content_md = h.handle(content_html).strip()
    
                    # Rewrite asset paths
                    content_md = re.sub(
                        r'(!\[.*?\])\((/assets/uploads/files/.*?)\)',
                        rf'\1({ASSET_DOMAIN}\2)',
                        content_md
                    )
                    content_md = re.sub(
                        r'(\[.*?\])\((/assets/uploads/files/.*?)\)',
                        rf'\1({ASSET_DOMAIN}\2)',
                        content_md
                    )
    
                    f_md.write(f"**Original Author: {username}**\n\n")
                    f_md.write(f"_Posted on: {timestamp}  |  Post ID: {pid}_\n\n")
                    f_md.write(f"{content_md}\n\n---\n\n")
    
            total_exported += 1
            print(f"✔ Saved: {html_file} & {md_file}")
    
        page += 1
        time.sleep(1)
    
    print(f"\n🎉 Done! Exported {total_exported} topics to '{HTML_DIR}' and '{MD_DIR}'")
    
    

    Run the script using python scriptname.py.

    If the script fails, it’s likely because you do not have the required modules installed in Python

    import os
    import re
    import time
    import requests
    import html2text
    

    In this case, you’d need to install them using (for example) pip install html2text

    To get them into an Excel file where they can all be bulk imported, we’d then use something like the below script

    import os
    import re
    import pandas as pd
    from datetime import datetime
    import markdown
    
    # --- CONFIGURATION ---
    export_dir = "nodebb_export_markdown"
    output_file = "Halo_KB_Import_HTML.xlsx"
    # This value can be whatever suits your needs
    created_by = "Import"
    today = datetime.today().strftime('%Y-%m-%d')
    
    # --- BUILD DATAFRAME FOR HALO ---
    import_rows = []
    
    for filename in sorted(os.listdir(export_dir)):
        if filename.endswith(".md"):
            filepath = os.path.join(export_dir, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                lines = f.readlines()
    
            # Default values
    # Change "Knowledge Base" to reflect what you are using in Halo
            faqlists = "Knowledge Base"
            tags = ""
    
            # Parse metadata comments from top of file
            metadata_lines = []
            while lines and lines[0].startswith("<!--"):
                metadata_lines.append(lines.pop(0).strip())
    
            for line in metadata_lines:
                faq_match = re.match(r"<!-- FAQLists:\s*(.*?)\s*-->", line)
                tag_match = re.match(r"<!-- Tags:\s*(.*?)\s*-->", line)
    
                if faq_match:
                    faqlists = faq_match.group(1)
                if tag_match:
                    tags = tag_match.group(1)
    
            markdown_content = ''.join(lines)
            html_content = markdown.markdown(markdown_content)
    
            # Extract summary from filename
            summary = filename.split('-', 1)[1].rsplit('.md', 1)[0].replace('_', ' ')
    
            import_rows.append({
                "Summary": summary,
                "Details": html_content,
                "Resolution": "",
                "DateAdded": today,
                "CreatedBy": created_by,
                "FAQLists": faqlists,
                "Tags": tags
            })
    
    # --- EXPORT TO EXCEL ---
    df = pd.DataFrame(import_rows)
    df.to_excel(output_file, index=False)
    
    print(f"✅ Done! Halo HTML import file created: {output_file}")
    

    This then generates a file called Halo_KB_Import_HTML.xlsx which you can then use to import each exported post into Halo.

    Cool eh? Huge time saver 🙂


Related Topics
  • Custom html in nodebb to prevent cache

    Unsolved Configure nodebb
    18
    2 Votes
    18 Posts
    3k Views
    @Panda You’ll need to do that with js. With some quick CSS changes, it looks like this [image: 1690796279348-d619844f-fbfe-4cf1-a283-6b7364f6bf18-image.png] The colour choice is still really hard on the eye, but at least you can now read the text
  • Email validation NodeBB

    Bugs nodebb
    21
    3 Votes
    21 Posts
    6k Views
    @Panda said in Email validation NodeBB: Did you configure that as a custom change to the usual quote icon. How do you do that? I notice on NodeBB site its a solid blue quotes Yes, I changed it. NodeBB by default users the free font awesome library whereas I use the pro (paid) version SDK have access to a wider set of icons, and at different thicknesses etc. The change of colour is just simple CSS.
  • NodeBB: The global Search option

    Solved Configure search nodebb
    5
    0 Votes
    5 Posts
    963 Views
    @mventures Yes, exactly. The other icon will restart NodeBB whilst the first icon I referenced will rebuild (recompile) it. The huge strength of NodeBB over Flarum (for example) is that the code is precompiled, and called once at boot. PHP’s code has to repeatedly reload code from source making it much slower.
  • 36 Votes
    55 Posts
    15k Views
    @DownPW I see why. The code relies on the existence of [component="topic/quickreply/container"] However, this by definition means that the below has to be enabled [image: 1679077966615-aeef638f-4188-489d-a9f2-f3a26dbca9d8-image.png] It will then work [image: 1679077992245-7fb38631-e0f3-46ef-b652-00929d927b13-image.png] For some unknown reason, this is hidden in Harmony, and only shows if you select it. In v2, it seems that the <section> is deleted altogether in Persona if “Quick Reply” is disabled, meaning it won’t fire as it can’t locate that specific component. The downside is that you might not want the quick reply function, but I think it’s a PITA to scroll up to the top of the post just to reply, so I have it on
  • Blinking text Effect

    Customisation nodebb text effect css
    3
    5 Votes
    3 Posts
    1k Views
    @phenomlab I love it too @phenomlab said in Blinking text Effect: Has that “broken neon light” look that you see in films. It’s exactly that, kind of old neon signs of bar or pubs a bit cyberpunk too
  • Bootstrap Version

    Solved Customisation nodebb bootstrap
    8
    5 Votes
    8 Posts
    2k Views
    @phenomlab That will be nice once they have completed that. It will be interesting to see how long that takes. So for now I will use custom css to make it look the way I want. Frameworks just make things a little faster. Thanks @phenomlab
  • Detect if user is admin

    General nodebb javascript code
    2
    0 Votes
    2 Posts
    1k Views
    @pwsincd I think you can use userData.isAdmin = isAdmin; if I’m not mistaken - see https://community.nodebb.org/topic/15128/how-to-hide-whitelist-user-field-only-to-owner-or-admin?_=1648802303112 for an example
  • Adjusting the size of boxes in posts-list class

    Solved Customisation nodebb
    3
    3 Votes
    3 Posts
    1k Views
    @phenomlab thanks a lot, this combination works best .posts-list .posts-list-item .content { overflow: auto; max-height: 600px; }