UGTS Document #3 - Last Modified: 8/29/2015 3:23 PM
How do I convert a Word document into clean HTML?
Normally, when you save a Word document as a web page, Word adds a multitude of special formatting directives, VML, and conditional comments so that the page is optimized to display just like it does in Word. The output web page displays optimally in Word or Internet Explorer, and it degrades gracefully if you try to view it in a different or older web browser.
While this is good if you're taking the HTML and using it in a standalone fashion such as an email or a single page, it is not good if your intention is to put the HTML into a website, standardize it into your style of CSS, hand-edit the content, and adjust tags in professional web editing software.
If you want the HTML to be clean and don't mind a temporary loss of formatting which requires some tweaking, then you can use the 'Web Options' feature in Word, and the UGTS program 'HTML Cleaner' to get what you're looking for.
First, create a temporary folder on your desktop to hold the intermediate files during the conversion. Open the document in Word 2000 or higher, choose Save As..., and set 'Save as type' to Web Page (.htm). Instead of pressing Save right away, use the Tools button in the upper right corner and go to 'Web Options'. On the General tab, uncheck 'Disable features not supported by:'. On the Pictures tab, check both 'Rely on VML...' and 'Allow PNG...'. On the Encoding tab, set 'Save this document as' to 'US-ASCII'. Then press OK, and Save the output to your temporary folder.
Next, run HTML Cleaner and select the output file (alternatively, add a shortcut to HTML Cleaner to your SendTo folder, then right-click the output file and send it to HTML Cleaner). When cleaning is done (it should be virtually instant), a new file is written to the same folder with a '-Cleaned' filename, and HTML Cleaner opens it in a web browser. This file contains only images, line breaks, and simple text formatting tags. Right-click the page in the browser and choose 'View Source'. Select all with Ctrl+A, copy with Ctrl+C, and paste into the Source view of a new page in your professional web editing program.
Next, take the image files from the conversion (if any) and copy their folder to the appropriate place on your website. Go back to the Source view of the new web page, and search-and-replace every instance of the old image folder path with the new path. UGTS recommends that you store the images with the HTML file like Windows does it, at least at first, until you know that you will need to share some of the image files between web pages. It is far easier to do it this way than to attempt to rename dozens of image files. After you've fixed all the image links, view the web page in your web editor and verify that no content has been lost.
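If your web editor's search-replace is awkward to use, the path change can also be scripted. The following is only a minimal sketch in Python, not part of HTML Cleaner: the function name rewrite_image_paths and the folder names in the usage example are hypothetical, and it assumes the image links are plain src attributes that begin with the old folder name.

```python
import re


def rewrite_image_paths(html: str, old_folder: str, new_folder: str) -> str:
    """Swap old_folder for new_folder at the start of src attribute values.

    Only src="..." (or src='...') values are touched, so body text that
    happens to mention the old folder name is left alone.
    """
    old_prefix = re.escape(old_folder.rstrip("/")) + "/"
    new_prefix = new_folder.rstrip("/") + "/"
    pattern = re.compile(r"(\bsrc\s*=\s*[\"'])" + old_prefix, re.IGNORECASE)
    # A function replacement avoids surprises if new_folder contains backslashes.
    return pattern.sub(lambda m: m.group(1) + new_prefix, html)


if __name__ == "__main__":
    # Hypothetical folder names, for illustration only.
    sample = '<p>Figure 1</p><img src="report_files/image001.png">'
    print(rewrite_image_paths(sample, "report_files", "/images/report"))
```

If the folder name contains spaces, the saved HTML may encode them (for example as %20), so check one image link in the Source view and pass the folder name exactly as it appears there.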
Now, remove the leading and trailing HTML and replace it with the standard HTML master page declarations that your website uses. Then view the page again and verify that the master page layout is being applied correctly.
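As a rough illustration of this step, the sketch below keeps only the markup between the body tags and drops it into a page skeleton. It is an assumption-laden example, not a UGTS tool: the MASTER_TEMPLATE text, the /css/site.css path, and the function names are all hypothetical stand-ins for whatever master page declarations your own website uses.

```python
import re

# Hypothetical page skeleton; substitute your site's real master page markup.
MASTER_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{title}</title>
<link rel="stylesheet" href="/css/site.css">
</head>
<body>
{content}
</body>
</html>
"""


def extract_body(html: str) -> str:
    """Return the markup between <body ...> and </body>.

    If no body element is found, the input is returned unchanged.
    """
    match = re.search(r"<body\b[^>]*>(.*?)</body\s*>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else html


def wrap_in_master(html: str, title: str) -> str:
    """Discard the leading and trailing HTML and apply the page skeleton."""
    return MASTER_TEMPLATE.format(title=title, content=extract_body(html))
```

Calling wrap_in_master on the contents of the '-Cleaned' file returns a complete page whose head and body tags come from the template rather than from Word.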
Next, open the web page in Design view alongside the original Word document, so that you can view them side by side. Define any styles that you may need in your website's CSS, and then use the Apply Styles option of your web editor to re-apply the styles that you need to your web content. Also take any bulleted lists that the Word document had, and replace them with actual HTML bulleted lists. Repeat the reformatting until your web page looks the way you want it to.
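For documents with many bulleted lists, the list conversion can be roughed out with a script before the final pass in the web editor. The sketch below rests on an assumption you should check against your own cleaned file: that each Word bullet arrives as its own paragraph beginning with a bullet character or entity (such as · or &middot;), and that the page does not yet contain any real li elements. The function name bullets_to_list is hypothetical.

```python
import re

# A paragraph whose text starts with a bullet character or entity.
BULLET_PARAGRAPH = re.compile(
    r"<p\b[^>]*>\s*(?:·|•|&middot;|&bull;|&#183;|&#8226;)(?:\s|&nbsp;)*(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)

# One or more consecutive <li> elements, separated only by whitespace.
LIST_ITEM_RUN = re.compile(r"(?:<li>.*?</li>\s*)+", re.DOTALL)


def bullets_to_list(html: str) -> str:
    """Turn runs of bullet-style paragraphs into <ul> lists."""
    # Step 1: each bullet paragraph becomes a list item.
    items = BULLET_PARAGRAPH.sub(
        lambda m: "<li>" + m.group(1).strip() + "</li>", html)
    # Step 2: each run of adjacent list items is wrapped in one <ul>.
    return LIST_ITEM_RUN.sub(
        lambda m: "<ul>\n" + m.group(0).rstrip() + "\n</ul>\n", items)
```

Numbered lists and nested bullets are not handled here; those are still best fixed by hand while comparing against the original Word document.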