Tim's PHP Scripts

Word to HTML

(wordtohtml)

This PHP script converts a Word .DOCX file to HTML and displays the resulting code (including images) in a web page. It recognises nearly all the formatting, themes, images, tables, etc. in the original Word DOCX document. It also displays any mathematical equations in the document, either standalone or inline with normal text. The only significant exceptions are WordArt and tabs. Tabs are very difficult to replicate in HTML because the width of a web page is very flexible and screen sizes differ.

The resultant HTML should look very much like the original.
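
As a rough illustration of the kind of mapping involved, here is a minimal sketch that turns one WordprocessingML paragraph into an HTML paragraph. It is not code from wordtohtml: the function name `paragraphXmlToHtml` and the sample input are made up for this example, and it only handles plain, bold and italic runs. In a real DOCX file the equivalent XML is found in `word/document.xml` inside the ZIP package.

```php
<?php
// Hypothetical helper, NOT part of wordtohtml: converts the runs (w:r) of a
// single WordprocessingML paragraph into an HTML <p> element.
// Simplification: a <w:b> or <w:i> element is treated as "on" whenever it is
// present, ignoring any w:val attribute that might switch it off.
function paragraphXmlToHtml($xml)
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', $ns);

    $html = '';
    foreach ($xpath->query('//w:r') as $run) {
        // Collect the text of the run and escape it for HTML.
        $text = '';
        foreach ($xpath->query('w:t', $run) as $t) {
            $text .= $t->textContent;
        }
        $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

        // Wrap the text according to the run properties (w:rPr).
        if ($xpath->query('w:rPr/w:i', $run)->length > 0) {
            $text = '<i>' . $text . '</i>';
        }
        if ($xpath->query('w:rPr/w:b', $run)->length > 0) {
            $text = '<b>' . $text . '</b>';
        }
        $html .= $text;
    }
    return '<p>' . $html . '</p>';
}

// Made-up sample paragraph: one plain run followed by one bold run.
$sample = '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    . '<w:r><w:t xml:space="preserve">Plain and </w:t></w:r>'
    . '<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>'
    . '</w:p>';

echo paragraphXmlToHtml($sample), "\n";
```

The real script has to deal with far more than this (styles, themes, tables, images, numbering and so on), so treat the sketch only as a picture of the general idea.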

PHP 5 or later is required. The script also works on PHP 8.2 and later.
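
If you want to check a server before installing, something like the following sketch could be used. It is not part of wordtohtml, and the function name `phpSupportLevel` and its return values are invented for this example. The thresholds are the ones given on this page: PHP 5 as the minimum, and PHP 7.2 for the symbol substitution described in feature 17 below.

```php
<?php
// Hypothetical guard, NOT part of wordtohtml. It classifies a PHP version
// string using the thresholds stated on this page.
function phpSupportLevel($version)
{
    if (version_compare($version, '5.0.0', '<')) {
        return 'unsupported';
    }
    if (version_compare($version, '7.2.0', '<')) {
        return 'basic'; // the script runs, but without symbol substitution
    }
    return 'full';
}

// Report the level for the PHP version running this file.
echo phpSupportLevel(PHP_VERSION), "\n";
```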

There is a test/demonstration page here where you can try this script out with one of your own Word documents. Let me know if you see any issues or errors.


Features

  1. It will use the correct font and font size (assuming that a common font is being used).
  2. Text formatting - Bold, Underlining (various styles/colours), Italic, Single & Double Strikethrough, Superscript and Subscript are all replicated, along with text alignment: Left, Centre, Right, Justified. Also indented and hanging text.
  3. Will recognise horizontal lines across the page - dotted, dashed, double and thicker single lines.
  4. It will display multi-level lists with the same alpha-numeric numbering as the original Word document.
  5. It will display tables and will cope with merged cells, cell borders and cell colours etc.
  6. In the default mode, tables are replicated as closely as possible to the formatting and relative size in the original Word DOCX document. E.g. if a table takes half the width of the page then it will take up half the width of the screen. Tables are also left, centre or right aligned as per Word, and text can run alongside a table on either its left or right side. However, an option is provided for tables to always take up 100% of the screen width (to allow for better display on mobile devices etc.). A 'DIV' with the class name 'tab' is placed around every table to allow for common external CSS formatting if required.
  7. Both footnote and endnote references are located in the correct place in the text. All the actual footnotes and endnotes are located at the end of the text (it is difficult to put them anywhere else in a web page). Links are provided to jump from a reference to the actual note and then back again.
  8. The bookmarks in a 'Table of Contents' or similar provide a link to the correct section of the document as per the original Word document. A return link is also provided.
  9. In the default mode, images are formatted, sized and located very similarly to the original Word DOCX document, which is fine for desktop computers. However, an option is provided to allow external CSS formatting to be used instead (e.g. to allow for better display on mobile devices etc.). In this mode each image is given a unique CSS class name ('Wimg1' for the first image, 'Wimg2' for the second image, etc.) so that each image can be formatted as desired. There is also an option to omit images from the resultant HTML if desired.
  10. Href hyperlink targets follow what is set in Word - enables one to select whether to open a link in the same tab/window, or a new one.
  11. Will recognise and display most images from the Word document, including transparency in PNG images. Will replicate approximately the positioning and flow of text and images.
  12. Will recognise when images are cropped and display the correct cropped image.
  13. Will recognise when images have been rotated (0-360deg) or flipped in Word and display the image correctly.
  14. By default images are saved into the 'images' directory, which is automatically created if it does not exist. An option is provided to enable the name of this directory to be changed if desired.
  15. The browser is now prevented from using cached images. This stops it showing old images instead of the new ones after the images in a Word document have been updated.
  16. Text boxes are recognised along with any rotation they may have. Note 180deg rotation is often used as it is the only practical method of having upside-down text in Word. I have tried to replicate approximately whether they are near the left, centre or right of the page. Note that due to a combination of how Word puts them in the DOCX file and the vagaries of trying to position things accurately in HTML, the position of these text boxes is only approximate, unless they are constrained in a table cell.
  17. It will recognise symbols from most of the symbol character sets used in Word (Wingdings, Wingdings 2, Wingdings 3, Webdings, Symbol, Zapf Dingbats). Unfortunately, in the main, these character sets are not commonly available on the web. However, most of the characters or equivalents are available in the Unicode character set, so these are used instead. Available when using PHP 7.2 and above. Please note that not all browsers can display the full Unicode character set.
  18. Will recognise nearly all Word Mathematical Equations, both standalone and inline with text. It does this using the online version of MathJax (so internet access is required for this). Note that MathJax does not support the Surface Integral and Volume Integral symbols, so multiple Line Integral Symbols are used instead. Also the Double Square Bracket is not supported, so any occurrences of this are replaced by the Double Pipe.
  19. It will now also display headers and footers including any images in them. The default is to show them - when there is more than one set of headers and footers in the document it shows:-
    a) Second page onwards when the first page is different.
    b) Odd numbered pages when the document has different ones for odd and even pages.
    You can select whether to show the default headers/footers, or one of the others if there is more than one. You can also opt not to display them.
  20. The resultant HTML code is designed to be used either as is, or (after saving) included in another HTML file. However, an option is provided to add an HTML header, so that after saving it can be used as a standalone file (along with any images that it contains).
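
To give an idea of how the image features above (9, 13 and 15) can be expressed in HTML, here is a minimal sketch that builds an `<img>` tag. It is not the markup wordtohtml actually produces. Only the 'Wimg' class names and the default 'images' directory come from the list above; the function name `buildImgTag`, the file name, the `?v=` query parameter and the inline `transform` style are assumptions made for this example.

```php
<?php
// Hypothetical helper, NOT taken from wordtohtml. Only the 'Wimg' class
// prefix and the 'images' directory name come from the feature list.
function buildImgTag($index, $file, $version, $rotationDeg = 0, $flipHorizontal = false)
{
    // Cache-busting (assumed technique): a changed query string makes the
    // browser treat the URL as new, so a stale cached copy is not reused.
    // In practice $version could be the file's modification time.
    $src = 'images/' . rawurlencode($file) . '?v=' . rawurlencode((string) $version);

    // CSS transforms for rotation and horizontal flip. The order of the two
    // transforms is an assumption and affects the visual result.
    $transforms = array();
    if ($rotationDeg % 360 !== 0) {
        $transforms[] = 'rotate(' . (int) $rotationDeg . 'deg)';
    }
    if ($flipHorizontal) {
        $transforms[] = 'scaleX(-1)';
    }
    $style = $transforms ? ' style="transform: ' . implode(' ', $transforms) . ';"' : '';

    return '<img class="Wimg' . (int) $index . '" src="'
        . htmlspecialchars($src, ENT_QUOTES, 'UTF-8') . '"' . $style . ' alt="">';
}

// First image: rotated 90 degrees and flipped horizontally.
echo buildImgTag(1, 'image1.png', 1700000000, 90, true), "\n";
// Second image: no rotation or flip, so no style attribute is emitted.
echo buildImgTag(2, 'image2.png', 5), "\n";
```

With class names like these, an external stylesheet can target a single image (for example `.Wimg1 { max-width: 100%; }`) without touching the generated HTML.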

The latest version of this script (v.2.1.14) can be downloaded from either:-

Github - https://github.com/timy352/wordtohtml

PHP Classes - https://www.phpclasses.org/package/12250-PHP-Convert-Microsoft-Word-DOCX-document-to-HTML.html