You Read It But Can You Save It?

Jan. 27, 2012

You get to read it online, but can you take it with you? Maybe HTML 5 will be the answer.

Did you ever come across a great article and want to save it? Sure, you can bookmark it, but what if you really want to save it as a file? This is where things get a bit more difficult.

You can try printing it to get a hard copy. You can even create a PDF using Adobe's PDF tools or the free, open source PDFCreator.

You might try saving the file using the browser. Most browsers can save a page as an HTML file, and some will also create a directory to hold associated artifacts such as images.

If all else fails, you can use screen capture and print the bitmap file.

In general, though, unless there is a PDF to download or a print option to create your own PDF or hard copy, you are stuck with a lot of junk. Sometimes you may not even be able to print or save the content.

In some instances this is a form of digital rights management (DRM), where the site's owners do not want you to be able to save anything. Often the problem is simply a lack of good website design. Some sites, like Wikipedia, provide ebook creation as a service.

Part of the problem for web browsers is that web pages usually have lots of extraneous information in addition to an article. HTML 4 does little to let the browser know where the article itself sits on the page. In most cases, acquisition tools such as printing or saving simply take in everything. HTML 5 might make some of these issues easier to deal with (see What's The Difference: Between HTML 4 and HTML 5). It has tags that identify content and subcontent. In theory, it should be easier to locate the content and hence save it.
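
To make the idea concrete, here is a minimal Python sketch of how a save tool might use those HTML 5 content tags. It assumes the page wraps its main text in an article element, which many sites do not, and it uses only Python's standard html.parser module. The names ArticleExtractor and extract_article and the sample page are hypothetical, made up for this illustration. It is not how any particular browser or plug-in works.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects text found inside <article> elements and ignores the rest.

    Hypothetical helper for illustration only.
    """

    # Tags inside an article whose text we still treat as clutter (an assumption).
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <article> tags are currently open
        self.skip_depth = 0   # how many clutter tags are currently open
        self.chunks = []      # pieces of article text, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif self.depth and tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1
        elif self.depth and tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside an article and outside any clutter tag.
        if self.depth and not self.skip_depth and text:
            self.chunks.append(text)


def extract_article(html):
    """Return the article text of an HTML string, one chunk per line."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# A made-up page: navigation and footer surround the real content.
page = """
<html><body>
<nav>Home | Products | About</nav>
<article>
  <h1>You Read It But Can You Save It?</h1>
  <p>Did you ever come across a great article and want to save it?</p>
  <aside>Related links</aside>
</article>
<footer>Copyright notice</footer>
</body></html>
"""

print(extract_article(page))
```

With an HTML 4 style page that has no article element, the same function returns an empty string, which is exactly the situation where a tool has to fall back on guessing or on grabbing everything.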

Plug-ins like GrabMyBooks for Firefox or dotEPUB for Chrome allow saving of content as ebooks. Applications like Calibre, a free ebook manager, provide conversion tools so PDF files can be changed to EPUB files.

Unfortunately, all these tools are marginal at best. Often the limitations are due to decoding or encoding issues. Some file formats are not as expressive as others. Sometimes it is just a matter of not having a complete translation or identification mechanism. The result is usually missing or junk tables, oddly sized or missing images, and sometimes a totally useless file.

Why is this an important issue? Not all useful information is in downloadable app notes or spec sheet PDFs.

Likewise, content may be used later on different platforms. I have a smartphone, an e-reader and numerous PCs. Each works better with different file formats. All are often used when not connected to the Internet, so offline content is critical.

Right now you need to be an expert in capture, formats and conversions to save online information.

Finally, there is the issue of having the content available to you when you want it. If you have a file, then it shouldn't change even if the original moves or disappears from the Internet.

So how do you save content? Maybe next time we can talk about video.