zim-parser-0.2.1.0: Read and parse ZIM files

Copyright(c) Robbin C.
LicenseGPLv3
MaintainerRobbin C.
Stabilityunstable
Portabilityportable
Safe HaskellNone
LanguageHaskell2010

Codec.Archive.Zim.Parser

Contents

Description

This is a library for parsing ZIM (http://openzim.org) files. ZIM files contain offline web content (eg, Wikipedia) which can be browsed locally without an Internet connection.

The API is meant to be intuitive for normal use-cases.

To get content for "A/index.htm" from ZIM file "file.zim":

> mimeContent <- "file.zim" `getContent` Url "A/index.htm"
> :t mimeContent
mimeContent :: Maybe (B8.ByteString, BL.ByteString)
> print mimeContent
Just ("text/html", "<html><head>...</html>")

The above will open the file, parse the ZIM header, lookup the MIME type and content of the URL, close the file and return the MIME type and content as a pair. Note that content is a lazy bytestring.

The above operation should suffice for a simple webserver serving a ZIM file. For finer control, it is possible to cache and reuse the file handle and the ZIM header.

> hdl <- openBinaryFile "file.zim" ReadMode
> hdr <- getHeader hdl
> :t hdr
hdr :: ZimHeader
> (hdl, hdr) `getContent` Url "A/index.htm"
Just ("text/html", "<html><head>...</html>")

ZIM files of Wikimedia Foundation (Wikipedia, Wikibooks, etc) can be found at http://ftpmirror.your.org/pub/kiwix/zim.

Below is a full example of a Scotty web server that serves a ZIM file (specified on command line) on localhost port 3000:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (liftIO)
import Data.Text.Lazy (toStrict, fromStrict)
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import System.Environment (getArgs)
import Network.HTTP.Types.Status (status404)
import Web.Scotty
import Codec.Archive.Zim.Parser (getMainPageUrl, getContent, Url(..))

main :: IO ()
main = do
    [fp] <- getArgs
    scotty 3000 $ do
      get "/" (redirectToZimMainPage fp)
      get (regex "^/(./.*)$") (serveZimUrl fp)
      notFound $ text "Invalid URL!"

redirectToZimMainPage :: FilePath -> ActionM ()
redirectToZimMainPage fp = do
    res <- liftIO $ getMainPageUrl fp
    case res of
      Nothing -> do
        status status404
        text "This ZIM file has no main page specified!"
      Just (Url url) -> redirect . fromStrict $ decodeUtf8 url

serveZimUrl :: FilePath -> ActionM ()
serveZimUrl fp = do
    url <- (encodeUtf8 . toStrict) <$> param "1"
    res <- liftIO $ fp `getContent` Url url
    case res of
      Nothing -> do
        liftIO . putStrLn $ "Invalid URL: " ++ show url
        status status404
        text $ "Invalid URL!"
      Just (mimeType, content) -> do
        liftIO . putStrLn $ "Serving: " ++ show url
        setHeader "Content-Type" (fromStrict $ decodeUtf8 mimeType)
        raw content

Feedback and contributions are welcome on http://github.com/robbinch/zim-parser.

Synopsis

Functions

getDE :: (ZimGetDE k, RunZim h) => h -> k -> IO ZimDirEnt #

getMainPageUrl :: RunZim h => h -> IO (Maybe Url) #

Returns URL of main page in ZIM. This URL can be used for redirecting to the actual page.

getContent :: (ZimGetContent k, RunZim h) => h -> k -> IO (Maybe (ByteString, ByteString)) #

Get (MIME type, Content). Note that Content is lazy.

searchDE :: (ZimSearchDE k, RunZim h) => h -> k -> IO [(Int, ZimDirEnt)] #

Search for a Directory Entry on a RunZim. When searching for a:

Url
Returns either 0 (not found) or 1 element.
Title
Returns either 0 (not found) or 1 element.
TitlePrefix
Returns either 0 (not found) or 2 elements corresponding to lower and upper bound of titles containing the prefix.

type MimeList = Array Int ByteString #

List of Mime Types

mkNsTitle :: Char -> ByteString -> Title #

Construct a Title with a Namespace.

mkNsTitlePrefix :: Char -> ByteString -> TitlePrefix #

Construct a TitlePrefix with a Namespace.

mkNsUrl :: Char -> ByteString -> Url #

Construct a Url with a Namespace.

class RunZim h #

Instances of this class represent a Zim File and are able to perform ZIM operations (getMimeList, getContent, etc). Valid instances include a Handle to a ZIM file, a FilePath to a ZIM file, or a (Handle, ZimHeader) where ZimHeader is parsed previously (so it does not need to be reparsed).

Minimal complete definition

runZim

Instances
RunZim Handle # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

runZim :: Handle -> (Handle -> ZimHeader -> IO a) -> IO a

RunZim FilePath # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

runZim :: FilePath -> (Handle -> ZimHeader -> IO a) -> IO a

RunZim (Handle, ZimHeader) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

runZim :: (Handle, ZimHeader) -> (Handle -> ZimHeader -> IO a) -> IO a

class ZimGetDE k #

Minimal complete definition

getDE

Instances
ZimGetDE TitleIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getDE :: RunZim h => h -> TitleIndex -> IO ZimDirEnt #

ZimGetDE UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getDE :: RunZim h => h -> UrlIndex -> IO ZimDirEnt #

class ZimSearchDE k #

Minimal complete definition

searchDE

Instances
ZimSearchDE TitlePrefix # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

searchDE :: RunZim h => h -> TitlePrefix -> IO [(Int, ZimDirEnt)] #

ZimSearchDE Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

searchDE :: RunZim h => h -> Title -> IO [(Int, ZimDirEnt)] #

ZimSearchDE Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

searchDE :: RunZim h => h -> Url -> IO [(Int, ZimDirEnt)] #

class ZimGetContent k #

Minimal complete definition

getContent

Instances
ZimGetContent Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> Title -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> Url -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent TitleIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> UrlIndex -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent ZimDirEnt # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent (MimeList, Title) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> (MimeList, Title) -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent (MimeList, Url) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> (MimeList, Url) -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent (MimeList, TitleIndex) # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent (MimeList, UrlIndex) # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent (MimeList, ZimDirEnt) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Exceptions

data ZimException #

Other than the below, ErrorCall can be thrown by LZMA library if there is a problem with decompression.

Constructors

ZimInvalidMagic

ZIM file has invalid magic number (anything other than 72173914).

ZimParseError String

There is an error in parsing.

ZimIncompleteInput

There is insufficient bytes required to parse.

ZimInvalidIndex Int

The given index (URL, title or cluster) is out of bounds for this ZIM file.

ZIM Header

data ZimHeader #

Constructors

ZimHeader 

Fields

Instances
Eq ZimHeader # 
Instance details

Defined in Codec.Archive.Zim.Parser

Show ZimHeader # 
Instance details

Defined in Codec.Archive.Zim.Parser

RunZim (Handle, ZimHeader) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

runZim :: (Handle, ZimHeader) -> (Handle -> ZimHeader -> IO a) -> IO a

ZIM Directory Entry

data ZimDirEntType #

There are 4 types of directory entries. Most content in a ZIM file are usually ZimArticleEntry or ZimRedirectEntry.

data ZimDirEnt #

Constructors

ZimDirEnt 

Fields

Instances
Eq ZimDirEnt # 
Instance details

Defined in Codec.Archive.Zim.Parser

Show ZimDirEnt # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent ZimDirEnt # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent (MimeList, ZimDirEnt) # 
Instance details

Defined in Codec.Archive.Zim.Parser

newtype UrlIndex #

Wrapper for URL index

Constructors

UrlIndex Int 
Instances
Eq UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Ord UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Show UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

ZimGetContent UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> UrlIndex -> IO (Maybe (ByteString, ByteString)) #

ZimGetDE UrlIndex # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getDE :: RunZim h => h -> UrlIndex -> IO ZimDirEnt #

ZimGetContent (MimeList, UrlIndex) # 
Instance details

Defined in Codec.Archive.Zim.Parser

newtype TitleIndex #

Wrapper for Title index

Constructors

TitleIndex Int 

newtype BlobNumber #

Wrapper for Blob number

Constructors

BlobNumber Int 

newtype Cluster #

Wrapper for Cluster

Constructors

Cluster 

newtype Blob #

Wrapper for Blob

Constructors

Blob 

Fields

newtype Url #

Wrapper for Url

Constructors

Url ByteString 
Instances
Eq Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

(==) :: Url -> Url -> Bool #

(/=) :: Url -> Url -> Bool #

Ord Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

compare :: Url -> Url -> Ordering #

(<) :: Url -> Url -> Bool #

(<=) :: Url -> Url -> Bool #

(>) :: Url -> Url -> Bool #

(>=) :: Url -> Url -> Bool #

max :: Url -> Url -> Url #

min :: Url -> Url -> Url #

Show Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

showsPrec :: Int -> Url -> ShowS #

show :: Url -> String #

showList :: [Url] -> ShowS #

ZimSearchDE Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

searchDE :: RunZim h => h -> Url -> IO [(Int, ZimDirEnt)] #

ZimGetContent Url # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> Url -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent (MimeList, Url) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> (MimeList, Url) -> IO (Maybe (ByteString, ByteString)) #

data Title #

Wrapper for Title

Instances
Eq Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

(==) :: Title -> Title -> Bool #

(/=) :: Title -> Title -> Bool #

Ord Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

compare :: Title -> Title -> Ordering #

(<) :: Title -> Title -> Bool #

(<=) :: Title -> Title -> Bool #

(>) :: Title -> Title -> Bool #

(>=) :: Title -> Title -> Bool #

max :: Title -> Title -> Title #

min :: Title -> Title -> Title #

Show Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

showsPrec :: Int -> Title -> ShowS #

show :: Title -> String #

showList :: [Title] -> ShowS #

ZimSearchDE Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

searchDE :: RunZim h => h -> Title -> IO [(Int, ZimDirEnt)] #

ZimGetContent Title # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> Title -> IO (Maybe (ByteString, ByteString)) #

ZimGetContent (MimeList, Title) # 
Instance details

Defined in Codec.Archive.Zim.Parser

Methods

getContent :: RunZim h => h -> (MimeList, Title) -> IO (Maybe (ByteString, ByteString)) #

ZIM file format

Following is a short summary of the ZIM file format. The authoritative reference is at http://www.openzim.org/wiki/ZIM_file_format.

1. ZIM header

This is an 80-byte header (see ZimHeader). Among other things, it contains file offsets to the below.

2. List of MIME types

This is a sequence of null-terminated strings (eg. text/html, text/javascript). The last string is zero length, so the end always consists of 2 consecutive null bytes.

3. List of URLs

This is a sequence of 8-byte file offsets, each pointing to a directory entry. This list is sorted by the directory entries' URL.

4. List of Titles

This is a sequence of 4-byte indices, each pointing to a URL above (which in turn point to a directory entry). This list is sorted by the directory entries' Title.

5. Directory Entries

This is a sequence of Directory Entries (see ZimDirEnt). The first 2 bytes determine the type of this entry, which also determine the length. Contents include:

a. MIME type

This 2-byte field means:

0xffff
This directory entry is a ZimRedirectEntry.
0xfffe
This directory entry is a ZimLinkTarget.
0xfffd
This directory entry is a ZimDeletedEntry.
any other value
This directory entry is a ZimArticleEntry and this index into the MIME list from above determines its MIME type.

b. Namespace

This single character determines the directory entry's namespace. (eg. A for articles, I for images, etc.) The comprehensive list is at http://www.openzim.org/wiki/ZIM_file_format#Namespaces.

c. Cluster and Blob number

Only for ZimArticleEntry, this is the directory entry's Cluster and Blob number. The Cluster number is a 4-byte index into the list of Clusters below. The Blob number refers to a block inside the (decompressed) cluster. Together, they provide the content of this directory entry.

d. URL and Title

These 2 null-terminated strings represent the URL and Title of this directory entry respectively. If the Title is empty, it is taken to be the same as the URL.

6. List of Clusters

This is a list of 8-byte file offsets, each pointing to a cluster in the file. -- The end of a cluster is also the start of the next cluster. Therefore, the length of a cluster is the difference between the adjacent offsets. For the last cluster, the end is the Checksum file offset, as the Checksum is always the last 16 bytes of a ZIM file.

a. Compression Type

The first byte of the cluster determines if it is uncompressed (eg. PNG image) or compressed with LZMA (eg. HTML).

0 or 1
No compression
4
Compressed with LZMA

b. List of Blobs

This is a list of 4-byte offsets, each pointing inside this cluster. The end of a blob is also the start of the next blob. Therefore, the length of a blob is the difference between the adjacent offsets. The last offset points to the end of the data area so there is always one more offset than blobs.