Creating PDFs with Rust

I have been archiving content from a Blogger website and was curious how feasible it would be to create a PDF document of the entire blog's content. I've been running a scraper that saves each blog post as JSON, and have previously made this content available in an iOS app. There were several PDF crates to choose from, and deserializing the JSON into strings and writing them to a PDF document was fairly straightforward. Including images in the PDF was a bit trickier.

Initially I used printpdf but found it to be too low level: it required a lot of tweaking and customization, such as calculating the available page area depending on font size, weight, etc. genpdf is built on top of printpdf (and rusttype) and provides a higher-level API for quickly generating PDFs. With the help of serde_json I had nearly everything necessary to get started.
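For reference, a Cargo.toml dependency list consistent with the crates mentioned throughout this post might look something like this (the version numbers are illustrative, not necessarily the ones I used):

```toml
[dependencies]
genpdf = { version = "0.2", features = ["images"] }  # high-level PDF generation on top of printpdf
serde = { version = "1", features = ["derive"] }
serde_json = "1"
bincode = "1"       # serializing the image-byte HashMap to disk
indicatif = "0.17"  # progress bars
rayon = "1"         # parallel iteration over posts
reqwest = { version = "0.11", features = ["blocking"] }
regex = "1"
chrono = "0.4"      # date parsing for chronological sorting
image = { version = "0.24", features = ["jpeg", "png", "gif"] }
```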



  • I defined a struct describing how I've saved blog posts as JSON. Each post has an associated title, post content, date, and a variable number of embedded images. The JSON I've been using stores only the web address of these images on the Blogger site in order to load them in the app.
  • These images were the most difficult part of the project. The blog doesn't use consistent image file types and so I needed to spend some time looking for a flexible solution that would handle a wide variety.
  • I needed to write a small binary to download the image data as raw bytes using these URLs. I used the crate bincode to serialize a HashMap having image URLs as keys and raw bytes as values. This allows the HashMap to be "saved" as any file would, and then "loaded" by the main crate discussed on this page.
  • This allows the PDF generation logic to perform quick lookups of image URLs in the JSON data and use these existing bytes to embed images in the PDF, which makes the process much faster when iterating through the development of the PDF generator, since it avoids having to re-download all of the image data at each iteration.
  • I've liked using progress bar indicators in my CLIs ever since I discovered tqdm for Python.
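The lookup-or-download idea in the bullets above can be sketched with just the standard library; `get_or_fetch` and its `fetch` closure are hypothetical stand-ins for the real cache `HashMap` and the HTTP download:

```rust
use std::collections::HashMap;

// Look up cached bytes by URL; only run the (expensive) fetch on a cache miss.
// The closure stands in for the real HTTP download.
fn get_or_fetch<'a>(
    cache: &'a mut HashMap<String, Vec<u8>>,
    url: &str,
    fetch: impl FnOnce(&str) -> Vec<u8>,
) -> &'a [u8] {
    cache.entry(url.to_string()).or_insert_with(|| fetch(url))
}
```

On the second call for the same URL, the closure never runs and the cached bytes are returned, which is exactly why iterating on the PDF generator becomes fast once the image bytes are on disk.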

 
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Post {
    title: String,
    content: String,
    date: String,
    images: Vec<String>,
}   

// Binary used to download image data as raw bytes.
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read, Write};
use std::sync::Mutex;

use indicatif::{ProgressBar, ProgressStyle};
use rayon::prelude::*;
use reqwest::blocking;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let posts: Vec<Post> = serde_json::from_reader(File::open("backup.json")?)?;
    let image_data: Mutex<HashMap<String, Vec<u8>>> = Mutex::new(HashMap::new());

    let bar = ProgressBar::new(posts.len() as u64);
    bar.set_style(
        ProgressStyle::default_bar()
            .template(
                "{spinner:.green} [{elapsed_precise}] [{bar:40.cyan/blue}] {pos}/{len} {msg}",
            )?
            .progress_chars("#>-"),
    );

    posts.par_iter().for_each(|post| {
        bar.set_message(format!("Downloading image for: {}", post.title));
        for image_url in &post.images {
            match download_image(image_url) {
                Ok(data) => {
                    let mut data_lock = image_data.lock().unwrap();
                    data_lock.insert(image_url.clone(), data);
                }
                Err(e) => {
                    eprintln!("Failed to download {}: {}", image_url, e);
                }
            }
        }
        bar.inc(1);
    });
    bar.finish_with_message("Image downloads complete!");

    // Serialize and save image data using bincode
    save_image_data(&image_data.lock().unwrap(), "images.bin")?;

    Ok(())
}

fn download_image(url: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let response = blocking::get(url)?;
    Ok(response.bytes()?.to_vec())
}

fn save_image_data(image_data: &HashMap<String, Vec<u8>>, path: &str) -> io::Result<()> {
    let encoded: Vec<u8> = bincode::serialize(image_data).expect("Serialization failed");
    let mut file = File::create(path)?;
    file.write_all(&encoded)?;
    Ok(())
}

fn load_image_data(path: &str) -> io::Result<HashMap<String, Vec<u8>>> {
    let mut file = File::open(path)?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    let decoded: HashMap<String, Vec<u8>> =
        bincode::deserialize(&buffer).expect("Deserialization failed");
    Ok(decoded)
}
                        




  • That image data can now be loaded back into a HashMap for quick lookups of image data.
  • JSON data describing archived blog content is also loaded into a Vec of posts and sorted to match chronological order. This means the first pages of the PDF will be the earliest blog posts.
  • I've used a regular expression to adjust how line breaks are translated from the JSON strings into PDF paragraphs.
  • genpdf exposes some API methods to set formatting options such as font, page size, margins, and more. I've mostly used standard options, with the addition of a Google Font (Open Sans) that I like.
  • I've set up another progress indicator; it's always nice to know what a CLI is up to during execution.
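As a std-only illustration of why the dates need parsing before sorting (the real code uses chrono's NaiveDate::parse_from_str with the format "%A %d %B %Y"), a hypothetical date_key helper could turn a date string like "Monday 01 January 2024" into a sortable (year, month, day) tuple:

```rust
// Hypothetical std-only sortable key for dates formatted "%A %d %B %Y",
// assuming English month names. A plain string sort would order these
// alphabetically, not chronologically.
fn date_key(date: &str) -> Option<(i32, u32, u32)> {
    const MONTHS: [&str; 12] = [
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December",
    ];
    let parts: Vec<&str> = date.split_whitespace().collect();
    // Expected layout: [weekday, day, month, year]
    let (day, month, year) = match parts.as_slice() {
        [_, d, m, y] => (*d, *m, *y),
        _ => return None,
    };
    let month = MONTHS.iter().position(|name| *name == month)? as u32 + 1;
    Some((year.parse().ok()?, month, day.parse().ok()?))
}
```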
// Binary focused on PDF generation.
use std::collections::HashMap;
use std::fs::File;

use chrono::NaiveDate;
use genpdf::{elements, style, Alignment, Element};
use indicatif::{ProgressBar, ProgressStyle};
use regex::Regex;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let image_data: HashMap<String, Vec<u8>> = load_image_data("images.bin")?;
    let mut posts: Vec<Post> = serde_json::from_reader(File::open("backup.json")?)?;
    posts.sort_by_key(|post| NaiveDate::parse_from_str(&post.date, "%A %d %B %Y").ok());

    let newlines_re = Regex::new(r"\s*\n\s*").unwrap();

    let font_family =
        genpdf::fonts::from_files("./fonts", "OpenSans", None).expect("Failed to load font family");

    let mut decorator = genpdf::SimplePageDecorator::new();
    decorator.set_margins(10);

    let mut doc = genpdf::Document::new(font_family);
    doc.set_paper_size(genpdf::PaperSize::A4);
    doc.set_page_decorator(decorator);

    let bar = ProgressBar::new(posts.len() as u64);
    bar.set_style(
        ProgressStyle::default_bar()
            .template(
                "{spinner:.green} [{elapsed_precise}] [{bar:40.cyan/blue}] {pos}/{len} {msg}",
            )?
            .progress_chars("#>-"),
    );
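To make the regex step concrete: replacing `\s*\n\s*` with a single newline trims whitespace around line breaks and collapses runs of blank lines. A hypothetical std-only equivalent for interior newlines (unlike the regex, this also trims the outermost lines):

```rust
// Std-only sketch of the `\s*\n\s*` -> "\n" normalization: trim whitespace
// around each line break and drop blank lines entirely.
fn normalize_newlines(text: &str) -> String {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}
```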
                    




  • The sorted Vec of blog posts is iterated over and the title of each post is "pushed" onto each page of the PDF, along with its date and main content.
    for post in &posts {
        bar.set_message(format!("Processing post: {}", post.title));

        doc.push(elements::PageBreak::new());

        let title_style = style::Style::new().with_font_size(24).bold();
        // Decode the HTML entity left over from the scraped Blogger content.
        let title_element = elements::Paragraph::new(post.title.replace("&amp;", "&"))
            .aligned(Alignment::Left)
            .styled(title_style);
        doc.push(title_element);

        let date_style = style::Style::new().with_font_size(12).bold();
        let date_element = elements::Paragraph::new(post.date.clone())
            .aligned(Alignment::Left)
            .styled(date_style);
        doc.push(date_element);

        doc.push(elements::Break::new(1));
        let cleaned_content = newlines_re.replace_all(&post.content, "\n");
        for line in cleaned_content.lines() {
            let content_element = elements::Paragraph::new(line.replace("&amp;", "&"));
            doc.push(content_element);
        }
                        


  • Next, the images for each post (if any) are pushed onto the page.
  • Unfortunately genpdf doesn't support images with a transparency layer, so I've needed to exclude these for now.
  • genpdf does expose a from_dynamic_image() method; pairing it with image::load_from_memory() on the saved bytes is perfect for how I've stored the image data.
  • In order to get this to work, I did need to add the image crate and enable feature flags for all of the specific file types I wanted to render.
  • If image data isn't present for a given URL, I added logic to download it from the remote address.
  • Finally, the document is rendered with genpdf's render() method and written to a PDF file on disk.
        for image_url in &post.images {
            doc.push(elements::Break::new(1));
            match image_data.get(image_url) {
                Some(bytes) => {
                    let decoded = image::load_from_memory(bytes)?;
                    if decoded.color().has_alpha() {
                        continue; // images with an alpha channel aren't supported by genpdf
                    }
                    let image = genpdf::elements::Image::from_dynamic_image(decoded)?
                        .with_alignment(genpdf::Alignment::Center)
                        .with_scale(genpdf::Scale::new(3, 3));

                    doc.push(image);
                }
                None => {
                    bar.set_message(format!("Downloading image for: {}", post.title));
                    match download_image(image_url) {
                        Ok(bytes) => {
                            let decoded = image::load_from_memory(&bytes)?;
                            if decoded.color().has_alpha() {
                                continue; // images with an alpha channel aren't supported by genpdf
                            }
                            let image = genpdf::elements::Image::from_dynamic_image(decoded)?
                                .with_alignment(genpdf::Alignment::Center)
                                .with_scale(genpdf::Scale::new(3, 3));

                            doc.push(image);
                        }
                        Err(e) => {
                            eprintln!("Failed to download image: {}", e);
                        }
                    }
                }
            }
        }

        bar.inc(1);
    }

    bar.finish_with_message("PDF generation complete!");
    println!("Rendering and saving PDF, this may take a few minutes...");
    let output_file = File::create("Blog.pdf")?;
    doc.render(&mut std::io::BufWriter::new(output_file))?;
    println!("All done!");

    Ok(())
}

                        

Final Thoughts

Rust is such an enjoyable language to work with, and its enthusiastic community and wealth of crates make it my go-to for any project idea.



View the project source code on GitHub

Code used to get images as HashMap
