I have been archiving content from a Blogger website and was curious how feasible it would be to create a PDF document of the entire blog's content. I've been running a scraper that saves each blog post as JSON, and have previously made this content available in an iOS app. There are several PDF crates to choose from, and deserializing the JSON into strings and writing them to a PDF document was fairly straightforward. Including images in the PDF was a bit trickier.
Initially I used printpdf but found it too low level: it required a lot of tweaking and customization, such as calculating the available page space based on font size, weight, and so on. genpdf is built on top of printpdf (and rusttype) and provides a higher-level API for quickly generating PDFs. With the help of serde_json I had nearly everything necessary to get started.
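To give a sense of how little ceremony genpdf needs, here is a minimal sketch of the same flow the full program below uses: load a font family, push an element, render to a file. (This assumes the OpenSans .ttf files live in a ./fonts directory; the hello.pdf name is just for illustration.)
use genpdf::{elements, Alignment};
fn minimal_pdf() -> Result<(), Box<dyn std::error::Error>> {
// load the font family from disk and create a document that uses it
let font_family = genpdf::fonts::from_files("./fonts", "OpenSans", None)?;
let mut doc = genpdf::Document::new(font_family);
doc.set_title("Blog archive");
// push elements in order; genpdf handles layout and pagination
doc.push(elements::Paragraph::new("Hello from genpdf").aligned(Alignment::Center));
doc.render_to_file("hello.pdf")?;
Ok(())
}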
// shared by both binaries
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Post {
title: String,
content: String,
date: String,
images: Vec<String>,
}
// binary used to get image data as bytes
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read, Write};
use std::sync::Mutex;

use indicatif::{ProgressBar, ProgressStyle};
use rayon::prelude::*;
use reqwest::blocking;

fn main() -> Result<(), Box<dyn std::error::Error>> {
let posts: Vec<Post> = serde_json::from_reader(File::open("backup.json")?)?;
let image_data: Mutex<HashMap<String, Vec<u8>>> = Mutex::new(HashMap::new());
let bar = ProgressBar::new(posts.len() as u64);
bar.set_style(
ProgressStyle::default_bar()
.template(
"{spinner:.green} [{elapsed_precise}] [{bar:40.cyan/blue}] {pos}/{len} {msg}",
)?
.progress_chars("#>-"),
);
posts.par_iter().for_each(|post| {
bar.set_message(format!("Downloading image for: {}", post.title));
for image_url in &post.images {
match download_image(image_url) {
Ok(data) => {
let mut data_lock = image_data.lock().unwrap();
data_lock.insert(image_url.clone(), data);
}
Err(e) => {
eprintln!("Failed to download {}: {}", image_url, e);
}
}
}
bar.inc(1);
});
bar.finish_with_message("Image downloads complete!");
// Serialize and save image data using bincode
save_image_data(&image_data.lock().unwrap(), "images.bin")?;
Ok(())
}
fn download_image(url: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
let response = blocking::get(url)?;
Ok(response.bytes()?.to_vec())
}
fn save_image_data(image_data: &HashMap<String, Vec<u8>>, path: &str) -> io::Result<()> {
let encoded: Vec<u8> = bincode::serialize(image_data).expect("Serialization failed");
let mut file = File::create(path)?;
file.write_all(&encoded)?;
Ok(())
}
fn load_image_data(path: &str) -> io::Result<HashMap<String, Vec<u8>>> {
let mut file = File::open(path)?;
let mut buffer = Vec::new();
file.read_to_end(&mut buffer)?;
let decoded: HashMap<String, Vec<u8>> =
bincode::deserialize(&buffer).expect("Deserialization failed");
Ok(decoded)
}
// binary focused on PDF generation
// (Post, download_image and load_image_data are shared with the download binary)
use std::collections::HashMap;
use std::fs::File;

use chrono::NaiveDate;
use genpdf::{elements, style, Alignment, Element as _};
use indicatif::{ProgressBar, ProgressStyle};
use regex::Regex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
let image_data: HashMap<String, Vec<u8>> = load_image_data("images.bin")?;
let mut posts: Vec<Post> = serde_json::from_reader(File::open("backup.json")?)?;
posts.sort_by_key(|post| NaiveDate::parse_from_str(&post.date, "%A %d %B %Y").ok());
let newlines_re = Regex::new(r"\s*\n\s*").unwrap();
let font_family =
genpdf::fonts::from_files("./fonts", "OpenSans", None).expect("Failed to load font family");
let mut decorator = genpdf::SimplePageDecorator::new();
decorator.set_margins(10);
let mut doc = genpdf::Document::new(font_family);
doc.set_paper_size(genpdf::PaperSize::A4);
doc.set_page_decorator(decorator);
let bar = ProgressBar::new(posts.len() as u64);
bar.set_style(
ProgressStyle::default_bar()
.template(
"{spinner:.green} [{elapsed_precise}] [{bar:40.cyan/blue}] {pos}/{len} {msg}",
)?
.progress_chars("#>-"),
);
for post in &posts {
bar.set_message(format!("Processing post: {}", post.title));
doc.push(elements::PageBreak::new());
let title_style = style::Style::new().with_font_size(24).bold();
let title_element = elements::Paragraph::new(post.title.replace("&amp;", "&"))
.aligned(Alignment::Left)
.styled(title_style);
doc.push(title_element);
let date_style = style::Style::new().with_font_size(12).bold();
let date_element = elements::Paragraph::new(post.date.clone())
.aligned(Alignment::Left)
.styled(date_style);
doc.push(date_element);
doc.push(elements::Break::new(1));
let cleaned_content = newlines_re.replace_all(&post.content, "\n");
let content_lines = cleaned_content.lines();
for line in content_lines {
let content_element = elements::Paragraph::new(line.replace("&amp;", "&"));
doc.push(content_element);
}
// The image crate's load_from_memory() expects a slice of bytes, which is
// perfect for how I've saved image data; the decoded image is then handed to
// genpdf's Image::from_dynamic_image(). At the end, File::create() plus
// doc.render() write everything out as a single PDF file on disk.
for image_url in &post.images {
doc.push(elements::Break::new(1));
match image_data.get(image_url) {
Some(bytes) => {
let decoded = image::load_from_memory(bytes)?;
if decoded.color().has_alpha() {
continue; // images with alpha channel not supported
}
let image = genpdf::elements::Image::from_dynamic_image(decoded)?
.with_alignment(genpdf::Alignment::Center)
.with_scale(genpdf::Scale::new(3, 3));
doc.push(image);
}
}
None => {
bar.set_message(format!("Downloading image for: {}", post.title));
match download_image(image_url) {
Ok(image_data) => {
let loaded_image = image::load_from_memory(&image_data)?;
if loaded_image.color().has_alpha() {
continue; // images with alpha channel not supported
}
let image = genpdf::elements::Image::from_dynamic_image(loaded_image)?
.with_alignment(genpdf::Alignment::Center)
.with_scale(genpdf::Scale::new(3, 3));
doc.push(image);
}
Err(e) => {
eprintln!("Failed to download image: {}", e);
}
}
}
}
}
bar.inc(1);
}
bar.finish_with_message("PDF generation complete!");
println!("Rendering and saving PDF, this may take a few minutes...");
let output_file = File::create("Blog.pdf")?;
doc.render(&mut std::io::BufWriter::new(output_file))?;
println!("All done!");
Ok(())
}
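One limitation of the code above is that images with an alpha channel are skipped entirely. A possible workaround (not something this project does) would be to flatten RGBA images to plain RGB with the image crate before handing them to genpdf. A rough sketch, assuming a reasonably recent version of the image crate; the helper name is mine:
fn pdf_image_from_bytes(bytes: &[u8]) -> Result<genpdf::elements::Image, Box<dyn std::error::Error>> {
let decoded = image::load_from_memory(bytes)?;
// to_rgb8() re-encodes the pixels as 8-bit RGB; transparency is simply
// discarded rather than composited onto a background color
let flattened = if decoded.color().has_alpha() {
image::DynamicImage::ImageRgb8(decoded.to_rgb8())
} else {
decoded
};
Ok(genpdf::elements::Image::from_dynamic_image(flattened)?)
}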
Rust is such an enjoyable language to work with, and its enthusiastic community and wealth of available crates make it my go-to for any project idea.