Dynamically Extracting a Chinese Font Subset with Python

Preface

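I recently came across two open-source fonts: JetBrains Mono and jf open 粉圓.

The former is built specifically for programmers and renders certain symbol sequences as ligatures, which makes it well suited to reading code.

The latter supports Traditional Chinese, and its rounded shapes echo those of the former, so the two look great together.

I had never paid much attention to the blog's Chinese font, because Chinese font files are usually large and make for a sluggish reading experience.

Then it struck me: why not extract from the Chinese font only the characters the posts actually use, and compress them into a woff file for the blog?

After subsetting, the 7.4 MB font shrinks to under 1 MB. Combined with browser caching, reading feels reasonably smooth, and I am happy with the overall result.

Because this blog is a static site, every character it displays is known when the site is built, so the approach applies. The implementation that follows uses Python.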

Implementation

Installing the package

pip install fonttools

tc_woff.py

First, import the required modules.

import glob, re
from fontTools import subset

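Next, glob.iglob() scans the blog's generated static pages and yields the path of every html file.

The ** in the path pattern matches any number of nested subdirectories, and recursive=True is what enables that recursive matching; without it, ** behaves like a plain *.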
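To make the effect of ** with recursive=True concrete, here is a minimal sketch that runs against a throwaway directory tree rather than the real blog. The helper names find_html_files and demo, and the file names inside the temporary tree, are made up for illustration.

```python
import glob
import os
import tempfile


def find_html_files(root):
    """Return every .html file under root, at any depth, as sorted relative paths."""
    pattern = os.path.join(root, "**", "*.html")
    # With recursive=True, "**" matches zero or more directory levels.
    paths = glob.iglob(pattern, recursive=True)
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in paths)


def demo():
    # Build a throwaway tree: one page at the top level, one nested page,
    # and one non-HTML file that should be ignored.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "posts", "2020"))
        for rel in ("index.html", "posts/2020/font.html", "posts/style.css"):
            with open(os.path.join(root, *rel.split("/")), "w", encoding="utf-8") as f:
                f.write("")
        return find_html_files(root)


print(demo())
```

Both the top-level page and the nested page are found, because ** also matches zero directory levels; the stylesheet is skipped by the *.html part of the pattern.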

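Each html file is read in turn; re.findall() with a regular expression extracts every character outside the Latin-1 range (in practice, the Chinese characters and full-width punctuation), and set() removes the duplicates.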
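The behaviour of the pattern [^\x00-\xff] is easy to check on a short string. The sketch here is only an illustration: the names NON_LATIN1 and non_latin1_chars and the sample markup are assumptions, not part of the original script.

```python
import re

# Matches any single character outside U+0000-U+00FF (Latin-1).
NON_LATIN1 = re.compile(r"[^\x00-\xff]")


def non_latin1_chars(text):
    """Return the distinct characters of text that lie outside Latin-1."""
    return set(NON_LATIN1.findall(text))


# Hypothetical sample: ASCII tags, an accented Latin-1 letter, repeated
# Chinese characters, and full-width punctuation.
sample = "<p>Café 字型，字型！</p>"
print(sorted(non_latin1_chars(sample)))
```

The tags and "Café" are ignored because every one of those characters, including é (U+00E9), falls inside Latin-1. Note that the pattern is broader than "Chinese only": anything above U+00FF, such as curly quotation marks, is also collected.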

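Each call returns the de-duplicated characters as a list, which is appended to html_chinese; this repeats until every html file has been read.

At that point html_chinese still contains duplicates, since the same character can appear in more than one file. The list can be passed straight to set(): only the elements need to be hashable, not the list itself, so no conversion to a tuple is required.

After this second round of de-duplication, join() turns the result into a string for later use.

def find_chinese(text):
    # Collect every character outside the Latin-1 range (U+0000-U+00FF).
    chinese = re.findall(r"[^\x00-\xff]", text)
    # Drop duplicates within this one file.
    return list(set(chinese))

html_path = "/Blog/yuripe-murmur/public/**/*.html"
html_all = glob.iglob(html_path, recursive=True)

html_chinese = []

for html in html_all:
    with open(html, "r", encoding="utf-8") as file:
        html_chinese += find_chinese(file.read())

# Drop duplicates across files, then join into a single string.
html_chinese = "".join(set(html_chinese))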

print(html_chinese)
print(len(html_chinese))
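The two rounds of de-duplication, per file and then across files, can be tried on in-memory strings without touching the file system. This is a minimal sketch under stated assumptions: unique_chars and the sample pages are hypothetical, and sorted() is an addition used only so that the result is reproducible (a set has no guaranteed order).

```python
def unique_chars(pages):
    """Merge the characters of several pages into one de-duplicated string."""
    collected = []
    for page in pages:
        # Per-page de-duplication, mirroring what find_chinese() returns.
        collected += list(set(page))
    # set() accepts the list as-is; only its elements must be hashable.
    # sorted() is not in the original script; it just fixes the order.
    return "".join(sorted(set(collected)))


# Hypothetical page contents: 型 appears on both pages.
pages = ["字型字型", "型錄"]
result = unique_chars(pages)
print(result, len(result))
```

Three distinct characters remain even though 型 occurs on both pages, which is exactly the effect the second set() call has in the script.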

Finally, the subset module extracts the glyphs for those characters from the specified font file and saves the result as a compressed woff file.

# Write the result in WOFF format.
options = subset.Options()
options.flavor = "woff"

# Load the source font and keep only the glyphs needed for the collected text.
font = subset.load_font("/TC_woff/jf-openhuninn.ttf", options)
subsetter = subset.Subsetter(options)
subsetter.populate(text=html_chinese)
subsetter.subset(font)

assets_path = "/Blog/yuripe-murmur/themes/hugo-theme-hello-friend/static/assets/jf-openhuninn.woff"
subset.save_font(font, assets_path, options)
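One optional follow-up, not part of the original script, is to confirm that the generated file really covers every collected character. The sketch assumes the code points can be read with TTFont.getBestCmap() from fontTools; the helper names missing_characters and font_codepoints are hypothetical, and the demonstration at the bottom uses a hand-written set of code points instead of a real font file.

```python
def missing_characters(text, codepoints):
    """Return the characters of text whose code point is absent from codepoints."""
    return sorted({ch for ch in text if ord(ch) not in codepoints})


def font_codepoints(font_path):
    """Read the code points a font file maps to glyphs (requires fonttools)."""
    # Imported lazily so missing_characters() can be used on its own.
    from fontTools.ttLib import TTFont

    # getBestCmap() returns a {code point: glyph name} mapping, or None
    # when the font has no Unicode cmap table.
    return set(TTFont(font_path).getBestCmap() or {})


# Stand-in for font_codepoints(<path to the woff>): pretend the font
# only covers these two characters.
covered = {ord("字"), ord("型")}
print(missing_characters("字型檔", covered))
```

In real use, calling missing_characters(html_chinese, font_codepoints(assets_path)) would be expected to return an empty list when every character exists in the source font; any characters it does return are ones the source font does not contain.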