078384a723
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.19.0 to 0.23.0. - [Commits](https://github.com/golang/net/compare/v0.19.0...v0.23.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> |
||
---|---|---|
client | ||
files | ||
test | ||
.gitignore | ||
go.mod | ||
go.sum | ||
justfile | ||
LICENSE | ||
main.go | ||
README.md |
Mastodon markdown archive
Fetch a Mastodon account's posts and save them as text files using Mastodon's statuses API.
This program essentially wraps the Mastodon API with a command line interface with some additional features.
Features
- Supports all parameters in Mastodon's statuses API
- Convert post to markdown
- Customize output file location, name, and extension
- Customize output format and front matter
- Optionally download of post media
- Optionally threading of posts
- Optionally filter based on post visibility
- Optional affordances for scripting
- Optionally persist fetched post id cursors
- Optionally set authorization token to fetch private posts
I use this tool to create an archive of my Mastodon posts and syndicate them to my own site, per IndieWeb's PESOS philosophy. Read more about this on my site.
This repository is mirrored in Codeberg.
It is likely that I have not considered all possible use cases, or that I've been opinionated in ways that are not as generalizable as I thought. If there's something that looks to be broken or missing, do let me know! For issues, questions, or requests, please contact me via email or on Mastodon.
Table of contents
Installation
Go is required for installation.
You can clone this repo and run go build main.go
in the repository's directory, or you can run go install git.garrido.io/gabriel/mastodon-markdown-archive@latest
to install a binary of the latest version.
Dependencies
This tool has only two direct dependencies, which are included to provide useful, though largely optional, functionality in templates:
The default template makes use of html-to-markdown
to transform the post's HTML content to markdown.
Usage
Usage of mastodon-markdown-archive:
-dist string
Path to directory where files will be written (default "./posts")
-download-media string
Path where post attachments will be downloaded. Omit to skip downloading attachments.
-exclude-reblogs
Mastodon API parameter: Filter out boosts from the response
-exclude-replies
Mastodon API parameter: Filter out statuses in reply to a different account
-filename string
Template for post filename
-limit int
Mastodon API parameter: Maximum number of results to return. Defaults to 20 statuses. Max 40 statuses (default 40)
-max-id string
Mastodon API parameter: All results returned will be lesser than this ID. In effect, sets an upper bound on results.
-min-id string
Mastodon API parameter: Returns results immediately newer than this ID. In effect, sets a cursor at this ID and paginates forward.
-only-media
Mastodon API parameter: Filter out status without attachments
-persist-first string
Location to persist the post id of the first post returned
-persist-last string
Location to persist the post id of the last post returned
-pinned
Mastodon API parameter: Filter for pinned statuses only
-porcelain
Prints the amount of fetched posts to stdout in a parsable manner
-since-id string
Mastodon API parameter: All results returned will be greater than this ID. In effect, sets a lower bound on results.
-tagged string
Mastodon API parameter: Filter for statuses using a specific hashtag
-template string
Template to use for post rendering, if passed
-threaded
Thread replies for a post in a single file
-user string
URL of Mastodon account whose toots will be fetched
-visibility string
Filter out posts whose visibility does not match the passed visibility value
The only required flags for this program to work is dist
and user
. All other flags are there for Mastodon's API parameters, or to support more complex use cases. See the examples section.
Environment variables
If the MASTODON_AUTH_TOKEN
environment variable is set then this program will set the Authorization
header for the statuses and statuses context API requests. This token only needs the read:statuses
permission.
In the context of the statuses request, this allows you to fetch private statuses that only you can normally see. For example, if a post's visibility is set to "Followers only".
In the context of the status context request for orphaned posts, this allows you to fetch private statuses and surpass the limited amount of ancestors and descendants.
Examples
I use this tool programatically, and I do not want to recreate the archive from scratch each time. I thread posts, exclude replies to others, exclude reblogs, and filter out any post that is not public.
I first used this to generate an archive of all the posts that I had published to date. Then, I run it programatically to archive any new posts made.
Mastodon imposes a maximum limit of 40 posts in this API. With --persist-first
and --persist-last
I can save cursors of the upper and lower bound of posts that were fetched. I then use the API's max-id
, min-id
, and since-id
parameters to get the posts that I need, depending on each case.
Generating an entire archive
mastodon-markdown-archive \
--user=https://social.coop/@ggpsv \
--dist=./posts \
--exclude-replies \
--exclude-reblogs \
--persist-last=./last \
--visibility=public \
--download-media=bundle \
--threaded=true \
--max-id=$(test -f ./last && cat ./last || echo "")
Calling this for the first time will fetch the most recent 40 posts. With --persist-last./last
, the oldest fetched post id will be saved at ./last
. Caling this command again will set the last
cursor to the oldest post of the next 40 posts, and so on.
You can use a simple bash script to automate this process. Adding the --porcelain
flag prints the amount of fetched posts to stdout, which can then be used to continue or stop fetching posts:
#!/bin/bash
while true; do
command="mastodon-markdown-archive --dist=./example \
--exclude-replies=true \
--exclude-reblogs=true \
--user=https://social.coop/@ggpsv \
--porcelain=true \
--visibility=public \
--download-media=bundle \
--threaded=true \
--persist-last=./last \
--max-id=$(test -f ./last && cat ./last || echo '')"
output=$($command)
if [[ "$output" -eq 0 ]]; then
echo "No posts returned. Exiting"
break
fi
echo "Fetched $output posts. Continuing."
sleep 1
done
Getting the latest posts
Having created the entire archive, I now want to run this on a schedule to retrieve only the latest posts.
With --persist-first=./first
, the most recent post id will be saved at ./first
.
Calling this command iteratively will only fetch posts that have been made since then.
mastodon-markdown-archive \
--user=https://social.coop/@ggpsv \
--dist=./posts \
--exclude-replies=true \
--exclude-reblogs=true \
--visibility=public \
--download-media=bundle \
--threaded=true \
--persist-first=./first \
--since-id=$(test -f ./first && cat ./first || echo "")
Threading
By default, posts by the author in reply to another post by the author will be written out as separate files.
Alternatively, posts can be threaded together using the --threaded=true
flag. With threading, the descendants of a post will not be written out as a separate files. Instead, only the top post will be written out.
The program will aggregate the post's descendants in reverse chronological order and make them available in the template via the Descendants method. This can be used in templates to render threaded posts as a single post, which the default template does.
When threading, the AllMedia
and AllTags
methods will yield the aggregated MediaAttachment and Tag, respectively.
When the --visibility
flag is used, only the top post's visibility is evaluated. This is done explicitly to support the common practice in Mastodon of setting threaded replies as unlisted
.
Orphaned posts
Mastodon limits their statuses API to a maximum 40 posts at a time, and the --limit
flag can be used to limit this further.
Because of this limit, it is possible that posts in a thread end up split across different responses. Or, a user may maintain a long-lived thread of posts that gets updated sporadically and thus rarely will a single batch of posts have all the descendants of the post.
An orphaned post is a post whose parent is not within a batch of posts returned by a single API call.
In either case, the program will fallback to using the status context endpoint to rebuild the corresponding thread from the top.
Templating
The contents of the file and the filename for each post can be customized using templates. This provides enough flexibility to use this tool for various purposes. The templates are evaluated as Go text templates, so it should be possible to do anything that's normally supported in a Go template.
For example, if you're using this to syndicate posts to a site built using a static site generator, you can customize the output so that it adheres to specific requirements around front matter structure or filename formats.
Post
Out of the box, this tool uses the post.tmpl template to create the post file. It converts the post content to markdown, threads replies, and defines some attributes in the front matter using YAML.
For example, this post is converted to this markdown file:
---
date: 2024-04-24 12:40:10.029 +0000 UTC
post_uri: https://social.coop/users/ggpsv/statuses/112326240503555949
post_id: 112326240503555949
tags:
- FrameworkLaptop
- fedora
---
Back at dual-booting on the [#FrameworkLaptop](https://social.coop/tags/FrameworkLaptop). Last time it was Ubuntu, but now I have gone with [#Fedora](https://social.coop/tags/Fedora) 40 KDE.
I'm impressed with how things just work with this laptop. Major props to the [@frameworkcomputer](https://fosstodon.org/@frameworkcomputer) team for supporting these distros out of the box.
I simply decrypted my drive, shrunk it, created a partition, booted off a USB key, installed Fedora, encrypted both partitions, and that's it.
Also, KDE Plasma 6 looks incredibly crisp on this screen.
A different template can be used by passing its path to --template
. The template must comply with Go template syntax.
For example, a jekyll.tmpl
template with customized front matter :
---
layout: post
title: {{ substr 0 5 .Post.Id }}
published: true
---
{{ .Post.Content | toMarkdown }}
Passed to the command as --template=./jekyll.tmpl
will instead yield a file that looks like this:
---
layout: post
title: 11232
published: true
---
Back at dual-booting on the [#FrameworkLaptop](https://social.coop/tags/FrameworkLaptop). Last time it was Ubuntu, but now I have gone with [#Fedora](https://social.coop/tags/Fedora) 40 KDE.
I'm impressed with how things just work with this laptop. Major props to the [@frameworkcomputer](https://fosstodon.org/@frameworkcomputer) team for supporting these distros out of the box.
I simply decrypted my drive, shrunk it, created a partition, booted off a USB key, installed Fedora, encrypted both partitions, and that's it.
Also, KDE Plasma 6 looks incredibly crisp on this screen.
You might even want to use HTML as the output and thus pass a --template=./html.tmpl
flag for a html.tmpl
template that looks like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>{{ .Post.Id }}</title>
</head>
<body>
{{.Post.Content}}
</body>
</html>
Filename
Out of the box, this tool uses the post's id and the .md
extension for the filename. For example, this post is saved 112326240503555949.md
A different format for the filename can be used by passing a template string to --filename
. The string must comply with Go template syntax.
For example, to create post files that are prefixed with the post's creation date in YYYY-MM-DD
format and suffixed with the post id, pass --filename='{{.Post.CreatedAt | date "2006-01-02"}}-{{.Post.Id}}.md
.
An extension in the filename template will be used if present. Otherwise, .md
is used as the default file extension.
Following the HTML example in the post template section above, you may customize the filename as --filename='{{.Post.Id}}.html'
to use HTML as the output file extension.
Available functions and variables
For both the post and filename templates, the following functions and variables are available:
Functions
- Standard Go template functions
- All Sprig functions
toMarkdown
to convert the post's HTML content to Markdown, without escaping any markdown syntaxtoMarkdownEscaped
to convert the post's HTML content to Markdown, escaping any markdown syntax
Sprig is particularly useful for arbitrary customization, such as string manipulation.
For example, let's assume we want to convert the casing of the tags from Mastodon. The default template passes the tags as-is, but we want them in kebab-case. You would need to create a custom template, and use the kebabcase
function where the tags are rendered:
---
date: {{ .Post.CreatedAt }}
tags:
{{- range .Post.AllTags }}
- {{ .Name | kebabcase }}
{{- end }}
---
You would keep this file somewhere, and pass its path to the --template
argument when invoking the tool.
Variables
Template examples
Here are some examples for basic templates that can be used. For an example on threading replies, see the default template.
For any of these, save the template file somewhere and pass its path to the command as a value of the --template
flag.
For the filename, pass it as a string to the command as a value of the --filename
flag.
Jekyll
Template:
---
layout: post
title: {{ .Post.Id }}
---
{{ .Post.Content | toMarkdown }}
Filename:
{{.Post.CreatedAt | date "2006-01-02"}}-{{.Post.Id}}.md
Hugo and 11ty
The default template and filename is built for Hugo as that's the static site generator that I use, but a minimum viable template that works for either can look like this:
---
title: {{ .Post.Id }}
date: {{ .Post.CreatedAt | date "2006-01-02" }}
---
{{ .Post.Content | toMarkdown }}
Filename:
{{.Post.CreatedAt | date "2006-01-02"}}-{{.Post.Id}}.md
HTML
Template:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>{{ .Post.Id }}</title>
</head>
<body>
{{.Post.Content}}
</body>
</html>
Filename: {{ .Post.Id }}.html
Text only
Template:
{{ .Post.Content }}
Filename: {{ .Post.Id }}.txt
Post media
By default, a post's media is not downloaded. Use the --download-media
flag with a path to download a post's media. The post's original file is downloaded, and the image's id is used as the filename.
For example, --download-media=./images
saves any media to the ./images
.
Once downloaded, the media's path is available in MediaAttachment.Path as an absolute path.
Sprig's path functions can be used in the templates to manipulate the path as necessary. For example, the default template uses osBase
to get the last element of the filepath.
Bundling
You can use --download-media=bundle
to save the post media in a single directory with its original post. In this case, the post's filename will be used as the directory name and the post filename will be index.{extension}
.
For example, --download-media="bundle" --filename='{{ .Post.CreatedAt | date "2006-01-02" }}-{{.Post.Id}}.md'
will create a YYYY-MM-DD-<post id>/
directory, with the post saved as YYYY-MM-DD-<post id>/index.md
and media saved as YYYY-MM-DD-<post id>/<media id>.<media ext>
.
This is done specifically to support Hugo page bundles.
Known issues
- A reply post may still appear in the list of posts despite using
--exclude-replies
. This happens when the post in question is a reply to a post that has since been deleted. It looks like Mastodon's API stops treating the reply as a reply. It no longer points to another post, and thus is not affected by theexclude_replies
parameter.