Maven @ IPFS
Lately I have been toying with IPFS to achieve content sharing without centralized infrastructure. In other words, instead of free-riding on some (centralized) infrastructure that may be a public good, or on some commercial offering, I wanted to solve publishing (and also “owning” and “hosting” the data) by myself. This kinda reminded me of the early 2000s, when many ran Maven repositories in their own basements, making access to those repositories fragile, with longevity, accessibility and uptime totally unpredictable. That was before Central, which consolidated things but also promoted itself into a single point of failure. And hence, it ended up over- and misused, see Maven Central and the Tragedy of the Commons.
Hence, I wanted a plan B for our Maveniverse organization. What if we – aside from continuing to publish to Central – offered other ways to get artifacts to those interested in them, let them keep the artifacts in whatever way they want (not possible with Central, as mirroring is still not an option), and gave them full access to the artifacts (ie indexing, scanning, whatever)?
Enter IPFS
I will not spend a lot of time explaining IPFS, as it already has nice documentation available. If nothing else, at least read this page.
In short, IPFS is a decentralized network of IPFS nodes implementing Content Addressable Storage (CAS) and more. A term worth knowing is CID, which in very simplified form can be understood as a “hash” (derived solely from the content), pointing to some content (any content). The content behind a CID can be one of multiple things (but once set, it is immutable): it can point to the contents of a JAR file, or in fact any file, but it can also be a DAG, backed by the IPLD schema, representing a file system. This means that a CID may even point to a “file hierarchy”, like on Linux systems, with directories and files and everything. These structures are colloquially called “Merkle trees”, and are built “bottom up”, from the leaves (files) toward the root. Hence, in case of a change (ie another file added to a directory), the existing file CIDs remain unchanged, but because their shared parent directory changed, its CID and the root CID will change. Producing a CID is free for everyone: just run an IPFS node and upload some content to it, and you will get a CID for it. An important thing to consider is that if two persons independently upload some content and both end up with the same CID (with some fine print, more about that later), it means they both independently published bit-by-bit the same content (IPFS is content addressable).
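To make the “bottom up” construction a bit more tangible, here is a minimal sketch with the Kubo CLI (the directory and file names are made up for illustration; the CIDs printed on your machine will of course differ):

```
# create a tiny "repository-like" directory with one file
mkdir -p demo/org/example
echo "hello" > demo/org/example/artifact-1.0.jar

# add it recursively; Kubo prints a CID per file/directory, the root CID last
ipfs add -r demo

# add another file: the existing file keeps its CID,
# but the parent directory CID and the root CID change
echo "world" > demo/org/example/artifact-1.1.jar
ipfs add -r demo
```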
Having the CID is like having the “address”, but where is the content behind the address? For a start, it is in the node you uploaded the content to in order to get the CID. With “pinning”, you can make your node pull down the content backing any CID (assuming it is reachable). Basically, you maintain your node, letting it stick to the content you want/need, and merely cache the rest (IPFS node storage performs regular garbage collections, dropping unreachable or stale content). Furthermore, there are (paid or free) services for “pinning content”; by using those services, you can make sure content is “pinned” and swiftly served to any node wanting it. But this is out of scope for this article.
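For reference, pinning and garbage collection are plain node operations; a minimal sketch with the Kubo CLI (the CID is a placeholder):

```
# keep the content behind a CID around, pulling it from the network if needed
ipfs pin add <CID>

# list what the node currently pins
ipfs pin ls --type=recursive

# garbage-collect everything that is neither pinned nor otherwise kept
ipfs repo gc
```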
Another term is IPNS, which is like a “mutable CID” (maybe think of it like DNS). Creating an IPNS entry requires a private cryptographic key, and each key can produce one IPNS entry. At the same time, one node (or user) can manage as many keys as they want. And this is the important thing: as I explained above, anyone can create a CID, but if a consumer asks “which CID did you publish?”, all they have to do is resolve the IPNS entry to get the right CID, as the IPNS entry is a function of the private key and cannot be faked. Neither CID nor IPNS is “human friendly”, kinda, but there are solutions to that, like DNSLink, where an IPNS entry can be exposed via DNS and a “human friendly” domain.
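A minimal sketch of that flow with the Kubo CLI (the key name is made up; CIDs, key IDs and the domain are placeholders):

```
# generate a keypair; the printed key ID is also the IPNS name
ipfs key gen forge-key

# publish a CID under that IPNS name (re-run whenever the root CID changes)
ipfs name publish --key=forge-key /ipfs/<rootCID>

# consumers resolve the IPNS name back to the currently published CID
ipfs name resolve /ipns/<key-id>

# DNSLink: a TXT record on _dnslink.<your-domain> pointing at the IPNS name, ie
#   _dnslink.example.org.  IN  TXT  "dnslink=/ipns/<key-id>"
```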
I have to note several key things to reassure IPFS users:
- IPFS works in a similar fashion to torrents: it uses a DHT and various other means to let nodes discover each other. In short, the more nodes the merrier. In other words, “popular” content may be cached on multiple nodes, and hence will be faster to fetch.
- Each IPFS node participates in traffic direction (ie passing messages) among the others, telling about discovered nodes and offering local content if asked for.
- An important thing to note is that if you run your own node, it will store only the content you tell it to store (by locally pushing or pinning); it will NOT store random content from the internet.
- There is pretty much nothing needed on your side (network setup or alike) to get your content published via IPFS; see the quick sketch below.
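Indeed, getting a node up takes about two commands; a minimal sketch, assuming Kubo is installed:

```
# initialize the local repository and node identity, then start the daemon
ipfs init
ipfs daemon

# in another terminal: check that the node has discovered peers
ipfs swarm peers
```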
Rounding it up
So what does this give us?
- A CID points to content/structure and is immutable (the same CID will always return the same content, if the content is accessible)
- An IPNS entry points to the “up to date” CID, and you can be sure the entry was published by the key owner and nobody else
- DNS points to IPNS, and again, assuming you trust the domain owner, you can delegate your trust to the IPNS entry it points to. This trust delegation is very similar to Central, where you need to provide proof to get a publishing namespace (which is ideally a reverse domain).
In short, we have a series of indirections: domain -> IPNS -> CID. If you get to the CID by hopping over these stops, all is fine. But what happens if your private key (used for publishing IPNS) is compromised? Just create a new key, republish the content with it, and update the DNSLink for your domain (and of course, communicate it). After all, we can still GPG sign artifacts, so IPNS + GPG is good enough.
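A sketch of that recovery with the Kubo CLI (the key name is made up; the CID, key ID and domain are placeholders):

```
# create a fresh key and republish the same content under it
ipfs key gen forge-key-2
ipfs name publish --key=forge-key-2 /ipfs/<rootCID>

# then point the DNSLink TXT record at the new IPNS name, ie
#   _dnslink.example.org.  IN  TXT  "dnslink=/ipns/<new-key-id>"
```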
An example of this setup can be seen at ipfs.maveniverse.eu.
Details:
- the `ipfs.maveniverse.eu` domain uses DNSLink to publish a TXT record with the IPNS entry (try `dig _dnslink.ipfs.maveniverse.eu TXT`)
- the IPNS entry is in the form of `/ipns/xxxxxxxxxx`
- the IPFS node can resolve the `/ipns/xxxxx` address to the `/ipfs/xxxxx` CID.
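To follow the whole chain by hand, something like this works (only the commands are shown; the actual values are whatever the domain publishes at the time):

```
# 1) DNS: fetch the DNSLink TXT record, it contains "dnslink=/ipns/<key-id>"
dig +short TXT _dnslink.ipfs.maveniverse.eu

# 2) IPNS: resolve that IPNS name to the current root CID
ipfs name resolve /ipns/<key-id>

# 3) IPFS: browse the repository tree behind the root CID
ipfs ls /ipfs/<rootCID>
```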
Maven @ IPFS
Maven release repositories seem like perfect candidates to be put onto IPFS: they are immutable. Or to be more precise, the leaves (artifacts) in a repository will remain immutable, and when deploying more, only the parent and the parent's parent directories change, so in essence the root CID changes. Aside from parents, the G and A level metadata changes as well, but those are not leaves.
The Maveniverse IPFS extension provides support for the setup above, and is even usable on CI to consume (and later to deploy).
The extension requires Java 11+ and a reachable Kubo RPC API (simplest is to have it running on localhost; a quick reachability check is shown after the list below), and adds the following components to Maven:
- an IPFS transporter supporting `ipfs:/` URLs
- IPFS publishing support via a lifecycle participant
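A quick way to check the Kubo RPC API precondition (Kubo listens on 127.0.0.1:5001 by default and expects POST requests):

```
# should return a small JSON document with the Kubo version
curl -X POST http://127.0.0.1:5001/api/v0/version
```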
The IPFS URL looks like `ipfs:/name[/subpath]`, where the parts are:
- the protocol `ipfs:/` is fixed and must be present
- for consuming, the `name` element should be resolvable; it can be a CID, an IPNS name, or a DNSLink-ed domain (see the check below)
- for deploying, the `name` element, aside from the above, should be the name of a private key present in the IPFS node, used to publish the IPNS entry
- the optional `/subpath` defines the path prefix within `name`
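Whether a given name is resolvable can be checked up front against the node itself (CID and key ID are placeholders; the domain is the one shown earlier in this post):

```
# a raw CID resolves to itself
ipfs resolve /ipfs/<CID>

# an IPNS name resolves to the CID currently published under it
ipfs resolve /ipns/<key-id>

# a DNSLink-ed domain resolves via DNS first, then via IPNS/IPFS
ipfs resolve /ipns/ipfs.maveniverse.eu
```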
I have to mention that when using a CID for name, it is the user's responsibility to ensure the proper CID is used, since, as explained above, CIDs can be created by anyone, and one may contain fake or even malicious artifacts. When using an IPNS record, it is a similar story: the user has to ensure they resolve the proper IPNS entry (but if trust is established, all is good). Finally, when using an (IPFS resolvable) domain, the same level of trust can be established as in the case of Central: one can safely assume that the domain owner publishes the right thing (same as on Central).
A little bit of a digression here: in case name is a domain, I was tinkering with limiting Maven @ IPFS to fetch artifacts only from the domain's namespace; for example, maveniverse.eu should offer only the eu.maveniverse namespace. Any ideas welcome!
Important: the current workflow Maven @ IPFS implements works at small scale, ie at the Maveniverse forge level, which has a handful of megabytes of artifacts. Due to refresh/pinning (downloading the whole blob), the current workflow does not scale, but it works pretty nicely at small scale.
Mimir @ IPFS
As mentioned above, in repositories published with Maven @ IPFS, the leaves will not change. That means that their CIDs remain unchanged. The next level would be IPFS global caching, for example with Mimir (which already offers a similar service on LANs using JGroups). Here, some translation needs to be done, one that begins with a GAV and ends with a CID.
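Since a published repository is just an IPLD directory tree, one possible shape of that translation (a sketch only, not necessarily how Mimir will do it; the root CID and coordinates are placeholders) is to resolve the GAV's usual repository path against the root CID:

```
# GAV org.example:app:1.0 maps to the usual repository layout path,
# which the node can resolve to the leaf CID of that artifact
ipfs resolve /ipfs/<rootCID>/org/example/app/1.0/app-1.0.jar
```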
Final thoughts
I am not planning to pull Maven publishing back to the “stone age” (the 2000s, with many little repositories coming and going), but IPFS definitely has merit for several use cases. Some of them, in no particular order, are:
- publishing snapshots across (probably totally disconnected) CI jobs
- any ad-hoc kind of publishing without infrastructure
- small scale forges
But I personally think that (global) caching w/ Mimir may be very interesting.
Once there is something more in place, I will report back! Cheers and have fun!