Self-host Reddit – 2.38B posts, works offline, yours forever

19-84@lemmy.dbzer0.com · 2 months ago

Self-host Reddit – 2.38B posts, works offline, yours forever

Appoxo@lemmy.dbzer0.com · 2 months ago

People will do anything to use Reddit instead of just letting go.

communism@lemmy.ml · 2 months ago

This is just an archive. No different from using the wayback machine or any other archive of web content.

Appoxo@lemmy.dbzer0.com · 2 months ago

You still use Reddit in some capacity.

Or would you deny watching a movie just because you watched it on your local Jellyfin folder instead of watching it on Netflix or the cinema?

ᴍᴜᴛɪʟᴀᴛɪᴏɴᴡᴀᴠᴇ@lemmy.dbzer0.com · 2 months ago

There is a ton of useful info on Reddit. I don’t use it anymore either but I’ll be downloading this project.

Appoxo@lemmy.dbzer0.com · 2 months ago

I never said I am not using it.
But that feels like it’s a compromise to keep using it as native as possible.
If it was just for research purposes, accessing archive.org would suffice.

ᴍᴜᴛɪʟᴀᴛɪᴏɴᴡᴀᴠᴇ@lemmy.dbzer0.com · 2 months ago

I think the idea here is to have it offline in the event of further fascist control of the internet. There is really so much useful information on there on a wide variety of topics. I don’t care about backing up memes and bot drivel.

19-84@lemmy.dbzer0.com · 2 months ago

that was exactly the idea, thanks for understanding…

also reddit’s ban on vpn

also reddit’s mandatory id verification

and the list goes on…

pixeltree@lemmy.blahaj.zone · 2 months ago

“Stop talking to my clone, I specifically requested you never contact me again”

It’s an archive of reddit, not reddit

inspxtr@lemmy.world · 2 months ago

Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit to one over the other?

frongt@lemmy.zip · 2 months ago

And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

19-84@lemmy.dbzer0.com · 2 months ago

Yes! Too many comments to count in a reasonable amount of time!

douglasg14b@lemmy.world · 2 months ago

Yeah, it should inflate to 15TB or more I think

muusemuuse@sh.itjust.works · 2 months ago

If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

Actually, isn’t there a way to decentralize this so it can be accessed from regular browsers on the internet? Live content here, archive everywhere.

psycotica0@lemmy.ca · 2 months ago

Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest “decentralized hosting” method that remains browsable

breakingcups@lemmy.world · 2 months ago

Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

19-84@lemmy.dbzer0.com · 2 months ago

Yes I used AI, English is not my first language. Thank you for the kind words!

mustlane@lemmy.zip · 2 months ago (edited)

Removed by mod

idealism_nearby@lemmy.world · 2 months ago

Would love to see you learn an entire foreign language just so you are able to communicate with the world without being laughed at by people as hostile as yourself.

fartographer@lemmy.world · 2 months ago

I can’t even learn my own language!

potustheplant@feddit.nl · 2 months ago

They said it wasn’t their “first” language. Which leads me to believe that they do speak English. If that’s the case, then they indeed are kind of lazy. There have already been studies on the impact of AI when used for communication and the results are not positive.

This isn’t something I’d personally point out and criticize, just something I wouldn’t do personally. Take the time to express your own ideas in your own words. The long term cost is higher than the short term gains.

rumba@lemmy.zip · 2 months ago

Hey I drove to the library, picked up all these things you needed, got dinner here ya go, free!

You drove? man that’s lazy…

He used AI to clean up translation and save time after he spent a fuck ton of time curating and delivering us a helpful product. Calling him out as lazy is an awful take.

potustheplant@feddit.nl · 2 months ago

First, that’s an awful analogy.

Second, you’re assuming (for some unknown reason) that they “cleaned up” the “translation” using ai. You have literally no idea exactly how they wrote the post. It’s kinda weird to make up a random scenario but ok.

Third, no, it’s not an awful take. You can code something that requires a ton of effort but write awful documentation. One thing does not make the other impossible.

Fourth, I already explained that there have already been studies that concluded that using AI to write stuff for you has a negative impact on your communication skills. This is not an opinion or me being ungrateful or whatever. I was just sharing information.

rumba@lemmy.zip · 2 months ago

If that documentation was awful, I’d REALLY like to see your take on NixOS :)

19-84@lemmy.dbzer0.com · 2 months ago

there are the so-called activists that complain a lot, then there are the activists that deliver projects and code… enough said

potustheplant@feddit.nl · 2 months ago

“Activists”? What are you even talking about?

Regardless, I specifically said that what you did wasn’t wrong or anything like that. I simply think that it’s going to do you more harm than good in the long run. You’re free to do whatever you want though, obviously.

Another piece of advice. When someone simply shares an opinion, don’t get instantly butthurt over nothing. Otherwise this might as well be reddit.

lad@programming.dev · 2 months ago

I have A1 and A2 level in a couple of non-first languages, technically I can speak those, realistically I don’t and will not be able to communicate something more complex than ‘here, take a look’

So I don’t agree with your absolutistic stance

potustheplant@feddit.nl · 2 months ago

There’s nothing “absolutistic” about my “stance”. If you’re rusty using a language, you won’t get better if someone else does the homework for you. Make an effort, make mistakes, write in a way that sounds weird, who cares. But practice. If you only take the easy way out, that’ll be your only option in the future.

Although, like I already said, that’s MY way of thinking about it. If you want to use ai to write your stuff, you do you. It doesn’t negate the fact that, while it’s not “wrong”, it’s the lazy (or minimum effort) option. Don’t know why it bothers you so much.

MadMonkey@lemmy.world · 2 months ago

Bruh, you do not seem like a nice person to be around.

Spread love and kindness, not hate.

I hope you have a better rest of your day.

Leah@piefed.blahaj.zone · 2 months ago

Shut the fuck up loser.

irmadlad@lemmy.world · 2 months ago

Yu mussi bawn backacow

Melvin_Ferd@lemmy.world · 2 months ago

You’re awesome. AI is fun and there’s nothing wrong with using it, especially how you did. Lemmy was hit hard with AI hate propaganda. China probably trying to stop its growth and development in other countries or some stupid shit like that. But you’re good. Fuck them

rumba@lemmy.zip · 2 months ago

Yup, if there was ever a decent use for AI, this is it. Lemmy can (and will) hate the shit out of it, but it took a little burden off the shoulders of someone doing us a great service.

Melvin_Ferd@lemmy.world · 2 months ago

I fucking hate lemmy sometimes.

SteveCC@lemmy.world · 2 months ago

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

19-84@lemmy.dbzer0.com · 2 months ago

thank you!!! i built on great ideas from others! i cant take all the credit 😋

Howlinghowler110th@kbin.earth · 2 months ago

I think this is a good use case for AI and I’m impressed with it. I wish the instructions were clearer on how to set it up, though.

19-84@lemmy.dbzer0.com · 2 months ago

thank you! the instructions are a little overwhelming, check out the quickstart if you haven’t yet! https://github.com/19-84/redd-archiver/blob/main/QUICKSTART.md

vane@lemmy.world · 2 months ago

How long does it take to download this 3TB torrent?

19-84@lemmy.dbzer0.com · 2 months ago

week(s)
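
For a rough sense of where "week(s)" comes from, here is a back-of-the-envelope sketch. It assumes the 3.28 TB figure quoted earlier in the thread, decimal terabytes, and a sustained download rate, which torrents rarely hold; the function name `download_days` and the 50 Mbit/s rate are made up for illustration.

```python
def download_days(size_tb: float, mbit_per_s: float) -> float:
    """Estimate days needed to download size_tb terabytes at a steady rate."""
    size_bits = size_tb * 1e12 * 8        # decimal TB -> bytes -> bits
    seconds = size_bits / (mbit_per_s * 1e6)
    return seconds / 86_400               # seconds per day


# Hypothetical example: 3.28 TB over a steady 50 Mbit/s connection.
days = download_days(3.28, 50)
print(f"{days:.1f} days")
```

At that assumed rate the estimate lands at roughly six days, so slower or less steady connections push it into weeks.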

vane@lemmy.world · 2 months ago

Thank you for the answer. I think I’ll do this one instead: https://academictorrents.com/details/30dee5f0406da7a353aff6a8caa2d54fd01f2ca1 Looks like it’s divided by year-month.

19-84@lemmy.dbzer0.com · 2 months ago

those are not split by subreddit so they will not work with the tool
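
To illustrate what "split by subreddit" means here, a minimal sketch of regrouping month-based records follows. It assumes the dump is newline-delimited JSON with a `subreddit` field on each record; the helper `split_by_subreddit` and the sample rows are hypothetical and are not part of redd-archiver. Decompression and writing per-subreddit files are left out.

```python
import json
from collections import defaultdict


def split_by_subreddit(lines):
    """Group newline-delimited JSON records by their 'subreddit' field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without a subreddit go into a catch-all bucket.
        groups[record.get("subreddit", "_unknown")].append(record)
    return dict(groups)


# Made-up sample rows standing in for one month of a dump.
sample = [
    '{"id": "a1", "subreddit": "selfhosted", "title": "Backups"}',
    '{"id": "b2", "subreddit": "linux", "title": "Kernel"}',
    '{"id": "c3", "subreddit": "selfhosted", "title": "Docker"}',
]
by_sub = split_by_subreddit(sample)
print(sorted(by_sub))
```

A real multi-terabyte dump would need streaming output per subreddit rather than holding everything in memory as this sketch does.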

Butterphinger@lemmy.zip · 2 months ago

grabs external

19-84@lemmy.dbzer0.com · 2 months ago

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

Bazell@lemmy.zip · 2 months ago

We can’t share this on Reddit, but we can share this on other platforms. Basically, what you have done is scrape tons of data for AI learning. Something like “create your own AI Redditor”. And greedy Reddit management will dislike it very much even if you tell them that this is for cultural heritage. Your work is great anyway. Sadly, I do not have enough free space to load and store all this data.

El Barto@lemmy.world · 2 months ago

Anyone doing this will be banned on that platform.

Avid Amoeba@lemmy.ca · 2 months ago

How does this compare to redarc? It seems to be similar.

19-84@lemmy.dbzer0.com · 2 months ago

redarc uses reactjs to serve the web app, redd-archiver uses a hybrid architecture that combines static page generation with postgres search via flask. it is more like a hybrid static site generator with web app capabilities through docker and flask. the static pages with sorted indexes can be viewed offline and served on hosts like github and codeberg pages.
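
The hybrid idea described above can be sketched roughly as follows. This is not redd-archiver's code: it uses only the Python standard library, with an in-memory SQLite database standing in for PostgreSQL and no Flask layer, and the names `render_index`, `build_search_db`, and `search` are hypothetical.

```python
import html
import sqlite3

# Made-up sample posts for illustration.
posts = [
    {"id": "a1", "title": "Backups <3", "score": 12},
    {"id": "b2", "title": "Docker tips", "score": 40},
]


def render_index(posts):
    """Render a static HTML index sorted by score, highest first."""
    rows = sorted(posts, key=lambda p: p["score"], reverse=True)
    items = "\n".join(
        f'<li><a href="{html.escape(p["id"])}.html">'
        f'{html.escape(p["title"])}</a> ({p["score"]})</li>'
        for p in rows
    )
    return f"<ul>\n{items}\n</ul>"


def build_search_db(posts):
    """Load posts into a database (SQLite here as a stand-in for Postgres)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, score INTEGER)")
    db.executemany("INSERT INTO posts VALUES (:id, :title, :score)", posts)
    return db


def search(db, term):
    """Return ids of posts whose title contains term, best score first."""
    cur = db.execute(
        "SELECT id FROM posts WHERE title LIKE ? ORDER BY score DESC",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]


index_html = render_index(posts)   # can be written to disk and browsed offline
db = build_search_db(posts)        # only needed when search is wanted
print(search(db, "docker"))
```

The static index works with no server at all, which is why it can sit on a static host, while the database-backed search is the part that needs a running service.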

Avid Amoeba@lemmy.ca · 2 months ago

Is there a difference in how much storage space is needed between the two approaches?

19-84@lemmy.dbzer0.com · 2 months ago

redd-archiver will take up more disk space because the database exists along with the static html

😈MedicPig🐷BabySaver😈@lemmy.world · 2 months ago

Fuck Reddit and Fuck Spez.

muusemuuse@sh.itjust.works · 2 months ago

You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

El Barto@lemmy.world · 2 months ago

Where would it be hosted so that Conde Nast lawyers can’t touch it?

muusemuuse@sh.itjust.works · 2 months ago

What would they say? It’s information that’s freely available, no payment required, no accounts to simply read it, no copyrights, where’s the legal issue in hosting a duplicate of the content?

El Barto@lemmy.world · 2 months ago

Oh I agree with you, friend. The problem is that they’ll say that they’re losing ad revenue. So they’ll try and sue, even if they’re in the wrong.

muusemuuse@sh.itjust.works · 2 months ago

Fine, decentralize it then. And fuck your ad revenue, nobody likes you, Spez!

limelight79@lemmy.world · 2 months ago

It might fall under the same concept that recipes do - you can’t copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

In any case, they have a lot more money to pay lawyers than you or I do, I’ll bet, so even if you are right, that doesn’t mean you’ll have the money to actually win.

muusemuuse@sh.itjust.works · 2 months ago

So distribute it in a fault-tolerant way. They can’t sue all of us.

Clbull@lemmy.world · 2 months ago

Eww, Voat and Ruqqus.

19-84@lemmy.dbzer0.com · 2 months ago

i will always take more data sources, including lemmy!

polarity_inverter@startrek.website · 2 months ago

… for building your personal Grok?

19-84@lemmy.dbzer0.com · 2 months ago

if you didn’t notice, this project was released into the public domain

I Cast Fist@programming.dev · 2 months ago

What’s the size difference when you remove the porn stuff from the torrent?

Spice Hoarder@lemmy.zip · 2 months ago

Willing to bet a 90% size reduction

mustlane@lemmy.zip · 2 months ago (edited)

Removed by mod

irmadlad@lemmy.world · 2 months ago

Maybe read where OP says ‘Yes I used AI, English is not my first language.’ Furthermore, are ethnic slurs really necessary here?

Cybersteel@lemmy.world · 2 months ago

Then he’s no better than Reddit who also uses AI no?

El Barto@lemmy.world · 2 months ago

I disagree. I don’t like AI slop. But he’s using AI here in a way that is very much intended. I want to share something in Mandarin, I don’t know Mandarin. If only there was a way to transform my thoughts into Mandarin…

pixeltree@lemmy.blahaj.zone · 2 months ago

Using ai to help normal everyday people cross language barriers is one of the few good ethical uses for it. I hate ai and its implications as much as the next gal but this is clearly fine

irmadlad@lemmy.world · 2 months ago

How many languages do you know fluently? I get that people have a definite opinion about AI. Like I told another Lemmy user, I have a definite opinion about the ‘arr’ stack which conservatively, 75% of selfhosters run. However, you don’t hear me out here beating my tin pan at the very mention of the ‘arr’ stack. Why? Because I assume you all are autonomous adults, capable of making your own decisions. Secondly, wouldn’t that get a bit tedious and annoying over time? If you don’t like AI, don’t use it ffs. Why castigate individuals who use AI? What does that do? I would really like to know what denigrating and browbeating users who use AI accomplishes.

euAppleHater@feddit.org · 2 months ago

Wait, do you have an issue with piracy in general or an issue with the arr stack specifically? No judgement or interest in argument, just genuinely curious. Feel free to dm if you don’t want to start a whole thing, or beat your tin pan as you said, in an unrelated post.

irmadlad@lemmy.world · 2 months ago

Wait, do you have an issue with piracy in general

I don’t mind stating here: Piracy in general. I don’t condemn those who do because, as I’ve said, you are autonomous adults capable of making your own decisions. You know the risks and you take steps to mitigate those risks. You and I have both heard all the pros and cons and all the supporting arguments of both sides. Now, I know there are lots of people who rip and catalog their own DVDs, CDs, etc. All fine and dandy.

The comparison was that every time AI is used here in this comm, or even suspected of use, people have a conniption and start piling on. Like moths to a flame. What does that accomplish? Nothing. It seems to just make those who are anti-AI feel superior, is about all I can get from it. To me, it’s just a tool. I’ll grant you it’s a tool that needs some heavy regulation, even as much as I chafe against regulation. It is necessary. AI isn’t going away. It’s not a fad. It’s here to stay. If using AI makes your blood boil, fine. Don’t. Although I foresee a time where you’ll use AI and not even know it.

Opinions are great too. I, like others, have a long list of them. Stating your opinions is fine too. It seems here tho, opinions turn into castigation and denigration, which is in direct violation of ‘Rule 1: Be civil: we’re here to support and learn from one another.’ State your opinion on AI: ‘I’d rather guide my pops into my mum before I’d use AI’. Then move on. Personally, I don’t state my opinion on the arr stack, because it would accomplish nothing and in the long run become tedious and obnoxious.

As far as the arr stack as software, I’ve never deployed it, but it is pretty darn amazing from what I’ve read. The dev teams that have put it all together have some knowledge to say the least. It’s just not my bag.

euAppleHater@feddit.org · 2 months ago

Ahk I see, thanks for the explanation. I assumed it was a general issue with piracy, but was wondering if maybe I had missed something negative about the software specifically or the contributors behind it or something.

Tiger@sh.itjust.works · 2 months ago

What is the timing of the dataset, up through which date in time?

19-84@lemmy.dbzer0.com · 2 months ago

2005-06 to 2024-12

however the data from 2025-12 has been released already, it just needs to be split and reprocessed for 2025 by watchful1. once that happens then you can host archive up till end of 2025. i will probably add support for importing data from the arctic shift dumps instead so that archives can be updated monthly.

Tiger@sh.itjust.works · 2 months ago

Thank you very much, very cool.

douglasg14b@lemmy.world · 2 months ago

It literally says in the link. Go to the link and it’s the title.

Tiger@sh.itjust.works · 2 months ago

Oh I didn’t see it. I’m sorry I asked.

GitHub - 19-84/redd-archiver: A PostgreSQL-backed archive generator that creates browsable HTML archives from link aggregator platforms including Reddit, Voat, and Ruqqus.