Google Books (or similar) all book scans – $200k bounty (2025)

(software.annas-archive.gl)

177 points | by Cider9986 3 hours ago

17 comments

ahmedfromtunis 1 hour ago
I live in a country where the selection of available books, especially in English, is very limited. Buying online from foreign markets comes with a long list of administrative hurdles and limits.
If it were not for Anna's Archive and Z-Library, I would've never been able to read the books that shaped who I am today, or keep my passion for learning alive.
Thanks, AA and ZLib! (Also, thank you to the authors whose books and knowledge I consumed without being able to pay them back.)
[-]
- jvm___ 56 minutes ago
  https://send.djazz.se/
  This is key for getting epubs to your Kobo.
  [-]
  - ahmedfromtunis 40 minutes ago
    Thanks, but I don't use e-readers as they are not available here.
    I've been using MoonReader for many years now and settled on pretty good parameters that make the reading experience very comfortable on both my phone and my tablet.
  - christofosho 51 minutes ago
    Calibre? https://calibre-ebook.com/
  - pull_my_finger 32 minutes ago
    I don't understand what this is doing. Can't you sideload any ebook onto a kobo anyway? Never had an issue on my Clara
  - andrepd 4 minutes ago
    Handy, but a book lover with an ereader probably already uses Calibre :)
dr_dshiv 1 hour ago
https://SourceLibrary.org has about 16,000 rare books translated, most for the first time. 50,000 books archived (will be translated when we have $$ for it). More tokens than English Wikipedia and about 0.75 petabytes.
Not sure if we will qualify for a bounty, but happy to share! Btw, we are looking for funding from small or large donors who want to help us translate the Renaissance…
[-]
- sgc 10 minutes ago
  Curious as to what your budget was to get where you are today? That's a lot of tokens. I presume you are using gemini flash?
- wrsh07 29 minutes ago
  Hey, this looks fascinating!
  I can't quickly tell what all you have archived^, but I have some friends who are academic historians who might be interested in certain categories of work (and could help verify some esoteric languages) - is it possible to search by region or language?
  Have you reached out to any types of historians WRT the project? It seems like some PhD students might be able to find some projects in this work etc
  ^ when I looked at the timeline https://sourcelibrary.org/timeline, I got an error
  [-]
  - dr_dshiv 1 minute ago
    Thanks! This is designed with historians and librarians from the Embassy of the Free Mind (embassyofthefreemind.com) — and open to all input.
    When you use it with Claude, it will be able to provide quotes linked to the original pages of the book. Very happy for research grounded in primary sources. Although, there are lots of secondary, tertiary sources too, up to the public domain cut off.
    Please share with historian friends. I’m not great at socials or fundraising but this was really designed to support humanists. It can give DOIs for the versions of the translated books, which means they can be quoted and cited in academic papers.
    Try it in Claude or Claude Code (even better)! Just point it towards the source library and ask if it can read it and find quotes about a topic you care about in history.
    Thanks for the feedback, I’ll fix the timeline.
trilogic 2 hours ago
Who is behind Anna's Archive? There are a lot of English speakers involved in the team and forums! Anyway, as long as buying isn't owning, no issues here.
[-]
- Cider9986 41 minutes ago
  I think Anna is behind it.
  https://redlib.catsarch.com/r/Annas_Archive/comments/1f6h74r...
  https://reddit.com/r/Annas_Archive/comments/1f6h74r/im_curio...
DeepYogurt 1 hour ago
Anyone afraid of being laid off at google right now? Perhaps this is a backup :)
[-]
- Cthulhu_ 1 hour ago
  I think if you get caught exfiltrating data they'll sue you for much more than $200K.
  [-]
  - imhoguy 54 minutes ago
    I don't think anybody would do it purely for money. I would rather see someone who is terminally ill and decides to do some "good".
    [-]
    - dlenski 33 minutes ago
      There are not too many mentally-sharp, fully-employed, terminally-ill people that I have met. Even fewer at tech companies.
      And even fewer who are single and childless. (Google would likely go after the estate of anyone who did this.)
      [-]
      - bitmasher9 5 minutes ago
        I wonder how hard they would press an estate. It’s bad PR to go after widows and surviving children, and the data has already escaped.
        This is something they’d want to settle quietly, so the family would have leverage.
      - imhoguy 4 minutes ago
        But just one would be enough, especially in a large organization. Surely they would need access to the exact data too.
  - merpkz 1 hour ago
    Copy data into extra large capacity micro sdcard and hide it in your rubiks cube, nobody will suspect a thing
    [-]
    - diab0lic 49 minutes ago
      It’s the “Copy data into extra large capacity micro sdcard” step that gets you caught. Nobody is stopping you from leaving with an SD card or USB stick at Google.
    - takipsizad 50 minutes ago
      I wish an extra-capacity SD card were enough; Google Books holds (probably) an insane number of books.
  - the_real_cher 1 hour ago
    If your money is in private crypto or offshore you have nothing to worry about.
    [-]
    - zuzululu 50 minutes ago
      I'd strongly caution anybody foolish enough to go down this path.
      Financial watchdogs and international treaties make it impossible unless you are perhaps a multibillionaire who can afford to buy people at the political level.
    - mock-possum 1 hour ago
      Except perhaps jail time.
      Lying about your assets to avoid paying a lawful fine is criminal. Just because they can’t see your money doesn’t mean they can’t prove that you have it, and can’t jail you for hiding it to get out of paying a fine.
      [-]
      - LastTrain 48 minutes ago
        So is stealing
      - LearnYouALisp 54 minutes ago
        Google, Amazon, and FB: It's not me, right
hedora 1 hour ago
I wonder how long it will be before they offer bounties for internet scrapes.
Cloudflare captchas have made the internet unusable for me, and I'm sure it will only get worse over time. I'd much rather just browse (or even torrent) a copy of archive.is or similar. The latter would be much better for privacy, and hey, I run ad blockers anyway.
[-]
- rvnx 1 hour ago
  https://x.com/CloudflareDev/status/2031488099725754821
  Well, there is this little conflict of interest
  [-]
  - aspect0545 1 hour ago
    https://xcancel.com/CloudflareDev/status/2031488099725754821
stephenlf 5 minutes ago
Anna’s archive rocks
bix6 2 hours ago
Piracy / copyright predictions?
The current situation feels untenable with renting. So many regular people I know have learned about VPN, NAS, etc.
[-]
- codemog 1 hour ago
  Hopefully the guillotines. Look up how much the authors and artists who create the actual work get paid.
  [-]
  - 0x3f 56 minutes ago
    Quite a few textbook authors I know are paid well to be part of the whole scheme (kickbacks, forced yearly repurchase for the 'online' component of books, etc). So I think it varies a lot.
- specproc 2 hours ago
  It was never sustainable, just regulatory capture by large IP owners.
  Spotify, Netflix, Amazon, etc. provided OK value for a while, but now that enshittification is biting, piracy is due a massive comeback.
hereme888 47 minutes ago
The link sort of reads like people who have very easy access to the requested material. Almost like they're Google employees.
neilv 1 hour ago
The US should just find a way to quietly share literature access with the Russians, rather than letting piracy be promoted and facilitated for US consumers as freedom-fighter "archiving".
Between all the piracy, and all the AI training and the purchase/visitor-circumventing AI services, the practice of writing and publishing genuinely good work is being wiped out.
We're killing the goose that lays the eggs, for selfish gain.
[-]
- TFNA 1 hour ago
  This ship has sailed for academic publications, and academics define that term very liberally because we want to read everything, fiction included. The shadow libraries started off as a way for scholars in ex-Soviet countries in particular (but also India, SE Asia, etc.) to access literature that simply wasn’t available in their country. But the shadow libraries proved so successful and convenient that academics in all countries are using them now, even if they have access to official subscription services. I use AA several times a day and so do the researchers around me in my office; at conferences, if the presenter mentions an interesting publication, the whole room immediately opens AA on their laptops, etc.
  Even if projects like AA didn’t have nation-level support, academics would find a way to keep as much of it as possible going. After all, we’re the ones who compiled the bulk of pre-2020 material, and we’re the ones who do all the hard work of scanning from our institutional libraries stuff that doesn’t exist anywhere in digital form.
- logicchains 32 minutes ago
  >the practice of writing and publishing genuinely good work is being wiped out.
  Most of the best literature in the English language was written before modern IP law was even a thing. There's very little good literature written by authors primarily motivated by money.
  [-]
  - Jtarii 7 minutes ago
    How much of that literature was written by wealthy landowners who already had little need for money?
  - boca_honey 17 minutes ago
    That's just cultural elitism. I hope you meet someone in your life who finds absolute joy in reading young adult romance novels or D&D fantasy books so you can understand how irrelevant "good" literature is. I love Dostoevsky and Verne (and D&D novels, especially those written by R.A. Salvatore), but I would never judge the modern "IPs" that got my daughter into reading.
    > best literature
    What does that even mean?
- mjburgess 1 hour ago
  Possibly, but this act of governmental self-harm is useful to The People. We live in a world where if your valuation is ~1T you can more or less just do what you like. And the work of The People is stolen from you and laundered.
  In such a world, isn't it useful that governments are stupid enough to give adversaries reasons to undermine it? When the government props up a corporate tyranny domestically, and racketeering, should we make a temporary alliance with all its enemies?
  (E.g., the provision to AI companies of all corporate secrets and competitive practices via prompts, eventually to be used against their capital interests and their labour interests).
  [-]
  - LearnYouALisp 51 minutes ago
    So when will the American people form an "Incorporation" to lobby against business for them?
- WarmWash 41 minutes ago
  >We're killing the goose that lays the eggs, for selfish gain
  We already did that when the internet collectively agreed decades ago that everything digital should be free for anyone.
  We're now 20 years downstream of ad-blocking being a virtuous good, and piracy being the ultimate show of liberty, and now suddenly everyone cares about the creator's revenue stream.
  The mask slipped and unsurprisingly the internet is a bunch of selfish morally stunted children. Some of them even pushing 50 years old.
  Yes, I am talking to you with the 4TB of pirated content, proud of not loading any ads in the last 15 years, and getting enraged over LLM training.
  [-]
  - lelanthran 13 minutes ago
    > Yes, I am talking to you with the 4TB of pirated content, proud of not loading any ads in the last 15 years, and getting enraged over LLM training.
    That's oddly-specific :-)
    In any case, I have no pirated content that I know of, and I'm neither proud nor ashamed of blocking ads[1], but I still get annoyed that a bunch of VCs can use their invested-into companies to launder all the world's IP, then sell it back to them.
    [1] Who feels proud of blocking ads? It's like feeling proud of tying your shoelaces: "Good job, well done, but that's the expectation, son".
anyaya1 20 minutes ago
Does Anna's Archive use a completely different "source repository" from LibGen?
[-]
- takipsizad 9 minutes ago
  Anna's Archive is practically a compilation from all sources (including LibGen, afaik).
delichon 1 hour ago
It seems like bounties for new sources of training data would be useful to the big model builders. I follow a guy who hoards vast quantities of old analog media of all kinds, a lot of it local. Bounties could be a way for him to cash in. But I'm not sure if it's an appreciating asset or if they'll find it anyway and it'll lose its value.
wxw 2 hours ago
Some more interesting bounties they offer: https://software.annas-archive.gl/AnnaArchivist/annas-archiv...
> Purchase all Library of Congress MARC datasets — $3,000 bounty
> English Wikipedia pages about relevant institutions — up to $100 per new page
> Internet Archive Digital Lending — $5000 per 1 million pdf files
> Text version of our full library — $20,000
...
[-]
- Cider9986 36 minutes ago
  Up to 500k for OPSEC failures is interesting, as well. It gives me hope that there are wealthy individuals contributing to sharing books, or many small donations.
  https://software.annas-archive.gl/AnnaArchivist/annas-archiv...
FerritMans 2 hours ago
So AA is a front for openai?
[-]
- flexagoon 36 minutes ago
  No, but they openly make a lot of money from selling their library to AI companies. Fast enterprise access to Anna's Archive starts at $100,000.
- 650REDHAIR 1 hour ago
  How did you come to that conclusion?
- awakeasleep 1 hour ago
  the bounty would be a bit higher with openAI money behind it

Curious as to how you would approach this. I have no experience in this area, anyone on this forum willing to share their expertise?

[-]

0x3f 1 hour ago

If it works as AA seems to theorize, you'd need to:

  (a) work out how Google Books exposes fragments of books, and see if there's a systematic way of using this to get whole books. For example, a naive approach might be to find any fragment of the book by searching some exact phrase. Then, you can search for an exact phrase from the start or end of the fragment it gave you, hoping it will show you the previous or next part of the book. You can then just loop that to get the whole book.

  (b) once you have (a), you need a way of bypassing Google's bot detection/rate limiting. I don't know what the current state of the art is, but there may be a solution for sale out there. E.g. you pay to receive a cookie or browser state, and use that to fetch the URLs from (a). Or if you're good/already in the scene, you could do this part yourself.
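
To make the looping idea in (a) concrete, here is a purely offline toy sketch of fragment chaining. Everything in it is an assumption for illustration: `snippet` is a hypothetical stand-in that "searches" a local string and returns a fixed-size window, and `reconstruct` keeps searching for the tail of what it has so far. It does not talk to any real service, says nothing about how Google Books actually behaves, and does not address (b) at all.

```python
from typing import Optional


def snippet(text: str, phrase: str, window: int) -> Optional[str]:
    """Hypothetical snippet search over a local string.

    Returns up to `window` characters starting at the first occurrence
    of `phrase`, or None if the phrase does not occur.
    """
    start = text.find(phrase)
    if start == -1:
        return None
    return text[start:start + window]


def reconstruct(text: str, seed: str, window: int = 20,
                overlap: int = 5, max_steps: int = 10_000) -> str:
    """Chain snippets forward from `seed` until nothing new comes back.

    Assumes the last `overlap` characters of the running result are
    unique in the text; if they repeat, the naive loop can wander,
    which is why `max_steps` caps the number of iterations.
    """
    first = snippet(text, seed, window)
    if first is None:
        return ""
    out = first
    for _ in range(max_steps):
        tail = out[-overlap:]
        nxt = snippet(text, tail, window)
        if nxt is None:
            break
        new_part = nxt[len(tail):]  # drop the overlap we already have
        if not new_part:            # the tail sits at the end of the text
            break
        out += new_part
    return out


if __name__ == "__main__":
    book = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    print(reconstruct(book, "abc") == book)
```

The uniqueness assumption is the weak point of the naive approach: common phrases repeat, so a real text would need longer overlaps or some disambiguation, and this sketch only walks forward from the seed.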

[-]

takipsizad 36 minutes ago
That approach will definitely work with the current access Google provides; however, it's an extremely inconvenient way to scrape Google Books.

ThrowawayTestr 2 hours ago
One of my hopes is that when the AI bubble bursts, some brave person will sneak out a copy of the last frontier model.
[-]
- Aboutplants 2 hours ago
  Not worried about that, you will only have to wait 3-6 months and get a Chinese model just as good.
  [-]
  - sulam 1 hour ago
    That’s misunderstanding why these models are behind. A large part of why they’re behind is they aren’t able to do the reinforcement learning post-training steps that take a pre-trained model and turn it into a frontier model like GPT 5 or Opus. Instead they do their best to recreate these models using distillation.
    Fundamentally, you can never distill your way to being the teacher, so these approaches will not advance the frontier.
    [edit, after thinking about it I think my phrasing is unfair. It's not necessarily that they aren't able to do it, but they haven't yet shown that they are willing to do it.]
    [-]
    - computerex 59 minutes ago
      That’s not remotely true. They did distillation as a cheap solution to the cold start problem. You need data/trajectories to hill climb to higher capabilities. All large Chinese labs do RLAIF.
      [-]
      - sulam 52 minutes ago
        Oh yes, not remotely true. Which is why the frontier labs all have invested heavily in trying to identify and thwart distillers, using known company names / domains to drive their exclusion lists.
        /s
        [-]
        - logicchains 29 minutes ago
          It's cheaper to distill than to do reinforcement learning, so of course they prefer that, but if it wasn't an option they could just pay up and spend more GPU time on RL.
    - FpUser 59 minutes ago
      >"they aren’t able to do the reinforcement learning post-training steps"
      Not yet.
      If there is a need, someone will come and fulfill it. Personally, I do not even want to use top models right now. Professionally, I use AI to help with coding via the Junie agent that comes with JetBrains IDEs. Junie is told to use Gemini Flash and works fine for what I ("I" being the emphasis here) ask it to do. I tried more advanced models and different vendors, only to discover credits going down the toilet without any extra benefit.
      [-]
      - sulam 52 minutes ago
        I'll agree I guess and clarify that the better phrasing is probably something like "haven't yet shown the capability to."
  - yorwba 2 hours ago
    Chinese companies giving away expensive models for free is a symptom of the AI bubble, too. It's not a law of nature that they'll always be able to scrounge up the money for yet another training run.
    [-]
    - jnwatson 30 minutes ago
      As long as it is in the CCP's national interest to have a frontier model, Chinese companies will have the resources for another training run.
    - gpm 2 hours ago
      Shaping the tool that does the thinking is quite valuable when you're in the business of changing how people think - I think we can expect propaganda agencies to be subsidizing model creation forever.
      This doesn't strike me as a symptom of a bubble, except insofar as the bubble pushes the competitors' models forward and thus they need to invest more to stay competitive.
      [-]
      - rvnx 1 hour ago
        All the models have to respect their local laws and, most of all, pressure from users and employees.
        They all carry political weight, because the humans behind them defend their interests and promote certain social values.
        https://pastebin.com/hjhvsBFg
        This answer from Claude is so biased that it is ridiculous
    - nextos 2 hours ago
      I think it's a deliberate business strategy of commoditization of their complement.
      China acts like an entire bloc, not as single companies, and they want to monetize hardware.
- fastball 1 hour ago
  If it's a bubble, why do you care about frontier models?
  [-]
  - FpUser 56 minutes ago
    The Internet was a bubble, and so was telecom etc. at some point. Being a bubble does not mean that when 90% of investments go down the drain the remains are not useful.
- thx67 1 hour ago
  Prediction markets can solve this.
- zuzululu 48 minutes ago
  which will be very difficult to run unless you have a large budget to operate your own mini datacenter
  [-]
  - lelanthran 9 minutes ago
    In a crash the hardware will go for pennies on the dollar, if not for fractions of pennies on the dollar.
    Lots of companies will pick them up for scrap metal prices and host them for fractions of what we are paying today.
    That's the nature of bubbles.