I have some good PDF ebooks I’m willing to share, but I suspect the seller embeds tracking data in them to link them to my account: every time I download them from the official website they have a different hash while being visually identical. The same goes when comparing them against copies a friend bought from the same seller. Since I don’t want to get banned, can you recommend a way to remove that stuff?

    • 0x4E4F@vlemmy.net · 1 year ago

      qpdf is very powerful. If OP is comfortable with the terminal, I’d recommend it.
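
      Something like this would be a starting point, assuming qpdf is installed; it rewrites the whole file, so only objects that are still referenced get copied over (it won’t touch watermark text that sits in the page content, though):

      # write a decrypted, linearized copy; filenames are just examples
      qpdf --decrypt --linearize tracked.pdf rewritten.pdf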

  • Shizu@lemmy.world · 1 year ago

    I would try reprinting the PDFs and comparing the hashes afterwards. That should remove any metadata in the headers as new headers are created.
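
    If you’d rather not click through a print dialog for every book, re-rendering with Ghostscript does roughly the same thing as reprinting; a sketch, assuming gs is available and with made-up filenames:

    # re-render the PDF through the pdfwrite device, producing a fresh file with new headers
    gs -q -o reprinted.pdf -sDEVICE=pdfwrite original.pdf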

    • bionicjoey@lemmy.ca · 1 year ago

      That wouldn’t work for something like Pathfinder PDFs from the Paizo website. They add a text watermark with the name and email associated with your account on their site to each page of the document. It’s not metadata, it’s actual page data.

      • Shizu@lemmy.world · 1 year ago

        Why would the checksum differ between downloads if the watermark only contained user-identifiable data?

        • bionicjoey@lemmy.ca · 1 year ago

          Just checked one of my Paizo PDFs, and in addition to my account name and email address, the watermark also contains the date and time I downloaded the PDF. Presumably they append the file creation time when the PDF is signed, which is why the hash changes.

          • Shizu@lemmy.world · 1 year ago

            Fair, then reprinting won’t help. I’d go ahead and come up with a Python script that exports all pages as PNG, edits that specific portion of every image, and recompiles them into a PDF. I’m not sure if there is a tool that can already do that out of the box.
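
            The same export-edit-recompile idea can also be sketched with off-the-shelf command-line tools (pdftoppm, ImageMagick’s mogrify, img2pdf) instead of a hand-rolled Python script. Rough sketch only: the rectangle coordinates are made up and would have to match wherever the watermark actually sits.

            # rasterize every page to PNG at 200 dpi
            pdftoppm -r 200 -png watermarked.pdf page
            for f in page-*.png; do
                # paint a white box over the (hypothetical) watermark area
                mogrify -fill white -draw "rectangle 100,1560 1100,1600" "$f"
            done
            # recompile the edited images into a new PDF
            img2pdf page-*.png -o rebuilt.pdf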

            • bionicjoey@lemmy.ca · 1 year ago

              Unfortunately then you lose things like text and links. I think the only real solution for my specific example (which, to be clear, might not be OP’s dilemma) is to crack the PDF and directly edit its binary data.
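
              One way to get at that directly is qpdf’s QDF mode, which rewrites the file into an uncompressed, mostly-text form you can edit, with fix-qdf repairing the cross-reference table afterwards. A sketch; the sed pattern is only a placeholder, and the watermark text may well be split up or encoded differently inside the content streams:

              # convert to editable QDF form with object streams disabled
              qpdf --qdf --object-streams=disable watermarked.pdf editable.pdf
              # placeholder: blank out the watermark string wherever it appears literally
              sed -i 's/name@example\.com//g' editable.pdf
              # fix object lengths and the xref table after hand-editing
              fix-qdf editable.pdf > rebuilt.pdf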

  • gh0stkey@lemmy.world · 1 year ago

    Wow… The amount of information already being shared here is outstanding! Keep on rowing/patching, mates!

  • arr@lemmy.dbzer0.com · 1 year ago

    You can of course remove the metadata, but you can’t really be sure you removed all watermarks hidden in the actual content, unless you can make two downloads from different sources have the same hash with whatever method you use. That way you’d know for certain that you caught whatever was inserted to identify you. Anything other than metadata will be very hard to find and remove in an automated way unless you already know exactly what you’re looking for, though.

    That said, this is how I’ve cleaned up metadata in batches of PDF files using qpdf and exiftool in the past:

    for file in *.pdf; do
        # strip every metadata tag exiftool knows about, editing the file in place
        exiftool -all:all= -overwrite_original "$file"
        # rewrite the PDF in place; linearizing drops objects that are no longer referenced
        qpdf --linearize --replace-input "$file"
    done
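
    To check it actually worked in the sense above, you could run the same loop over your copy and over a copy downloaded from another account, then compare the hashes (filenames are just examples):

    sha256sum my-copy.pdf friends-copy.pdf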
    
  • thumbman@lemm.ee · 1 year ago

    Okay, hear me out… physically print the documents, then use a high-resolution scanner to make a digital copy, and finally run it through a raster-to-vector converter.

    I know this is probably dumb, but I just wanted to throw this out there.

    • 0x4E4F@vlemmy.net · 1 year ago

      Why not just print it to PDF? It doesn’t lose any data, plus it doesn’t take ages to scan the books.
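
      On the command line, pdftocairo from poppler-utils does essentially the same re-rendering as a print-to-PDF dialog; a minimal sketch with example filenames:

      # re-render the PDF page by page into a new PDF
      pdftocairo -pdf original.pdf reprinted.pdf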

  • bbbhltz@beehaw.org · 1 year ago

    Exiftool can remove metadata. There might even be websites that can handle this.