• Knock_Knock_Lemmy_In@lemmy.world · 5 days ago

    Andrew Ng did a video where he gradually added noise to the training audio to improve the quality (rough sketch of the idea at the end of this comment).

    But here we are dealing with homophones, so it’s not just turning speech to text; it also needs to be context aware.

    Possible but too expensive to implement automatically.
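
    Roughly the idea, as I remember it (my own sketch with numpy, not code from the video; the schedule and SNR numbers are made up): mix random noise into each training clip and ramp the noise level up as training goes on.

    ```python
    import numpy as np

    def add_noise(clean, snr_db):
        """Mix white noise into a 1-D waveform at a given signal-to-noise ratio (dB)."""
        clean = np.asarray(clean, dtype=float)
        noise = np.random.randn(len(clean))
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        # scale the noise so the mix lands at the requested SNR
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    def augment(batch, epoch, max_epochs, start_snr=30.0, end_snr=5.0):
        """Lower the SNR (i.e. add more noise) as training progresses."""
        snr = start_snr + (end_snr - start_snr) * (epoch / max_epochs)
        return [add_noise(clip, snr) for clip in batch]
    ```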

      • Knock_Knock_Lemmy_In@lemmy.world · 5 days ago

        I’m highlighting that speech to text and context awareness are different skills.

        YouTube is unlikely to waste loads of compute power on subtitles that don’t need it just to capture the occasional edge case.

        • lime!@feddit.nu · 5 days ago

          i mean, it’s a one-time-per-video thing. they already do tons of processing on every upload.

            • lime!@feddit.nu · 4 days ago

              right now they’re dynamically generating subtitles every time. that’s way more compute.

              • aow@sh.itjust.works · 4 days ago

                For real? That’s incredibly dumb/expensive compared to one subtitle roll. Can you share where you saw that?

                • lime!@feddit.nu · 4 days ago (edited)

                  well, i have no evidence of this. however: looking at the way auto-generated subtitles are served on youtube right now, they are sent word-by-word from the server, pick up filler words like “uh”, and sometimes pause for several seconds in the middle of sentences. and they’re not sent over a websocket, which means they go through multiple requests over the course of a video.

                  more requests means the server works harder, because it can’t just stream the text the way it does the video. and the only reason they’d do that, other than incompetence (which would surely have been corrected by now, it’s been like this for years), is if the web backend has to wait for the next word to be generated.
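
                  to be concrete about what i mean by “more requests vs streaming”, here’s the shape of it in python (completely made-up endpoint, just an illustration, not youtube’s actual api):

                  ```python
                  import requests  # third-party `requests` library

                  BASE = "https://example.invalid/captions"  # hypothetical endpoint

                  def poll_words(video_id, n_chunks):
                      # one full http round trip per chunk of words: headers, routing,
                      # a handler invocation on the server, repeated n_chunks times
                      words = []
                      for i in range(n_chunks):
                          r = requests.get(BASE, params={"v": video_id, "chunk": i})
                          words.extend(r.json()["words"])
                      return words

                  def stream_words(video_id):
                      # one request; the server writes words into the response body
                      # as it produces them, the same way the video itself is streamed
                      params = {"v": video_id, "stream": 1}
                      with requests.get(BASE, params=params, stream=True) as r:
                          for line in r.iter_lines():
                              if line:
                                  yield line.decode()
                  ```

                  if the backend really is waiting on the model for the next word, the polling shape is the one you’d end up with.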

                  i would love to actually know what’s going on if anyone has any insight.