
Hi everyone, maybe I'm a bit late to this, but I wanted to share my findings. I parsed every page up to 40k in DS9 three times, and the results matched PeoplesElbow's findings in distribution (no content after page 14k and a lot of duplicates), BUT I collected 4 times as many unique URLs: 246_079 (still 2x short of the official size). A strange thing is that on the second pass (one day after the first one) I started receiving new URLs on old pages.
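For anyone who wants to check the "new URLs on old pages" effect on their own crawl, here is a minimal sketch of how the passes can be compared. It is not my actual scraper: the function name `new_urls_per_pass` and the example.com URLs are made up for illustration, and it assumes each pass has already been collected into a list of URL strings.

```python
def new_urls_per_pass(passes):
    """For each crawl pass (in order), report (unique URLs, URLs not seen in earlier passes)."""
    seen = set()
    report = []
    for urls in passes:
        unique = set(urls)        # drop duplicates inside one pass
        fresh = unique - seen     # URLs that no earlier pass returned
        report.append((len(unique), len(fresh)))
        seen |= unique
    return seen, report


# Hypothetical data: the second pass returns one URL the first pass never saw.
pass1 = [
    "https://example.com/a.pdf",
    "https://example.com/a.pdf",
    "https://example.com/b.pdf",
]
pass2 = ["https://example.com/a.pdf", "https://example.com/c.mp4"]

seen, report = new_urls_per_pass([pass1, pass2])
print(len(seen), report)
```

A non-zero second number for a later pass means that pass surfaced URLs the earlier ones did not.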
Here are the stats by file type:
 count  | file type
--------+-----------
      1 | ts
      8 | mov
    236 | mp4
 244326 | pdf
     73 | m4a
      1 | vob
      1 | docx
      1 | doc
      9 | m4v
   1422 | avi
      1 | wmv
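A breakdown like the one above can be reproduced from a plain list of URLs with something along these lines. This is a rough sketch, not the query I actually ran: `count_by_extension` and the sample URLs are hypothetical, and it assumes the file type can be read from the extension in the URL path.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def count_by_extension(urls):
    """Count unique URLs per lowercase file extension (query strings are ignored)."""
    counts = Counter()
    for url in set(urls):
        suffix = PurePosixPath(urlparse(url).path).suffix
        counts[suffix.lstrip(".").lower() or "(none)"] += 1
    return counts


# Hypothetical sample: one duplicate, one upper-case extension with a query string.
sample = [
    "https://example.com/x/DOC1.pdf",
    "https://example.com/x/DOC1.pdf",
    "https://example.com/x/DOC2.PDF?download=1",
    "https://example.com/x/DOC3.avi",
]
counts = count_by_extension(sample)
print(counts)

# Sanity check on the table from the post: the per-type counts should add up
# to the number of unique URLs.
table = {"ts": 1, "mov": 8, "mp4": 236, "pdf": 244326, "m4a": 73, "vob": 1,
         "docx": 1, "doc": 1, "m4v": 9, "avi": 1422, "wmv": 1}
total = sum(table.values())
print(total)
```

The table total comes out to 246079, the same as the unique URL count.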
I finally got my hands on the original DS9 OPT file and have started downloading files from it. I don't know how long it will take. I also made a git repo with stats and index files from the DOJ website and the OPT from the archive: https://github.com/ArzymKoteyko/JEDatasets In short, the only difference is that I got an additional 1753 links to video files and a strange .docx file with a size of 0 bytes [EFTA00335487.docx].
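If you want to repeat the OPT vs website comparison yourself, a sketch like this should be close. It rests on assumptions I have not stated above: that the OPT is an Opticon-style load file (comma-separated, one line per page, first field is the page ID, fourth field is `Y` on the first page of a document) and that the file name in each URL is the ID of the document's first page. The helper names and the sample rows are invented for illustration.

```python
import csv
from pathlib import PurePosixPath
from urllib.parse import urlparse


def opt_document_ids(lines):
    """Collect IDs of rows flagged as the first page of a document (assumed Opticon layout)."""
    ids = set()
    for row in csv.reader(lines):
        if len(row) >= 4 and row[3].strip().upper() == "Y":
            ids.add(row[0].strip())
    return ids


def url_ids(urls):
    """Take the file name without its extension from each URL path."""
    return {PurePosixPath(urlparse(u).path).stem for u in urls}


# Hypothetical OPT rows and URLs, not real data.
opt_lines = [
    "DOC0000001,VOL001,IMAGES\\0001\\DOC0000001.tif,Y,,,2",
    "DOC0000002,VOL001,IMAGES\\0001\\DOC0000002.tif,,,,",
    "DOC0000003,VOL001,IMAGES\\0001\\DOC0000003.tif,Y,,,1",
]
urls = [
    "https://example.com/files/DOC0000001.pdf",
    "https://example.com/files/DOC0000009.mp4",
]

only_in_opt = opt_document_ids(opt_lines) - url_ids(urls)
only_on_site = url_ids(urls) - opt_document_ids(opt_lines)
print(sorted(only_in_opt), sorted(only_on_site))
```

With a real OPT you would pass an open file object instead of the list; `csv.reader` accepts either.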