• 6 Posts
  • 18 Comments
Joined 1 year ago
Cake day: June 17th, 2023

  • Personally I think child processes are the right approach for this. Launch a new process for each query, and it can (if you choose to go that route) dynamically load in compiled code. Exit when you’re done, and the dynamically loaded code is gone. A side benefit is that memory leaks are contained, since everything the process allocated is freed when it exits.
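
    A minimal sketch of the process-per-query idea, assuming (purely for illustration) a Node.js host, a generated module emitted by a query compiler, and a made-up run() export:

    ```typescript
    // parent.ts — one throwaway child process per query. When the child
    // exits, the dynamically loaded code and anything it leaked are gone.
    import { fork } from "node:child_process";

    export function runQuery(compiledModulePath: string, params: unknown): Promise<unknown> {
      return new Promise((resolve, reject) => {
        const worker = fork("./query-worker.js");
        worker.once("message", resolve);
        worker.once("error", reject);
        worker.send({ compiledModulePath, params });
      });
    }

    // query-worker.ts (built to query-worker.js) — loads the per-query
    // generated module only inside this short-lived process.
    process.once("message", async (raw) => {
      const msg = raw as { compiledModulePath: string; params: unknown };
      const mod = await import(msg.compiledModulePath);
      process.send!(await mod.run(msg.params)); // run() is hypothetical
      process.exit(0); // exiting unloads the code and frees all memory
    });
    ```

    Crashes, infinite loops, and runaway allocations can then be handled from the parent with timeouts and worker.kill(), without taking the main service down.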

    I’d probably be fine with hundreds or thousands of these hanging around in memory. I suspect the generated code for a single query would be in the hundreds of kilobytes, maybe a megabyte. But yeah, this is one of those technical details I’d worry about.

    Honestly, I wonder if you could just use an actual HTTP server for this? They can handle hundreds or even thousands of simultaneous requests. They can handle requests that complete in a fraction of a millisecond or ones that run for several hours. And they have good tools to catch and deal with code that segfaults, hits an infinite loop, attempts to allocate terabytes of swap, etc. HTTP also has wonderful tooling to load balance across multiple servers if you do need to scale to massive numbers of requests.
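
    To make that concrete: a hedged sketch of a plain HTTP front end dispatching each request to one of those short-lived query workers (it reuses the hypothetical runQuery() helper from the sketch above; the port and module path are made up):

    ```typescript
    // server.ts — ordinary HTTP server, one child process per request.
    import { createServer } from "node:http";
    import { runQuery } from "./parent.js";

    const server = createServer(async (req, res) => {
      try {
        // Long-running queries simply hold their connection open until done.
        const result = await runQuery("./generated/query.js", { url: req.url });
        res.writeHead(200, { "content-type": "application/json" });
        res.end(JSON.stringify(result));
      } catch (err) {
        // A worker that segfaulted, timed out, or was killed ends up here
        // instead of taking down the whole service.
        res.writeHead(500);
        res.end(String(err));
      }
    });

    server.listen(8080);
    ```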

    Not sure how an HTTP server would solve the CPU bottleneck of scanning terabytes of data per query?








  • The kernel is not a monolithic application, and you cannot develop it like one. There are tons of actors: independent developers, small support companies (like Collabora), and corporations, all with different priorities. There is a large number of independent forks (e.g. for obscure devices) that will never be merged back, but still need to pull in, say, security patches from mainline. A single project-management tool won’t do, and certainly not your typical business-grade tracking-and-reporting tool.

    CI is already there. Not a central one—again, it is distributed across different organizations. Different organizations have different needs for CI, e.g. supporting the unusual architectures they have to develop against.

    There is a reason Torvalds created git—existing tools just wouldn’t work. There might be a place for a similar revolution regarding a bugtracker…






  • Another idea that just occurred to me: maybe position: absolute; both the real content and the gibberish content with the same top, left, width, and height values, so that the real content and the gibberish overlap and occupy the same location on the page. Make sure neither the real nor the gibberish element has a background, so both stay transparent. Put the gibberish content in the DOM before the real content (I think that will ensure the gibberish renders behind the real content even without setting a z-index). Then have JS set the text color of the gibberish element to the same color as the background so humans can’t see it.
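
    A rough sketch of that idea (element ids, sizes, and the decoy source are all made up; adjust to the actual markup):

    ```typescript
    // Overlay trick: decoy text behind the real content, hidden from humans
    // by matching its color to the page background. The decoy element is
    // assumed to come before the real one in the DOM.
    const real = document.getElementById("real-content")!;
    const decoy = document.getElementById("gibberish-content")!;

    for (const el of [real, decoy]) {
      // Identical position and size, so the two blocks overlap exactly.
      el.style.position = "absolute";
      el.style.top = "0";
      el.style.left = "0";
      el.style.width = "40em";
      el.style.height = "30em";
      el.style.background = "transparent"; // no background on either element
    }

    // Humans see only the real text; a naive scraper still reads the decoy.
    decoy.style.color = getComputedStyle(document.body).backgroundColor;
    ```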

    Be aware that these techniques can affect accessibility for people using screen readers.




  • liori@lemm.ee to datahoarder@lemmy.ml · Cloud Backup

    Yep, it’s EU. File transfer shouldn’t be bad if your files are large, though it’s best to test it first—it may depend on your ISP’s peering and your preferred transfer protocols/tooling. Whether it’s reputable enough for your purpose, you’ll probably have to do your own research. Also, remember that the offer I mentioned would only be equivalent in durability to a single-box RAID5, so not exactly equivalent to Google’s.


  • liori@lemm.ee to datahoarder@lemmy.ml · Cloud Backup

    There’s Jottacloud with unlimited storage for 10 EUR/month, but they gradually slow down after the first 5 TB, so 30 TB might be a bit too much. There’s Hetzner with their dedicated 4×10 TB machines for ~52 EUR/month; you could do RAID5 and have somewhat redundant 30 TB, at the cost of self-managing a dedicated machine. There are several providers doing regular S3 (which you can take advantage of with tools like rclone) with decent redundancy for 4-5 USD/TB + egress—at those rates, 30 TB comes to roughly 120-150 USD/month before egress. For high-value data you should probably be spending more than 100 USD/month for 30 TB in the cloud, or invest in actual hardware. Do you need hot access to this dataset, or is a cold storage archive enough?



  • One reason (among many) is that employment at American companies is less stable than in European countries with strong employment laws. Twitter could not have done the same kind of layoffs in Europe, and stories like this one are pretty common. But this safety net has a cost, and that cost is part of the total cost of employment for employers. Whether the safety net is worth it for employees in IT is another matter—but since it’s mandated by law, it can’t simply be left out of the equation.

    BTW, in some European countries there is a strong culture of IT workers doing long-term contract work precisely to trade employment-law protections for (usually considerably) higher pay.


  • Given these criteria, ggplot2 wins by a landslide. The API, thanks to R’s nonstandard evaluation feature, is crazy good compared to whatever is available in Python. Not having to use numpy/pandas as inputs is a bonus as well; somehow pandas managed to duplicate many bad features of R’s data frames and introduce its own inconsistencies, without providing many of the good features¹. Styling defaults are decent, definitely much better than matplotlib’s, and it’s much easier to apply custom styling consistently. The future of ggplot2 is defined by its downstream libraries; ggplot2 itself is just the core of the ecosystem, and at this point it is mature and stable. Matplotlib’s higher development activity is mostly because the lack of nonstandard evaluation makes flexible APIs more cumbersome to implement, so everything simply takes more work. Both have very minimal support for interactive and web use; it’s easier to wrap them with shiny/Dash than to force them alone to do web/interactive stuff. Which, btw, is again a case where I’d say shiny » Dash, if for nothing else than R’s nonstandard evaluation feature.

    Note though that learning proper R takes time, and if you don’t know it yet, you will underestimate the time needed to get comfortable with it. Nonstandard evaluation alone is so nonstandard that it gives headaches even to otherwise skilled programmers. matplotlib would hugely win on flexibility, which you apparently don’t need—but there’s always that one tiny tweak you’ll wish you could do. Also, it’s usually much easier to use the default of whatever publishing platform you’re going to use.

    As for me, if I have the choice, I pick ggplot2 by default. So far it has been good enough for the significant majority of my academic and professional work.

    ¹ Admittedly, numpy was not designed for data analysis directly, and pandas has some nice features missing from R’s data frames.




  • We found that flakiness in e2e tests is usually caused by using libraries, frameworks, and devops tools that were not designed to be integrated into e2e tests. So we try to get rid of them, or at least wrap them in devops magic. This requires a skilled devops team and buy-in from management.

    At some point we were also handling the issue with dedicated human review of e2e failures; it’s easy to train a junior QA engineer to quickly retry most false positives.

    But we would never give up on e2e.