嫩草官网 啦啦啦免费观看视频6 国自产精品手机在线视频 影视大全 两个人的免费视频完整版 嫩草视频 骚虎视频入口 国自产精品手机在线视频 好大哥dgdg 四虎影院 影视大全 骚虎视频入口 四虎影院 影视大全 国自产精品手机在线视频 嫩草视频 嫩草视频 嫩草视频 两个人的免费视频完整版 国自产精品手机在线视频 好大哥dgdg 国自产精品手机在线视频 影视大全高清版 骚虎官网 四虎影院 野花在线观看免费视频 骚虎官网 影视大全 影视大全免费追剧 24小时日本在线视频 国自产精品手机在线视频 啦啦啦免费观看视频6 两个人的免费视频完整版 啦啦啦免费观看视频6 国自产精品手机在线视频 24小时日本在线视频

New data science platform speeds up Python queries

Data queries written in Python, a commonly used programming language, can grind data analytics platforms to a crawl, but a new platform developed by researchers from Brown and MIT may finally solve the Python efficiency problem.

PROVIDENCE, R.I. [Brown University] — Researchers from Brown University and MIT have developed a new data science framework that allows users to process data with the programming language Python — without paying the “performance tax” normally associated with a user-friendly language.

The new framework, called Tuplex, is able to process data queries written in Python up to 90 times faster than industry-standard data systems like Apache Spark or Dask. The research team unveiled the system in research presented at SIGMOD 2021, a premier data processing conference, and have made the software freely available to all.?

“Python is the primary programming language used by people doing data science,” said Malte Schwarzkopf, an assistant professor of computer science at Brown and one of the developers of Tuplex. “That makes a lot of sense. Python is widely taught in universities, and it’s an easy language to get started with. But when it comes to data science, there’s a huge performance tax associated with Python because platforms can’t process Python efficiently on the back end.”

Platforms like Spark perform data analytics by distributing tasks across multiple processor cores or machines in a data center. That parallel processing allows users to deal with giant data sets that would choke a single computer to death. Users interact with these platforms by inputting their own queries, which contain custom logic written as “user-defined functions” or UDFs. UDFs specify custom logic, like extracting the number of bedrooms from the text of a real estate listing for a query that searches all of the real estate listings in the U.S. and selects all the ones with three bedrooms.?

Because of its simplicity, Python is the language of choice for creating UDFs in the data science community. In fact, the Tuplex team cites a recent poll showing that 66% of data platform users utilize Python as their primary language. The problem is that analytics platforms have trouble dealing with those bits of Python code efficiently.?

Data platforms are written in high-level computer languages that are compiled before running. Compilers are programs that take computer language and turn it into machine code — sets of instructions that a computer processor can quickly execute. Python, however, is not compiled beforehand. Instead, computers interpret Python code line by line while the program runs, which can mean far slower performance.?

“These frameworks have to break out of their efficient execution of compiled code and jump into a Python interpreter to execute Python UDFs,” Schwarzkopf said. “That process can be a factor of 100 less efficient than executing compiled code.”

If Python code could be compiled, it would speed things up greatly. But researchers have tried for years to develop a general-purpose Python compiler, Schwarzkopf says, with little success. So instead of trying to make a general Python compiler, the researchers designed Tuplex to compile a highly specialized program for the specific query and common-case input data. Uncommon input data, which account for only a small percentage of instances, are separated out and referred to an interpreter.?

“We refer to this process as dual-case processing, as it splits that data into two cases,” said Leonhard Spiegelberg, co-author of the research describing Tuplex. “This allows us to simplify the compilation problem as we only need to care about a single set of data types and common-case assumptions. This way, you get the best of two worlds: high productivity and fast execution speed.”

And the runtime benefit can be substantial.?

“We show in our research that a wait time of 10 minutes for an output can be reduced to a second,” Schwarzkopf said. “So it really is a substantial improvement in performance.”

In addition to speeding things up, Tuplex also has an innovative way of dealing with anomalous data, the researchers say. Large datasets are often messy, full of corrupted records or data fields that don’t follow convention. In real estate data, for example, the number of bedrooms could either be a numeral or a spelled-out number. Inconsistencies like that can be enough to crash some data platforms. But Tuplex extracts those anomalies and sets them aside to avoid a crash. Once the program has run, the user then has the option of repairing those anomalies.?

“We think this could have a major productivity impact for data scientists,” Schwarzkopf said. “To not have to run out to get a cup of coffee while waiting for an output, and to not have a program run for an hour only to crash before it’s done would be a really big deal.”

The research was supported by the National Science Foundation (DGE-2039354, IIS-1453171) and the U.S. Air Force (FA8750-19-2-1000).?

Tags Research
西瓜影视_西瓜影音_西瓜电影电视剧在线观看-西瓜网-西瓜影院 骚虎视频官方-房事视频免费看 四虎视频官方-女人本色高清视频 午夜视频官方-24小时日本在线视频 野花在线官方-影视大全 久草视频入口-嫩草视频官网 小草视频入口-两个人观看野花免费视频 嫩草视频入口-大片免费视频 好大哥视频-好大哥dgdg