Apache Spark - Filter rows by distinct values in one column in PySpark

Suppose we have the following table:

+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|    tia1.eskimo.com |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|      ras38.srv.net |/elv/delta/uncons...|   404|           0|1995-08-01 01:05:...|
| cs1-06.leh.ptd.net |                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+

How can I filter this table in PySpark so that it keeps only rows with distinct paths? The result should still contain all of the columns.

If you want to keep only the rows whose values in a specific column are distinct, call the dropDuplicates method on the DataFrame. For example:

dataframe = ...
dataframe.dropDuplicates(['path'])

where path is the name of the column to deduplicate on. All other columns are kept in the result.
