9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] csv files -> embarrasing
@ 2006-04-28 14:36 Steve Simon
  2006-04-28 14:43 ` quanstro
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Steve Simon @ 2006-04-28 14:36 UTC (permalink / raw)
  To: 9fans

Ok, I have spent half an hour trying to parse CSV files
and it's getting embarrassing. I could do it in C, but I should
be able to use rc + sed + awk.

The problem is that some of my CSV files' fields contain whitespace
and thus have double quotes around them.

I thought rc knew about %q-quoted strings so I could use it to
do my parsing, but it fails. Can this be done, or is C the answer?
Seems a shame to resort to sledgehammers.

-Steve

cpu% cat file.csv
a,b,"c,d,e",f,g
p,q,r,s,t

cpu%
cpu% cat extract
#!/bin/rc

sed 's/"([^"]*)"/''\1''/g; s/,/ /g' $* |
	while (s=`{read})
		echo $s(1) $s(3) $s(4)


cpu% extract file.csv
a 'c d
p r s




* Re: [9fans] csv files -> embarrasing
  2006-04-28 14:36 [9fans] csv files -> embarrasing Steve Simon
@ 2006-04-28 14:43 ` quanstro
  2006-04-28 16:29 ` Russ Cox
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: quanstro @ 2006-04-28 14:43 UTC (permalink / raw)
  To: 9fans

i had this problem years ago processing original yahoo categories.
i think the solution is to use regexps to do a "lifting". first
translate the commas within "..." to, say, "☺".  then do your csv
operation and finally translate the "☺" back to ",".

this is totally untested, but something like this:

# should be f* and f*' 

fn fstar {
	{
		echo 'X ,x:"[^"]*": s:,:☺:g'
		echo ,p
	} | sam -d $*
}

fn fstartick {
	{
		echo 'X ,x:"[^"]*": s:☺:,:g'
		echo ,p
	} | sam -d $*
}

for(i in $files)
	fstar $i | extract | fstartick /fd/0
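
For readers without sam to hand, the same lifting idea can be sketched
in Python. This is an illustration, not a port of the script above: the
names lift, lower and extract are made up here, and it assumes quoted
fields contain no escaped quotes and that ☺ never occurs in the data.

```python
import re

PLACEHOLDER = "\u263a"  # the smiley; assumed never to occur in the data

def lift(line):
    """Replace commas inside double-quoted fields with the placeholder."""
    return re.sub(r'"[^"]*"',
                  lambda m: m.group(0).replace(",", PLACEHOLDER),
                  line)

def lower(field):
    """Undo lift() on a single field and drop the surrounding quotes."""
    return field.strip('"').replace(PLACEHOLDER, ",")

def extract(line, wanted=(0, 2, 3)):
    """Pick fields by 0-based index (fields 1, 3 and 4 in rc numbering)."""
    fields = lift(line).split(",")  # safe: only separator commas remain
    return [lower(fields[i]) for i in wanted]

if __name__ == "__main__":
    for line in ['a,b,"c,d,e",f,g', "p,q,r,s,t"]:
        print(" ".join(extract(line)))
```

Once the commas inside quotes are lifted out of the way, a plain split
on "," is enough; the placeholder is turned back only after the fields
have been picked.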

- erik



* Re: [9fans] csv files -> embarrasing
  2006-04-28 14:36 [9fans] csv files -> embarrasing Steve Simon
  2006-04-28 14:43 ` quanstro
@ 2006-04-28 16:29 ` Russ Cox
  2006-04-28 18:28 ` lucio
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Russ Cox @ 2006-04-28 16:29 UTC (permalink / raw)
  To: 9fans

You can change CSV to tab separated in sam/acme with

,y/[^,]+|"[^"]+"/ s/,/[tab]/g

Replace [tab] with an actual tab.
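
Outside sam/acme, the same conversion can be approximated in a few
lines of Python. A sketch only: csv_to_tsv is a made-up name, and it
assumes quoted fields contain no escaped quotes and no tabs.

```python
import re

def csv_to_tsv(line):
    """Turn separator commas into tabs, leaving quoted fields untouched."""
    def repl(m):
        text = m.group(0)
        # A bare comma is a separator; anything else is a quoted field.
        return "\t" if text == "," else text
    # Match either a whole quoted field (copied through) or a single comma.
    return re.sub(r'"[^"]*"|,', repl, line)

if __name__ == "__main__":
    print(csv_to_tsv('a,b,"c,d,e",f,g'))
```

The quoted fields keep their double quotes, so a later step still has
to strip them if they are not wanted.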

Russ




* Re: [9fans] csv files -> embarrasing
  2006-04-28 14:36 [9fans] csv files -> embarrasing Steve Simon
  2006-04-28 14:43 ` quanstro
  2006-04-28 16:29 ` Russ Cox
@ 2006-04-28 18:28 ` lucio
  2006-04-30 10:36   ` matt
  2006-06-01 13:53 ` Victor Nazarov
  2006-06-02  3:09 ` Rogelio Serrano
  4 siblings, 1 reply; 9+ messages in thread
From: lucio @ 2006-04-28 18:28 UTC (permalink / raw)
  To: 9fans

> Ok, I have spent half an hour trying to parse CSV files
> and it's getting embarrassing. I could do it in C, but I should
> be able to use rc + sed + awk.

A reliable CSV reader would be very useful.  I discovered that CSV can
be awkward, but I didn't think that was enough justification for never
having found any such tool.

Is parsing CSV really so difficult?

++L




* Re: [9fans] csv files -> embarrasing
  2006-04-28 18:28 ` lucio
@ 2006-04-30 10:36   ` matt
  0 siblings, 0 replies; 9+ messages in thread
From: matt @ 2006-04-30 10:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

 > Is parsing CSV really so difficult?

depends on your CSV

1, "2", "3""3","4, 4", "5\",5"

which I interpret as (\n separated) :
1
2
3"3
4, 4
5",5
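
That interpretation can be checked against an existing parser. A
sketch using Python's standard csv module; the three options are my
guess at the dialect the line above needs, not anything Excel
documents.

```python
import csv

# The sample line from above, as a raw string so the backslash survives.
line = r'1, "2", "3""3","4, 4", "5\",5"'

reader = csv.reader(
    [line],
    skipinitialspace=True,  # ignore the blank after each comma
    doublequote=True,       # "" inside quotes means one "
    escapechar="\\",        # \" inside quotes also means one "
)
fields = next(reader)

if __name__ == "__main__":
    for f in fields:
        print(f)
```

With these settings the reader yields the five fields listed above.
Without skipinitialspace, the quotes after ", " are not treated as
field delimiters and the result differs.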


If you've ever worked with Excel CSV you know what CSV hell is like






* Re: [9fans] csv files -> embarrasing
  2006-04-28 14:36 [9fans] csv files -> embarrasing Steve Simon
                   ` (2 preceding siblings ...)
  2006-04-28 18:28 ` lucio
@ 2006-06-01 13:53 ` Victor Nazarov
  2006-06-01 16:06   ` rog
  2006-06-02  3:09 ` Rogelio Serrano
  4 siblings, 1 reply; 9+ messages in thread
From: Victor Nazarov @ 2006-06-01 13:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Steve Simon wrote:

>Ok, I have spent half an hour trying to parse CSV files
>and it's getting embarrassing. I could do it in C, but I should
>be able to use rc + sed + awk.
>
>The problem is that some of my CSV files' fields contain whitespace
>and thus have double quotes around them.
>
>I thought rc knew about %q-quoted strings so I could use it to
>do my parsing, but it fails. Can this be done, or is C the answer?
>Seems a shame to resort to sledgehammers.
>
>-Steve
>
>cpu% cat file.csv
>a,b,"c,d,e",f,g
>p,q,r,s,t
>
>cpu%
>cpu% cat extract
>#!/bin/rc
>
>sed 's/"([^"]*)"/''\1''/g; s/,/ /g' $* |
>	while (s=`{read})
>		echo $s(1) $s(3) $s(4)
>
>
>cpu% extract file.csv
>a 'c d
>p r s
>
>
>  
>
I thought about this case today. On native Plan 9 the solution is quite
easy: programs share the environment, so this is the answer:

sed 's/"([^"]*)"/''\1''/g; s/,/ /g' $* |
	while (s=`{read}) {
		echo 's=('$"s')' | rc
		echo $s(1) $s(3) $s(4)
	}






* Re: [9fans] csv files -> embarrasing
  2006-06-01 13:53 ` Victor Nazarov
@ 2006-06-01 16:06   ` rog
  0 siblings, 0 replies; 9+ messages in thread
From: rog @ 2006-06-01 16:06 UTC (permalink / raw)
  To: 9fans

> sed 's/"([^"]*)"/''\1''/g; s/,/ /g' $* |
> 	while (s=`{read}) {
> 		echo 's=('$"s')' | rc
> 		echo $s(1) $s(3) $s(4)
>	}

unfortunately this doesn't work, for quite a few reasons.
1) the sed script doesn't deal with quoted double-quotes.
2) nor does it deal with single quotes.
3) the `{read} idiom tokenizes s, ignoring multiple spaces, so
that information is lost when putting them back together with $"s.
4) the environment might be shared, but rc caches environment
variables, so the value of $s when passed to the second
echo is the same as that before the rc invocation.

i've also encountered newlines in values in csv files,
which won't help matters.
assuming that there aren't any newlines, an approximation
to a solution might be:

sed -e 's/''/''''/g' -e 's/"(([^"]|"")*)"/''\1''/g' -e 's/""/"/g' -e 's/.*/s=(&)/' |
	ifs='' while(e=`{read}){
		eval $e
		echo $s(1) $s(3) $s(4)
	}

i really don't like using eval in this way though. if you get it wrong,
you've got a nasty loophole.



* Re: [9fans] csv files -> embarrasing
  2006-04-28 14:36 [9fans] csv files -> embarrasing Steve Simon
                   ` (3 preceding siblings ...)
  2006-06-01 13:53 ` Victor Nazarov
@ 2006-06-02  3:09 ` Rogelio Serrano
  4 siblings, 0 replies; 9+ messages in thread
From: Rogelio Serrano @ 2006-06-02  3:09 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 4/28/06, Steve Simon <steve@quintile.net> wrote:
> Ok, I have spent half an hour trying to parse CSV files
> and it's getting embarrassing. I could do it in C, but I should
> be able to use rc + sed + awk.
>
> The problem is that some of my CSV files' fields contain whitespace
> and thus have double quotes around them.
>
> I thought rc knew about %q-quoted strings so I could use it to
> do my parsing, but it fails. Can this be done, or is C the answer?
> Seems a shame to resort to sledgehammers.
>
> -Steve
>
> cpu% cat file.csv
> a,b,"c,d,e",f,g
> p,q,r,s,t
>
> cpu%
> cpu% cat extract
> #!/bin/rc
>
> sed 's/"([^"]*)"/''\1''/g; s/,/ /g' $* |
>         while (s=`{read})
>                 echo $s(1) $s(3) $s(4)
>
>
> cpu% extract file.csv
> a 'c d
> p r s
>
>

I deal with CSV files every day at work and I just use custom plain C
parsers. Reinventing the wheel and all.
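
The core of such a hand-written parser is a small state machine. A
minimal sketch, in Python rather than C to keep it short:
split_csv_line is a made-up name, a doubled quote inside a quoted field
is taken as a literal quote, and fields are assumed not to contain
newlines.

```python
def split_csv_line(line):
    """Split one CSV line into fields; a doubled quote inside quotes is literal."""
    fields, cur = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    cur.append('"')    # doubled quote: keep one, skip the other
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                cur.append(c)          # commas in here are ordinary data
        elif c == '"':
            in_quotes = True           # opening quote
        elif c == ",":
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(c)
        i += 1
    fields.append("".join(cur))        # the last field has no trailing comma
    return fields

if __name__ == "__main__":
    print(split_csv_line('a,b,"c,d,e",f,g'))
```

The single in_quotes flag is what the sed one-liners lack: whether a
comma separates fields depends on state carried from earlier in the
line.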

-- 
www.smsglobal.net SMS Global Ltd Short Message Service For Seafarers



* Re: [9fans] csv files -> embarrasing
@ 2006-05-01  0:34 erik quanstrom
  0 siblings, 0 replies; 9+ messages in thread
From: erik quanstrom @ 2006-05-01  0:34 UTC (permalink / raw)
  To: 9fans

looks slightly more painful than the previous case, but i think that
using my original trick, one can continue to add elements to the
lifting pipeline and move character-stuffed " and \" out of the way, too.

and didn't i learn that trick well from parsing data from yahoo categories.

- erik

On Sun Apr 30 05:38:13 CDT 2006, mattmobile@proweb.co.uk wrote:
>  > Is parsing CSV really so difficult?
>
> depends on your CSV
>
> 1, "2", "3""3","4, 4", "5\",5"
>
> which I interpret as (\n separated) :
> 1
> 2
> 3"3
> 4, 4
> 5",5
>
>
> If you've ever worked with Excel CSV you know what CSV hell is like


