caml-list - the Caml user's mailing list
* [Caml-list] Create Array of floats from string
@ 2017-04-26 10:48 Jon Kleiser
  2017-04-26 11:02 ` rixed
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jon Kleiser @ 2017-04-26 10:48 UTC (permalink / raw)
  To: caml-list

Hi,

I am quite new to OCaml, and I am looking for the most efficient way to make an Array of floats from a string. My solution so far looks like this, where dims is a global variable specifying the size of the Arrays (typically 300):

let make_vector vec_strings =
  let vec = Array.make !dims 0.0 in
  List.iteri (fun i str -> vec.(i) <- float_of_string str) vec_strings;
  vec	(* return the filled array; without this the function returns unit *)

let process_line line =
  let parts = Str.split (Str.regexp " ") line in
  make_vector (List.tl parts)	(* skipping first element which is not a float *)

/Jon


* Re: [Caml-list] Create Array of floats from string
  2017-04-26 10:48 [Caml-list] Create Array of floats from string Jon Kleiser
@ 2017-04-26 11:02 ` rixed
  2017-04-26 13:36   ` Francois BERENGER
       [not found] ` <CAPFanBGh0q2AaF7ROWJJF81o=8+79sn-q4-CxqCKGQ__Oa5SEw@mail.gmail.com>
  2017-04-26 15:27 ` [Caml-list] Create Array of floats from string Alain Frisch
  2 siblings, 1 reply; 10+ messages in thread
From: rixed @ 2017-04-26 11:02 UTC (permalink / raw)
  To: Jon Kleiser; +Cc: caml-list

If speed is more important than readability I would avoid creating the
intermediary list of strings and go with Str.search_forward, then extract that
string and convert it to float.

If the separator is as simple as a space I'd expect String.index_from to be
slightly faster. Also consider BatString.find_all that returns an enum (of
starting positions - unfortunately we do not have BatString.split variant
returning an enum of substrings directly).
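
A minimal sketch of the String.index_from idea, assuming a single space as the separator and a leading word to skip. The name floats_of_line and the ~dims parameter are made up for illustration. It avoids the intermediate list, but it still builds one substring per float:

```ocaml
(* Hypothetical helper: parse the space-separated floats that follow the
   first word of [line] into an array of length [dims]. Slots that are not
   covered by the input keep the value 0.0. *)
let floats_of_line ~dims line =
  let len = String.length line in
  let vec = Array.make dims 0.0 in
  (* Start just after the first space, i.e. past the leading word. *)
  let pos = ref (try String.index line ' ' + 1 with Not_found -> len) in
  let i = ref 0 in
  while !i < dims && !pos < len do
    (* End of the current token: next space, or end of line. *)
    let stop = try String.index_from line !pos ' ' with Not_found -> len in
    (* Empty tokens (consecutive or trailing spaces) are skipped. *)
    if stop > !pos then begin
      vec.(!i) <- float_of_string (String.sub line !pos (stop - !pos));
      incr i
    end;
    pos := stop + 1
  done;
  vec
```

Whether this is actually faster than splitting into a list would have to be measured on the real input.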

If that's not enough you'd have to use another library to parse the string, one
that would implement something like `float_of_string_from` so that you do not
have to build all those useless substrings.



* Re: [Caml-list] Create Array of floats from string
  2017-04-26 11:02 ` rixed
@ 2017-04-26 13:36   ` Francois BERENGER
  0 siblings, 0 replies; 10+ messages in thread
From: Francois BERENGER @ 2017-04-26 13:36 UTC (permalink / raw)
  To: OCaml Mailing List

On 04/26/2017 06:02 AM, rixed@happyleptic.org wrote:
> If speed is more important than readability I would avoid creating the
> intermediary list of strings and go with Str.search_forward, then extract that
> string and convert it to float.
>
> If the separator is as simple as a space I'd expect String.index_from to be
> slightly faster. Also consider BatString.find_all that returns an enum (of
> starting positions - unfortunately we do not have BatString.split variant
> returning an enum of substrings directly).

Could be welcome into batteries.

> If that's not enough you'd have to use another library to parse the string, one
> that would implement something like `float_of_string_from` so that you do not
> have to build all those useless substrings.

This one too, BatFloat.of_string_from
but we need a little bit more specification to know what the expected
behavior is (especially, where does the float part of the string end?).

You can go there for any feature request:
https://github.com/ocaml-batteries-team/batteries-included/issues

Regards,
F.


* Re: [Caml-list] Create Array of floats from string
       [not found] ` <CAPFanBGh0q2AaF7ROWJJF81o=8+79sn-q4-CxqCKGQ__Oa5SEw@mail.gmail.com>
@ 2017-04-26 14:05   ` Jon Kleiser
  2017-04-26 15:26     ` Gabriel Scherer
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Kleiser @ 2017-04-26 14:05 UTC (permalink / raw)
  To: caml-list

Thanks a lot, Gabriel, for your idea about using the “word by word” method. So far I have used the Stream way of file reading:

let line_stream_of_channel channel =
  Stream.from
    (fun _ -> try Some (input_line channel) with End_of_file -> None)

Can this Stream reading make use of scanf to read floats (and other words)? If not, I may abandon the Stream approach.

I would also like to have access to the current number of lines received, to be able to report that so-and-so was found at line number x. So far I have not found out how to count the lines while reading from a Stream.
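
One way to get line numbers, sketched here without Stream and only as an illustration, is to keep a counter next to input_line. The name iter_numbered_lines is hypothetical:

```ocaml
(* Hypothetical helper: call [f lineno line] for every line read from
   [channel]; line numbers start at 1. *)
let iter_numbered_lines channel f =
  let lineno = ref 0 in
  try
    while true do
      let line = input_line channel in
      incr lineno;
      f !lineno line
    done
  with End_of_file -> ()
```

A caller could then report a match with something like
iter_numbered_lines ic (fun n line -> if line = "" then Printf.printf "empty line at %d\n" n).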

/Jon


> On 26. Apr, 2017, at 15:41, Gabriel Scherer <gabriel.scherer@gmail.com> wrote:
> 
> It looks like you read a line from an input channel and now want to split it on its spaces. It is also possible to read the input channel word by word in the first place, and for this the semantics of spaces in a scanf format is very useful: a single space ignores all whitespace. So
> 
> let read_float () =
>   Scanf.scanf " %f" (fun x -> x)
> 
> will ignore any whitespace and then expect a floating-point number, read it and return it. (This reads from standard input, to read from arbitrary channels see Scanf.bscanf and the Scanf.Scanning module).
> 

* Re: [Caml-list] Create Array of floats from string
  2017-04-26 14:05   ` Jon Kleiser
@ 2017-04-26 15:26     ` Gabriel Scherer
  2017-04-27 14:00       ` [Caml-list] Create Array of floats from string, surprise Jon Kleiser
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Scherer @ 2017-04-26 15:26 UTC (permalink / raw)
  To: Jon Kleiser; +Cc: caml-list

> Can this Stream reading make use of the scanf to read floats (and other words)?

Not really (although you can make do with Scanf.Scanning.from_function :
(unit -> char) -> Scanning.in_channel;
https://caml.inria.fr/pub/docs/manual-ocaml/libref/Scanf.Scanning.html ).

If counting the line number is important to you, it makes sense to keep
using input_line, instead of scanning " %f" directly on the channel (as
this may skip arbitrarily many newlines) but then you can still use it to
scan each line as a string:

  let line = input_line channel in
  let scanbuf = Scanf.Scanning.from_string line in
  incr line_number;
  Scanf.bscanf scanbuf "%s@ " ignore;
  let vec = Array.init !dims (fun _ -> Scanf.bscanf scanbuf " %f" (fun x -> x)) in
  ...

(the format "%s@c" means "scan a string up to, but excluding, the character
c", so "%s@ " consumes the first word.)
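
As a self-contained illustration of this per-line scanning, here is a small sketch that works on a string rather than a channel. The function name parse_line and the ~dims parameter are assumptions made for the example:

```ocaml
(* Sketch: parse "word f1 f2 ... fn" from one line with Scanf.
   Returns the leading word and an array of [dims] floats. *)
let parse_line ~dims line =
  let scanbuf = Scanf.Scanning.from_string line in
  (* "%s@ " reads up to the first space and skips it: the leading word. *)
  let word = Scanf.bscanf scanbuf "%s@ " (fun w -> w) in
  (* " %f" skips any whitespace, then reads one float. Array.init calls
     the function for indices 0, 1, ... in order, so the floats are read
     left to right. *)
  let vec =
    Array.init dims (fun _ -> Scanf.bscanf scanbuf " %f" (fun x -> x))
  in
  (word, vec)
```

If the line holds fewer than dims floats, Scanf raises an exception (Scanf.Scan_failure or End_of_file), so a real program would probably want to catch that and report the line number.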


* Re: [Caml-list] Create Array of floats from string
  2017-04-26 10:48 [Caml-list] Create Array of floats from string Jon Kleiser
  2017-04-26 11:02 ` rixed
       [not found] ` <CAPFanBGh0q2AaF7ROWJJF81o=8+79sn-q4-CxqCKGQ__Oa5SEw@mail.gmail.com>
@ 2017-04-26 15:27 ` Alain Frisch
  2017-04-27  8:36   ` Jon Kleiser
  2017-04-27  9:15   ` Jon Kleiser
  2 siblings, 2 replies; 10+ messages in thread
From: Alain Frisch @ 2017-04-26 15:27 UTC (permalink / raw)
  To: Jon Kleiser, caml-list

On 26/04/2017 12:48, Jon Kleiser wrote:
> let make_vector vec_strings =
>   let vec = Array.make !dims 0.0 in
>   List.iteri (fun i str -> vec.(i) <- float_of_string str) vec_strings
>
> let process_line line =
>   let parts = Str.split (Str.regexp " ") line in
>   make_vector (List.tl parts)	(* skipping first element which is not a float *)

Since OCaml 4.04, you have:

   let parts = String.split_on_char ' ' line in

-- Alain


* Re: [Caml-list] Create Array of floats from string
  2017-04-26 15:27 ` [Caml-list] Create Array of floats from string Alain Frisch
@ 2017-04-27  8:36   ` Jon Kleiser
  2017-04-27  9:15   ` Jon Kleiser
  1 sibling, 0 replies; 10+ messages in thread
From: Jon Kleiser @ 2017-04-27  8:36 UTC (permalink / raw)
  To: caml-list

Thanks, Alain. Maybe String.split_on_char is faster than Str.split (Str.regexp " "), but one has to be aware that while

# #load "str.cma";;
# Str.split (Str.regexp " ") " a b c ";;
- : string list = ["a"; "b"; "c"]

then

# String.split_on_char ' ' " a b c ";;
- : string list = [""; "a"; "b"; "c"; ""]

In my case the lines in the input file end with a space, so I will have to adjust my code to ignore that.
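
One possible adjustment, shown only as a sketch, is to drop the empty strings that String.split_on_char produces for leading, trailing, or repeated spaces. The name split_words is made up for the example:

```ocaml
(* Hypothetical helper: split on single spaces and discard empty fields. *)
let split_words line =
  String.split_on_char ' ' line |> List.filter (fun s -> s <> "")
```

For the input " a b c " this gives the same list as the Str.split call above. The extra List.filter pass has a cost of its own, so the speed-up over Str.split would need to be measured.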

/Jon


* Re: [Caml-list] Create Array of floats from string
  2017-04-26 15:27 ` [Caml-list] Create Array of floats from string Alain Frisch
  2017-04-27  8:36   ` Jon Kleiser
@ 2017-04-27  9:15   ` Jon Kleiser
  2017-04-28 12:19     ` Jon Kleiser
  1 sibling, 1 reply; 10+ messages in thread
From: Jon Kleiser @ 2017-04-27  9:15 UTC (permalink / raw)
  To: caml-list

I now adjusted my code to ignore the space at the end of each input line, and using
String.split_on_char ' ' line
instead of
Str.split (Str.regexp " ") line
speeds up my program from taking 25.8 secs to now only 17.4 secs. That’s quite a bit. Thanks, Alain!

/Jon


* Re: [Caml-list] Create Array of floats from string, surprise
  2017-04-26 15:26     ` Gabriel Scherer
@ 2017-04-27 14:00       ` Jon Kleiser
  0 siblings, 0 replies; 10+ messages in thread
From: Jon Kleiser @ 2017-04-27 14:00 UTC (permalink / raw)
  To: caml-list

Hi Gabriel,

I have now figured out how to use Scanf.bscanf as you suggest. The disappointment and big surprise, however, is that my program using ‘Scanf.bscanf’ is significantly slower than the earlier one based on ‘String.split_on_char’ and ‘List.iteri’: about 43.5 secs vs. 17.6 secs.
Thanks anyway. I feel I have learned quite a bit of OCaml by doing this.

/Jon



* Re: [Caml-list] Create Array of floats from string
  2017-04-27  9:15   ` Jon Kleiser
@ 2017-04-28 12:19     ` Jon Kleiser
  0 siblings, 0 replies; 10+ messages in thread
From: Jon Kleiser @ 2017-04-28 12:19 UTC (permalink / raw)
  To: caml-list

In case anybody wants to take a look, I have put my two program versions, the fast one and the slow one, here:

<http://folk.uio.no/jkleiser/ocaml/read_vec.ml>
<http://folk.uio.no/jkleiser/ocaml/scan_vec.ml>

The fast one, which uses ‘String.split_on_char’ and ‘List.iteri’, reads a 1.35 GB file in about 18 secs on my Mac, while the slower one, which uses ‘Scanf.bscanf’, reads the same file in about 43 secs.

If I have done something stupid that makes the slower one so slow, then I’d be glad to hear how to fix it, just to learn a bit more OCaml.

The file that I use as input is the wiki.no.vec that you can find here:
<https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.no.vec>

If you would like to play with other files in the same format, you can find them here:
<https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md>

/Jon
